1 00:00:01,550 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,310 Commons license. 3 00:00:05,310 --> 00:00:07,520 Your support will help MIT OpenCourseWare 4 00:00:07,520 --> 00:00:11,610 continue to offer high-quality educational resources for free. 5 00:00:11,610 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:18,140 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,140 --> 00:00:19,026 at ocw.mit.edu. 8 00:00:21,807 --> 00:00:23,390 JULIAN SHUN: Good afternoon, everyone. 9 00:00:23,390 --> 00:00:26,012 So let's get started. 10 00:00:26,012 --> 00:00:27,470 So today, we're going to be talking 11 00:00:27,470 --> 00:00:31,130 about races and parallelism. 12 00:00:31,130 --> 00:00:34,310 And you'll be doing a lot of parallel programming 13 00:00:34,310 --> 00:00:38,173 for the next homework assignment and project. 14 00:00:38,173 --> 00:00:40,340 One thing I want to point out is that it's important 15 00:00:40,340 --> 00:00:43,800 to meet with your MITPOSSE as soon as possible, 16 00:00:43,800 --> 00:00:46,310 if you haven't done so already, since that's 17 00:00:46,310 --> 00:00:49,430 going to be part of the evaluation for the Project 1 18 00:00:49,430 --> 00:00:50,400 grade. 19 00:00:50,400 --> 00:00:53,900 And if you have trouble reaching your MITPOSSE members, 20 00:00:53,900 --> 00:00:57,087 please contact your TA and also make a post on Piazza 21 00:00:57,087 --> 00:00:57,920 as soon as possible. 22 00:01:00,730 --> 00:01:05,209 So as a reminder, let's look at the basics of Cilk. 23 00:01:05,209 --> 00:01:09,350 So we have cilk_spawn and cilk_sync statements. 24 00:01:09,350 --> 00:01:12,080 In Cilk, this was the code that we 25 00:01:12,080 --> 00:01:14,690 saw in the last lecture, which computes the nth Fibonacci 26 00:01:14,690 --> 00:01:16,380 number. 27 00:01:16,380 --> 00:01:20,510 So when we say cilk_spawn, it means 28 00:01:20,510 --> 00:01:23,540 that the named child function, the function right 29 00:01:23,540 --> 00:01:26,450 after the cilk_spawn keyword, can execute in parallel 30 00:01:26,450 --> 00:01:28,010 with the parent caller. 31 00:01:28,010 --> 00:01:29,960 So it says that fib of n minus 1 can 32 00:01:29,960 --> 00:01:35,420 execute in parallel with the fib function that called it. 33 00:01:35,420 --> 00:01:39,620 And then cilk_sync says that control cannot pass this point 34 00:01:39,620 --> 00:01:42,870 until all of the spawned children have returned. 35 00:01:42,870 --> 00:01:45,920 So this is going to wait for fib of n minus 1 36 00:01:45,920 --> 00:01:53,240 to finish before it goes on and returns the sum of x and y. 37 00:01:53,240 --> 00:01:55,880 And recall that the Cilk keywords grant permission 38 00:01:55,880 --> 00:01:58,280 for parallel execution, but they don't actually 39 00:01:58,280 --> 00:01:59,660 force parallel execution. 40 00:01:59,660 --> 00:02:03,800 So this code here says that we can execute fib of n minus 1 41 00:02:03,800 --> 00:02:06,208 in parallel with this parent caller, 42 00:02:06,208 --> 00:02:08,000 but it doesn't say that we necessarily have 43 00:02:08,000 --> 00:02:10,310 to execute them in parallel. 44 00:02:10,310 --> 00:02:12,380 And it's up to the runtime system 45 00:02:12,380 --> 00:02:16,040 to decide whether these different functions will 46 00:02:16,040 --> 00:02:17,120 be executed in parallel. 47 00:02:17,120 --> 00:02:21,980 We'll talk more about the runtime system today.
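For reference, here is a minimal sketch of the Fibonacci code being discussed, assuming OpenCilk-style cilk_spawn and cilk_sync keywords as used in this course (the exact types and base case on the slide may differ slightly):

```c
#include <stdint.h>
#include <cilk/cilk.h>

/* Sketch of the parallel Fibonacci routine described above. */
int64_t fib(int64_t n) {
  if (n < 2) return n;               /* base case: fib(0) = 0, fib(1) = 1 */
  int64_t x = cilk_spawn fib(n - 1); /* the spawned child may run in parallel
                                        with the rest of this function */
  int64_t y = fib(n - 2);            /* the parent continues with the other call */
  cilk_sync;                         /* wait for the spawned child to return */
  return x + y;                      /* safe to combine the results now */
}
```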
48 00:02:21,980 --> 00:02:25,130 And also, we talked about this example, 49 00:02:25,130 --> 00:02:28,310 where we wanted to do an in-place matrix transpose. 50 00:02:28,310 --> 00:02:32,210 And this used the cilk_for keyword. 51 00:02:32,210 --> 00:02:34,100 And this says that we can execute 52 00:02:34,100 --> 00:02:39,260 the iterations of this cilk_for loop in parallel. 53 00:02:39,260 --> 00:02:42,140 And again, this says that the runtime system 54 00:02:42,140 --> 00:02:44,348 is allowed to schedule these iterations in parallel, 55 00:02:44,348 --> 00:02:45,890 but doesn't necessarily say that they 56 00:02:45,890 --> 00:02:49,940 have to execute in parallel. 57 00:02:49,940 --> 00:02:53,690 And under the hood, cilk_for statements 58 00:02:53,690 --> 00:02:58,620 are translated into nested cilk_spawn and cilk_sync calls. 59 00:02:58,620 --> 00:03:02,540 So the compiler is going to divide the iteration 60 00:03:02,540 --> 00:03:06,690 space in half, do a cilk_spawn on one of the two halves, 61 00:03:06,690 --> 00:03:08,750 call the other half, and then this 62 00:03:08,750 --> 00:03:12,200 is done recursively until we reach 63 00:03:12,200 --> 00:03:14,420 a certain size for the number of iterations 64 00:03:14,420 --> 00:03:16,190 in a loop, at which point it just 65 00:03:16,190 --> 00:03:19,730 creates a single task for that. 66 00:03:19,730 --> 00:03:22,880 So any questions on the Cilk constructs? 67 00:03:22,880 --> 00:03:23,650 Yes? 68 00:03:23,650 --> 00:03:27,680 AUDIENCE: Is Cilk smart enough to recognize issues 69 00:03:27,680 --> 00:03:30,985 with reading and writing for matrix transpose? 70 00:03:30,985 --> 00:03:32,360 JULIAN SHUN: So it's actually not 71 00:03:32,360 --> 00:03:36,950 going to figure out whether the iterations are 72 00:03:36,950 --> 00:03:37,840 independent for you. 73 00:03:37,840 --> 00:03:40,910 The programmer actually has to reason about that. 74 00:03:40,910 --> 00:03:44,090 But Cilk does have a nice tool, which we'll talk about, 75 00:03:44,090 --> 00:03:47,690 that will tell you which places your code might possibly 76 00:03:47,690 --> 00:03:50,540 be reading and writing the same memory location, 77 00:03:50,540 --> 00:03:53,640 and that allows you to localize any possible race 78 00:03:53,640 --> 00:03:54,390 bugs in your code. 79 00:03:54,390 --> 00:03:57,020 So we'll actually talk about races. 80 00:03:57,020 --> 00:03:58,710 But if you just compile this code, 81 00:03:58,710 --> 00:04:03,462 Cilk isn't going to know whether the iterations are independent. 82 00:04:07,530 --> 00:04:13,000 So determinacy races-- so race conditions 83 00:04:13,000 --> 00:04:15,020 are the bane of concurrency. 84 00:04:15,020 --> 00:04:18,670 So you don't want to have race conditions in your code. 85 00:04:18,670 --> 00:04:23,480 And there are these two famous race bugs that cause disaster. 86 00:04:23,480 --> 00:04:27,850 So there is this Therac-25 radiation therapy machine, 87 00:04:27,850 --> 00:04:30,650 and there was a race condition in the software. 88 00:04:30,650 --> 00:04:32,610 And this led to three people being killed 89 00:04:32,610 --> 00:04:36,100 and many more being seriously injured. 90 00:04:36,100 --> 00:04:39,040 The North American blackout of 2003 91 00:04:39,040 --> 00:04:41,530 was also caused by a race bug in the software, 92 00:04:41,530 --> 00:04:45,110 and this left 50 million people without power. 93 00:04:45,110 --> 00:04:47,050 So these are very bad. 
94 00:04:47,050 --> 00:04:49,450 And they're notoriously difficult to discover 95 00:04:49,450 --> 00:04:50,650 by conventional testing. 96 00:04:50,650 --> 00:04:52,870 So race bugs aren't going to appear every time 97 00:04:52,870 --> 00:04:54,405 you execute your program. 98 00:04:54,405 --> 00:04:59,980 And in fact, the hardest ones to find, which cause these events, 99 00:04:59,980 --> 00:05:01,712 are actually very rare events. 100 00:05:01,712 --> 00:05:03,670 So most of the times when you run your program, 101 00:05:03,670 --> 00:05:05,212 you're not going to see the race bug. 102 00:05:05,212 --> 00:05:07,110 Only very rarely will you see it. 103 00:05:07,110 --> 00:05:10,512 So this makes it very hard to find these race bugs. 104 00:05:10,512 --> 00:05:12,220 And furthermore, when you see a race bug, 105 00:05:12,220 --> 00:05:14,110 it doesn't necessarily always happen 106 00:05:14,110 --> 00:05:15,662 in the same place in your code. 107 00:05:15,662 --> 00:05:16,870 So that makes it even harder. 108 00:05:19,490 --> 00:05:20,920 So what is a race? 109 00:05:20,920 --> 00:05:24,925 So a determinacy race is one of the most basic forms of races. 110 00:05:24,925 --> 00:05:27,550 And a determinacy race occurs when 111 00:05:27,550 --> 00:05:29,500 two logically parallel instructions 112 00:05:29,500 --> 00:05:32,560 access the same memory location, and at least one 113 00:05:32,560 --> 00:05:35,950 of these instructions performs a write to that location. 114 00:05:35,950 --> 00:05:39,500 So let's look at a simple example. 115 00:05:39,500 --> 00:05:43,030 So in this code here, I'm first setting x equal to 0. 116 00:05:43,030 --> 00:05:45,790 And then I have a cilk_for loop with two iterations, 117 00:05:45,790 --> 00:05:47,680 and each of the two iterations are 118 00:05:47,680 --> 00:05:50,140 incrementing this variable x. 119 00:05:50,140 --> 00:05:55,090 And then at the end, I'm going to assert that x is equal to 2. 120 00:05:55,090 --> 00:05:58,820 So there's actually a race in this program here. 121 00:05:58,820 --> 00:06:01,540 So in order to understand where the race occurs, 122 00:06:01,540 --> 00:06:05,230 let's look at the execution graph here. 123 00:06:05,230 --> 00:06:08,200 So I'm going to label each of these statements with a letter. 124 00:06:08,200 --> 00:06:12,940 The first statement, a, is just setting x equal to 0. 125 00:06:12,940 --> 00:06:14,500 And then after that, we're actually 126 00:06:14,500 --> 00:06:16,780 going to have two parallel paths, because we 127 00:06:16,780 --> 00:06:19,060 have two iterations of this cilk_for loop, which 128 00:06:19,060 --> 00:06:21,190 can execute in parallel. 129 00:06:21,190 --> 00:06:26,840 And each of these paths are going to increment x by 1. 130 00:06:26,840 --> 00:06:30,850 And then finally, we're going to assert that x is equal to 2 131 00:06:30,850 --> 00:06:33,010 at the end. 132 00:06:33,010 --> 00:06:36,310 And this sort of graph is known as a dependency graph. 133 00:06:36,310 --> 00:06:38,620 It tells you what instructions have 134 00:06:38,620 --> 00:06:41,360 to finish before you execute the next instruction. 135 00:06:41,360 --> 00:06:43,840 So here it says that B and C must 136 00:06:43,840 --> 00:06:46,013 wait for A to execute before they proceed, 137 00:06:46,013 --> 00:06:48,430 but B and C can actually happen in parallel, because there 138 00:06:48,430 --> 00:06:49,840 is no dependency among them. 139 00:06:49,840 --> 00:06:55,300 And then D has to happen after B and C finish. 
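For concreteness, here is a sketch of the racy program just described, written with OpenCilk-style keywords (the slide's version may differ in minor details):

```c
#include <assert.h>
#include <cilk/cilk.h>

int main(void) {
  int x = 0;                 /* statement A */
  cilk_for (int i = 0; i < 2; ++i) {
    x++;                     /* statements B and C: two logically parallel
                                read-modify-writes of the same location x */
  }
  assert(x == 2);            /* statement D: can fail when B and C interleave */
  return 0;
}
```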
140 00:06:55,300 --> 00:06:57,940 So to understand why there's a race bug here, 141 00:06:57,940 --> 00:07:00,190 we actually need to take a closer look 142 00:07:00,190 --> 00:07:01,640 at this dependency graph. 143 00:07:01,640 --> 00:07:04,370 So let's take a closer look. 144 00:07:04,370 --> 00:07:08,620 So when you run this code, x plus plus 145 00:07:08,620 --> 00:07:12,650 is actually going to be translated into three steps. 146 00:07:12,650 --> 00:07:14,530 So first, we're going to load the value 147 00:07:14,530 --> 00:07:19,030 of x into some processor's register, r1. 148 00:07:19,030 --> 00:07:20,980 And then we're going to increment r1, 149 00:07:20,980 --> 00:07:24,830 and then we're going to set x equal to the result of r1. 150 00:07:24,830 --> 00:07:25,970 And the same thing for r2. 151 00:07:25,970 --> 00:07:30,160 We're going to load x into register r2, increment r2, 152 00:07:30,160 --> 00:07:32,070 and then set x equal to r2. 153 00:07:35,620 --> 00:07:43,990 So here, we have a race, because both of these stores, 154 00:07:43,990 --> 00:07:46,420 x equal to r1 and x equal to r2, 155 00:07:46,420 --> 00:07:49,840 are actually writing to the same memory location. 156 00:07:49,840 --> 00:07:53,710 So let's look at one possible execution of this computation 157 00:07:53,710 --> 00:07:54,460 graph. 158 00:07:54,460 --> 00:07:58,195 And we're going to keep track of the values of x, r1 and r2. 159 00:08:00,722 --> 00:08:02,680 So the first instruction we're going to execute 160 00:08:02,680 --> 00:08:04,120 is x equal to 0. 161 00:08:04,120 --> 00:08:08,290 So we just set x equal to 0, and everything's good so far. 162 00:08:08,290 --> 00:08:11,560 And then next, we can actually pick one of two instructions 163 00:08:11,560 --> 00:08:15,610 to execute, because both of these two instructions 164 00:08:15,610 --> 00:08:19,090 have their predecessors satisfied already. 165 00:08:19,090 --> 00:08:20,900 Their predecessors have already executed. 166 00:08:20,900 --> 00:08:26,090 So let's say I pick r1 equal to x to execute. 167 00:08:26,090 --> 00:08:31,070 And this is going to place the value 0 into register r1. 168 00:08:31,070 --> 00:08:33,460 Now I'm going to increment r1, so this 169 00:08:33,460 --> 00:08:36,940 changes the value in r1 to 1. 170 00:08:36,940 --> 00:08:41,140 Then now, let's say I execute r2 equal to x. 171 00:08:41,140 --> 00:08:44,020 So that's going to read x, which has a value of 0. 172 00:08:44,020 --> 00:08:46,700 It's going to place the value of 0 into r2. 173 00:08:46,700 --> 00:08:48,550 It's going to increment r2. 174 00:08:48,550 --> 00:08:50,605 That's going to change that value to 1. 175 00:08:50,605 --> 00:08:54,550 And then now, let's say I write r2 back to x. 176 00:08:54,550 --> 00:08:58,460 So I'm going to place a value of 1 into x. 177 00:08:58,460 --> 00:09:02,250 Then now, when I execute this instruction, x equal to r1, 178 00:09:02,250 --> 00:09:06,460 it's also placing a value of 1 into x. 179 00:09:06,460 --> 00:09:09,190 And then finally, when I do the assertion, 180 00:09:09,190 --> 00:09:12,840 this value here is not equal to 2, and that's wrong. 181 00:09:12,840 --> 00:09:14,590 Because if you executed this sequentially, 182 00:09:14,590 --> 00:09:18,050 you would get a value of 2 here.
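To make the bad schedule concrete, here is that interleaving written out as explicit load/increment/store steps; r1 and r2 are ordinary variables standing in for the two processors' registers, and the step numbers match the order just described (illustrative only):

```c
#include <assert.h>

int main(void) {
  int x = 0;        /* 1:            x  = 0                             */
  int r1 = x;       /* 2: strand B   r1 = 0                             */
  r1 = r1 + 1;      /* 3: strand B   r1 = 1                             */
  int r2 = x;       /* 4: strand C   r2 = 0  (reads x before B's store) */
  r2 = r2 + 1;      /* 5: strand C   r2 = 1                             */
  x = r2;           /* 6: strand C   x  = 1                             */
  x = r1;           /* 7: strand B   x  = 1  (overwrites step 6)        */
  assert(x == 2);   /* 8: fails, since one increment was lost           */
  return 0;
}
```

Executing the steps in the order 1, 2, 3, 7, 4, 5, 6, 8 instead would leave x equal to 2, which is why the bug only shows up for some schedules.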
183 00:09:18,050 --> 00:09:20,530 And the reason-- as I said, the reason why this occurs 184 00:09:20,530 --> 00:09:22,645 is that we have multiple writes 185 00:09:22,645 --> 00:09:25,360 to the same shared memory location, which 186 00:09:25,360 --> 00:09:27,910 could execute in parallel. 187 00:09:27,910 --> 00:09:32,020 And one of the nasty things about this example 188 00:09:32,020 --> 00:09:34,850 here is that the race bug doesn't necessarily always 189 00:09:34,850 --> 00:09:35,350 occur. 190 00:09:35,350 --> 00:09:38,800 So does anyone see why this race bug doesn't necessarily 191 00:09:38,800 --> 00:09:39,850 always show up? 192 00:09:42,730 --> 00:09:43,595 Yes? 193 00:09:43,595 --> 00:09:46,515 AUDIENCE: [INAUDIBLE] 194 00:09:48,748 --> 00:09:49,540 JULIAN SHUN: Right. 195 00:09:49,540 --> 00:09:53,750 So the answer is that if one of these two branches 196 00:09:53,750 --> 00:09:55,520 executes all three of its instructions 197 00:09:55,520 --> 00:09:59,330 before we start the other one, then the final result in x 198 00:09:59,330 --> 00:10:01,020 is going to be 2, which is correct. 199 00:10:01,020 --> 00:10:03,650 So if I executed these instructions 200 00:10:03,650 --> 00:10:08,690 in order of 1, 2, 3, 7, 4, 5, 6, and then, finally, 8, the value 201 00:10:08,690 --> 00:10:11,960 is going to be 2 in x. 202 00:10:11,960 --> 00:10:15,960 So the race bug here doesn't necessarily always occur. 203 00:10:15,960 --> 00:10:20,470 And this is one thing that makes these bugs hard to find. 204 00:10:20,470 --> 00:10:21,500 So any questions? 205 00:10:30,030 --> 00:10:34,370 So there are two different types of determinacy races. 206 00:10:34,370 --> 00:10:36,990 And they're shown in this table here. 207 00:10:36,990 --> 00:10:40,010 So let's suppose that instruction A and instruction 208 00:10:40,010 --> 00:10:44,660 B both access some location x, and suppose A is parallel to B. 209 00:10:44,660 --> 00:10:48,720 So both of the instructions can execute in parallel. 210 00:10:48,720 --> 00:10:51,440 So if A and B are just reading that location, 211 00:10:51,440 --> 00:10:52,160 then that's fine. 212 00:10:52,160 --> 00:10:54,400 You don't actually have a race here. 213 00:10:54,400 --> 00:10:56,270 But if one of the two instructions 214 00:10:56,270 --> 00:10:59,150 is writing to that location, whereas the other one is 215 00:10:59,150 --> 00:11:01,400 reading from that location, then you 216 00:11:01,400 --> 00:11:03,320 have what's called a read race. 217 00:11:03,320 --> 00:11:06,950 And the program might have a non-deterministic result 218 00:11:06,950 --> 00:11:09,800 when you have a read race, because the final answer might 219 00:11:09,800 --> 00:11:13,250 depend on whether the reading instruction sees the value 220 00:11:13,250 --> 00:11:16,820 before the writing instruction updates it, or whether it sees 221 00:11:16,820 --> 00:11:19,680 the updated value afterwards. 222 00:11:19,680 --> 00:11:23,090 So the order of the execution of A and B 223 00:11:23,090 --> 00:11:26,420 can affect the final result that you see. 224 00:11:26,420 --> 00:11:28,340 And finally, if both A and B write 225 00:11:28,340 --> 00:11:32,420 to the same shared location, then you have a write race. 226 00:11:32,420 --> 00:11:35,030 And again, this will cause non-deterministic behavior 227 00:11:35,030 --> 00:11:37,610 in your program, because the final answer could depend on 228 00:11:37,610 --> 00:11:42,260 whether A did the write first or B did the write first.
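As a small illustration of the two cases (the functions and globals here are made up for this sketch, not taken from the slides), a read race and a write race might look like:

```c
#include <cilk/cilk.h>

int x = 0, y = 0;

void read_x(void)  { y = x; }   /* reads the shared location x  */
void write_x(void) { x = 1; }   /* writes the shared location x */

void read_race_example(void) {
  cilk_spawn read_x();          /* one strand reads x ...                  */
  x = 2;                        /* ... in parallel with a write: read race */
  cilk_sync;
}

void write_race_example(void) {
  cilk_spawn write_x();         /* one strand writes x ...                       */
  x = 3;                        /* ... in parallel with another write: write race */
  cilk_sync;
}
```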
229 00:11:42,260 --> 00:11:44,180 And we say that two sections of code 230 00:11:44,180 --> 00:11:49,200 are independent if there are no determinacy races between them. 231 00:11:49,200 --> 00:11:52,040 So the two pieces of code can't have a shared location, 232 00:11:52,040 --> 00:11:55,490 where one computation writes to it 233 00:11:55,490 --> 00:11:58,160 and another computation reads from it, 234 00:11:58,160 --> 00:12:03,200 or where both computations write to that location. 235 00:12:03,200 --> 00:12:04,820 Any questions on the definition? 236 00:12:09,660 --> 00:12:12,810 So races are really bad, and you should avoid 237 00:12:12,810 --> 00:12:16,590 having races in your program. 238 00:12:16,590 --> 00:12:19,060 So here are some tips on how to avoid races. 239 00:12:19,060 --> 00:12:22,140 So I can tell you not to write races in your program, 240 00:12:22,140 --> 00:12:25,073 and you know that races are bad, but sometimes, 241 00:12:25,073 --> 00:12:26,490 when you're writing code, you just 242 00:12:26,490 --> 00:12:28,740 have races in your program, and you can't help it. 243 00:12:28,740 --> 00:12:33,270 But here are some tips on how you can avoid races. 244 00:12:33,270 --> 00:12:36,733 So first, the iterations of a cilk_for loop 245 00:12:36,733 --> 00:12:37,650 should be independent. 246 00:12:37,650 --> 00:12:40,380 So you should make sure that the different iterations 247 00:12:40,380 --> 00:12:44,095 of a cilk_for loop aren't writing to the same memory 248 00:12:44,095 --> 00:12:44,595 location. 249 00:12:47,310 --> 00:12:50,070 Secondly, between a cilk_spawn statement 250 00:12:50,070 --> 00:12:53,820 and a corresponding cilk_sync, the code of the spawned child 251 00:12:53,820 --> 00:12:57,150 should be independent of the code of the parent. 252 00:12:57,150 --> 00:13:01,440 And this includes code that's executed by any children 253 00:13:01,440 --> 00:13:04,348 that the spawned child itself spawns or calls. 254 00:13:04,348 --> 00:13:06,390 So you should make sure that these pieces of code 255 00:13:06,390 --> 00:13:08,040 are independent-- there are no read 256 00:13:08,040 --> 00:13:09,340 or write races between them. 257 00:13:12,370 --> 00:13:15,180 One thing to note is that the arguments to a spawned function 258 00:13:15,180 --> 00:13:17,820 are evaluated in the parent before the spawn actually 259 00:13:17,820 --> 00:13:18,320 occurs. 260 00:13:18,320 --> 00:13:21,510 So you can't get a race in the argument evaluation, 261 00:13:21,510 --> 00:13:25,620 because the parent is going to evaluate these arguments. 262 00:13:25,620 --> 00:13:29,470 And there's only one thread that's doing this, 263 00:13:29,470 --> 00:13:32,100 so it's fine. 264 00:13:32,100 --> 00:13:35,490 And another thing to note is that the machine word 265 00:13:35,490 --> 00:13:36,743 size matters. 266 00:13:36,743 --> 00:13:38,160 So you need to watch out for races 267 00:13:38,160 --> 00:13:42,990 when you're reading and writing to packed data structures. 268 00:13:42,990 --> 00:13:44,250 So here's an example. 269 00:13:44,250 --> 00:13:49,050 I have a struct x with two chars, a and b. 270 00:13:49,050 --> 00:13:54,990 And updating x.a and x.b may possibly cause a race. 271 00:13:54,990 --> 00:13:57,240 And this is a nasty race, because it 272 00:13:57,240 --> 00:14:00,790 depends on the compiler optimization level. 273 00:14:00,790 --> 00:14:02,758 Fortunately, this is safe on the Intel machines 274 00:14:02,758 --> 00:14:04,050 that we're using in this class.
275 00:14:04,050 --> 00:14:06,450 You can't get a race in this example. 276 00:14:06,450 --> 00:14:07,860 But there are other architectures 277 00:14:07,860 --> 00:14:12,780 that might have a race when you're updating the two 278 00:14:12,780 --> 00:14:15,138 variables a and b in this case. 279 00:14:15,138 --> 00:14:16,680 So with the Intel machines that we're 280 00:14:16,680 --> 00:14:20,580 using, if you're using standard data types like chars, shorts, 281 00:14:20,580 --> 00:14:25,560 ints, and longs inside a struct, you won't get races. 282 00:14:25,560 --> 00:14:27,750 But if you're using non-standard types-- 283 00:14:27,750 --> 00:14:30,510 for example, you're using the C bit-field facilities, 284 00:14:30,510 --> 00:14:35,220 and the sizes of the fields are not one of the standard sizes, 285 00:14:35,220 --> 00:14:38,320 then you could possibly get a race. 286 00:14:38,320 --> 00:14:42,900 In particular, if you're updating individual bits 287 00:14:42,900 --> 00:14:47,370 inside a word in parallel, then you might see a race there. 288 00:14:47,370 --> 00:14:48,510 So you need to be careful. 289 00:14:51,070 --> 00:14:52,155 Questions? 290 00:14:59,510 --> 00:15:02,290 So fortunately, the Cilk platform 291 00:15:02,290 --> 00:15:04,690 has a very nice tool called the-- 292 00:15:04,690 --> 00:15:05,440 yes, question? 293 00:15:05,440 --> 00:15:09,970 AUDIENCE: [INAUDIBLE] was going to ask, what causes that race? 294 00:15:09,970 --> 00:15:13,120 JULIAN SHUN: Because the architecture might actually 295 00:15:13,120 --> 00:15:18,700 be updating this struct at the granularity of more 296 00:15:18,700 --> 00:15:20,950 than 1 byte. 297 00:15:20,950 --> 00:15:25,440 So if you're updating single bytes inside this larger word, 298 00:15:25,440 --> 00:15:27,670 then that might cause a race. 299 00:15:30,088 --> 00:15:32,380 But fortunately, this doesn't happen on Intel machines. 300 00:15:35,140 --> 00:15:38,950 So the Cilksan race detector-- 301 00:15:38,950 --> 00:15:41,380 if you compile your code using this flag, 302 00:15:41,380 --> 00:15:45,820 -fsanitize=cilk, then 303 00:15:45,820 --> 00:15:49,300 it's going to generate a Cilksan-instrumented program. 304 00:15:49,300 --> 00:15:53,950 And then if an ostensibly deterministic Cilk program 305 00:15:53,950 --> 00:15:57,280 run on a given input could possibly behave any differently 306 00:15:57,280 --> 00:16:00,250 than its serial elision, then Cilksan 307 00:16:00,250 --> 00:16:02,800 is going to guarantee to report and localize 308 00:16:02,800 --> 00:16:05,170 the offending race. 309 00:16:05,170 --> 00:16:08,770 So Cilksan is going to tell you which memory location there 310 00:16:08,770 --> 00:16:12,250 might be a race on and which of the instructions 311 00:16:12,250 --> 00:16:15,100 were involved in this race. 312 00:16:15,100 --> 00:16:17,710 So Cilksan employs a regression test methodology 313 00:16:17,710 --> 00:16:20,740 where the programmer provides it different test inputs. 314 00:16:20,740 --> 00:16:23,020 And for each test input, if there could possibly 315 00:16:23,020 --> 00:16:28,630 be a race in the program, then it will report these races. 316 00:16:28,630 --> 00:16:32,290 And it identifies the file names, the lines, 317 00:16:32,290 --> 00:16:34,780 the variables involved in the races, 318 00:16:34,780 --> 00:16:36,130 including the stack traces.
319 00:16:36,130 --> 00:16:39,430 So it's very helpful when you're trying to debug your code 320 00:16:39,430 --> 00:16:43,930 and find out where there's a race in your program. 321 00:16:43,930 --> 00:16:45,490 One thing to note is that you should 322 00:16:45,490 --> 00:16:48,845 ensure that all of your program files are instrumented. 323 00:16:48,845 --> 00:16:51,220 Because if you only instrument some of your files and not 324 00:16:51,220 --> 00:16:53,830 the other ones, then you'll possibly miss out 325 00:16:53,830 --> 00:16:55,240 on some of these race bugs. 326 00:16:58,510 --> 00:17:01,300 And one of the nice things about the Cilksan race detector 327 00:17:01,300 --> 00:17:04,420 is that it's always going to report a race if there 328 00:17:04,420 --> 00:17:08,660 is possibly a race, unlike many other race detectors, which 329 00:17:08,660 --> 00:17:09,520 are best efforts. 330 00:17:09,520 --> 00:17:12,250 So they might report a race some of the times 331 00:17:12,250 --> 00:17:14,650 when the race actually occurs, but they don't necessarily 332 00:17:14,650 --> 00:17:15,790 report a race all the time. 333 00:17:15,790 --> 00:17:18,849 Because in some executions, the race doesn't occur. 334 00:17:18,849 --> 00:17:20,950 But the Cilksan race detector is going 335 00:17:20,950 --> 00:17:23,829 to always report the race, if there is potentially 336 00:17:23,829 --> 00:17:24,550 a race in there. 337 00:17:28,520 --> 00:17:29,850 Cilksan is your best friend. 338 00:17:29,850 --> 00:17:33,720 So use this when you're debugging your homeworks 339 00:17:33,720 --> 00:17:36,090 and projects. 340 00:17:36,090 --> 00:17:39,900 Here's an example of the output that's generated by Cilksan. 341 00:17:39,900 --> 00:17:43,770 So you can see that it's saying that there's a race detected 342 00:17:43,770 --> 00:17:46,410 at this memory address here. 343 00:17:46,410 --> 00:17:51,300 And the line of code that caused this race 344 00:17:51,300 --> 00:17:53,940 is shown here, as well as the file name. 345 00:17:53,940 --> 00:17:56,860 So this is a matrix multiplication example. 346 00:17:56,860 --> 00:17:59,110 And then it also tells you how many races it detected. 347 00:18:04,540 --> 00:18:07,420 So any questions on determinacy races? 348 00:18:16,630 --> 00:18:19,930 So let's now talk about parallelism. 349 00:18:19,930 --> 00:18:21,190 So what is parallelism? 350 00:18:21,190 --> 00:18:25,717 Can we quantitatively define what parallelism is? 351 00:18:25,717 --> 00:18:27,550 So what does it mean when somebody tells you 352 00:18:27,550 --> 00:18:30,900 that their code is highly parallel? 353 00:18:30,900 --> 00:18:34,390 So to have a formal definition of parallelism, 354 00:18:34,390 --> 00:18:38,230 we first need to look at the Cilk execution model. 355 00:18:38,230 --> 00:18:43,480 So this is a code that we saw before for Fibonacci. 356 00:18:43,480 --> 00:18:49,670 Let's now look at what a call to fib of 4 looks like. 357 00:18:49,670 --> 00:18:54,200 So here, I've color coded the different lines of code here 358 00:18:54,200 --> 00:18:55,750 so that I can refer to them when I'm 359 00:18:55,750 --> 00:18:58,480 drawing this computation graph. 360 00:18:58,480 --> 00:19:01,180 So now, I'm going to draw this computation graph corresponding 361 00:19:01,180 --> 00:19:05,210 to how the computation unfolds during execution. 362 00:19:05,210 --> 00:19:07,210 So the first thing I'm going to do 363 00:19:07,210 --> 00:19:09,040 is I'm going to call fib of 4. 
364 00:19:09,040 --> 00:19:11,920 And that's going to generate this magenta node 365 00:19:11,920 --> 00:19:15,070 here corresponding to the call to fib of 4, 366 00:19:15,070 --> 00:19:17,890 and that's going to represent this pink code here. 367 00:19:20,740 --> 00:19:25,558 And this illustration is similar to the computation graphs 368 00:19:25,558 --> 00:19:27,100 that you saw in the previous lecture, 369 00:19:27,100 --> 00:19:29,560 but this is happening in parallel. 370 00:19:29,560 --> 00:19:32,300 And I'm only labeling the argument here, 371 00:19:32,300 --> 00:19:34,730 but you could actually also write the local variables 372 00:19:34,730 --> 00:19:35,230 there. 373 00:19:35,230 --> 00:19:37,990 But I didn't do it, because I want to fit everything 374 00:19:37,990 --> 00:19:38,740 on this slide. 375 00:19:42,220 --> 00:19:44,020 So what happens when you call fib of 4? 376 00:19:44,020 --> 00:19:46,670 It's going to get to this cilk_spawn statement, 377 00:19:46,670 --> 00:19:49,360 and then it's going to call fib of 3. 378 00:19:49,360 --> 00:19:51,850 And when I get to a cilk_spawn statement, what I do 379 00:19:51,850 --> 00:19:54,700 is I'm going to create another node that corresponds 380 00:19:54,700 --> 00:19:57,640 to the child that I spawned. 381 00:19:57,640 --> 00:20:01,840 So this is this magenta node here in this blue box. 382 00:20:01,840 --> 00:20:04,480 And then I also have a continue edge 383 00:20:04,480 --> 00:20:07,240 going to a green node that represents the computation 384 00:20:07,240 --> 00:20:08,810 after the cilk_spawn statement. 385 00:20:08,810 --> 00:20:12,400 So this green node here corresponds to the green line 386 00:20:12,400 --> 00:20:14,260 of code in the code snippet. 387 00:20:18,040 --> 00:20:20,470 Now I can unfold this computation graph 388 00:20:20,470 --> 00:20:22,150 one more step. 389 00:20:22,150 --> 00:20:25,130 So we see that fib 3 is going to call fib of 2, 390 00:20:25,130 --> 00:20:27,400 so I created another node here. 391 00:20:27,400 --> 00:20:30,100 And the green node here, which corresponds 392 00:20:30,100 --> 00:20:32,680 to this green line of code-- it's 393 00:20:32,680 --> 00:20:34,270 also going to make a function call. 394 00:20:34,270 --> 00:20:36,550 It's going to call fib of 2. 395 00:20:36,550 --> 00:20:40,190 And that's also going to create a new node. 396 00:20:40,190 --> 00:20:42,370 So in general, when I do a spawn, 397 00:20:42,370 --> 00:20:47,320 I'm going to have two outgoing edges out of a magenta node. 398 00:20:47,320 --> 00:20:50,110 And when I do a call, I'm going to have one outgoing edge out 399 00:20:50,110 --> 00:20:50,980 of a green node. 400 00:20:50,980 --> 00:20:53,950 So this green node, the outgoing edge 401 00:20:53,950 --> 00:20:55,870 corresponds to a function call. 402 00:20:55,870 --> 00:20:59,410 And for this magenta node, its first outgoing edge 403 00:20:59,410 --> 00:21:02,650 corresponds to spawn, and then its second outgoing edge 404 00:21:02,650 --> 00:21:06,790 goes to the continuation strand. 405 00:21:06,790 --> 00:21:11,170 So I can unfold this one more time. 406 00:21:11,170 --> 00:21:16,090 And here, I see that I'm creating some more spawns 407 00:21:16,090 --> 00:21:17,680 and calls to fib. 408 00:21:17,680 --> 00:21:20,078 And if I do this one more time, I've 409 00:21:20,078 --> 00:21:21,370 actually reached the base case. 410 00:21:21,370 --> 00:21:25,000 Because once n is equal to 1 or 0, 411 00:21:25,000 --> 00:21:28,960 I'm not going to make any more recursive calls. 
412 00:21:28,960 --> 00:21:33,280 And by the way, the color of these boxes that I used here 413 00:21:33,280 --> 00:21:35,530 corresponds to whether I called that function 414 00:21:35,530 --> 00:21:36,700 or whether I spawned it. 415 00:21:36,700 --> 00:21:40,180 So a box with white background corresponds to a function 416 00:21:40,180 --> 00:21:43,347 that I called, whereas a box with blue background 417 00:21:43,347 --> 00:21:45,055 corresponds to a function that I spawned. 418 00:21:48,630 --> 00:21:53,290 So now that I've gotten to the base case, 419 00:21:53,290 --> 00:21:55,930 I need to execute this blue statement, which 420 00:21:55,930 --> 00:21:59,820 sums up x and y and returns the result to the parent caller. 421 00:22:04,070 --> 00:22:06,920 So here I have a blue node. 422 00:22:06,920 --> 00:22:09,920 So this is going to take the results of the two 423 00:22:09,920 --> 00:22:12,420 recursive calls, sum them together. 424 00:22:12,420 --> 00:22:14,540 And I have another blue node here. 425 00:22:14,540 --> 00:22:16,910 And then it's going to pass its value 426 00:22:16,910 --> 00:22:18,860 to the parent that called it. 427 00:22:18,860 --> 00:22:22,880 So I'm going to pass this up to its parent, 428 00:22:22,880 --> 00:22:25,740 and then I'm going to pass this one up as well. 429 00:22:25,740 --> 00:22:29,480 And finally, I have a blue node at the top level, which 430 00:22:29,480 --> 00:22:31,083 is going to compute my final result, 431 00:22:31,083 --> 00:22:33,125 and that's going to be the output of the program. 432 00:22:36,810 --> 00:22:41,760 So one thing to note is that this computation dag 433 00:22:41,760 --> 00:22:44,240 unfolds dynamically during the execution. 434 00:22:44,240 --> 00:22:46,860 So the runtime system isn't going 435 00:22:46,860 --> 00:22:48,930 to create this graph at the beginning. 436 00:22:48,930 --> 00:22:51,570 It's actually going to create this on the fly 437 00:22:51,570 --> 00:22:53,580 as you run the program. 438 00:22:53,580 --> 00:22:58,650 So this graph here unfolds dynamically. 439 00:22:58,650 --> 00:23:01,500 And also, this graph here is processor-oblivious. 440 00:23:01,500 --> 00:23:03,990 So nowhere in this computation dag 441 00:23:03,990 --> 00:23:06,960 did I mention the number of processors 442 00:23:06,960 --> 00:23:08,610 I had for the computation. 443 00:23:08,610 --> 00:23:10,860 And similarly, in the code here, I never 444 00:23:10,860 --> 00:23:13,347 mentioned the number of processors that I'm using. 445 00:23:13,347 --> 00:23:15,180 So the runtime system is going to figure out 446 00:23:15,180 --> 00:23:18,060 how to map these tasks to the number of processors 447 00:23:18,060 --> 00:23:21,932 that you give to the computation dynamically at runtime. 448 00:23:21,932 --> 00:23:24,390 So for example, I can run this on any number of processors. 449 00:23:24,390 --> 00:23:26,520 If I run it on one processor, it's 450 00:23:26,520 --> 00:23:28,782 just going to execute these tasks one after another. 451 00:23:28,782 --> 00:23:30,240 In fact, it's going to execute them 452 00:23:30,240 --> 00:23:33,520 in a depth-first order, which corresponds to what 453 00:23:33,520 --> 00:23:35,610 the sequential algorithm would do. 454 00:23:35,610 --> 00:23:40,320 So I'm going to start with fib of 4, go to fib of 3, fib of 2, 455 00:23:40,320 --> 00:23:43,680 fib of 1, and pop back up and then do fib of 0 456 00:23:43,680 --> 00:23:44,890 and go back up and so on.
457 00:23:44,890 --> 00:23:49,200 So if I use one processor, it's going 458 00:23:49,200 --> 00:23:51,150 to create and execute this computation 459 00:23:51,150 --> 00:23:52,750 dag in the depth-first manner. 460 00:23:52,750 --> 00:23:55,765 And if I have more than one processor, 461 00:23:55,765 --> 00:23:58,140 it's not necessarily going to follow a depth-first order, 462 00:23:58,140 --> 00:24:00,630 because I could have multiple computations going on. 463 00:24:05,640 --> 00:24:08,350 Any questions on this example? 464 00:24:08,350 --> 00:24:10,920 I'm actually going to formally define some terms 465 00:24:10,920 --> 00:24:14,370 on the next slide so that we can formalize the notion 466 00:24:14,370 --> 00:24:17,340 of a computation dag. 467 00:24:17,340 --> 00:24:19,650 So dag stands for directed acyclic graph, 468 00:24:19,650 --> 00:24:21,660 and this is a directed acyclic graph. 469 00:24:21,660 --> 00:24:24,780 So we call it a computation dag. 470 00:24:24,780 --> 00:24:27,210 So a parallel instruction stream is 471 00:24:27,210 --> 00:24:31,830 a dag G with vertices V and edges E. 472 00:24:31,830 --> 00:24:36,000 And each vertex in this dag corresponds to a strand. 473 00:24:36,000 --> 00:24:38,940 And a strand is a sequence of instructions 474 00:24:38,940 --> 00:24:42,420 not containing a spawn, a sync, or a return from a spawn. 475 00:24:42,420 --> 00:24:44,910 So the instructions inside a strand 476 00:24:44,910 --> 00:24:46,590 are executed sequentially. 477 00:24:46,590 --> 00:24:49,800 There's no parallelism within a strand. 478 00:24:49,800 --> 00:24:52,830 We call the first strand the initial strand, 479 00:24:52,830 --> 00:24:56,193 so this is the magenta node up here. 480 00:24:56,193 --> 00:24:58,110 The last strand-- we call it the final strand. 481 00:24:58,110 --> 00:25:02,050 And then everything else, we just call it a strand. 482 00:25:02,050 --> 00:25:05,010 And then there are four types of edges. 483 00:25:05,010 --> 00:25:08,010 So there are spawn edges, call edges, return edges, 484 00:25:08,010 --> 00:25:09,890 or continue edges. 485 00:25:09,890 --> 00:25:14,460 And a spawn edge corresponds to an edge to a function 486 00:25:14,460 --> 00:25:16,420 that you spawned. 487 00:25:16,420 --> 00:25:22,670 So these spawn edges are going to go to a magenta node. 488 00:25:22,670 --> 00:25:25,590 A call edge corresponds to an edge that goes to a function 489 00:25:25,590 --> 00:25:27,330 that you called. 490 00:25:27,330 --> 00:25:30,660 So in this example, these are coming out of the green nodes 491 00:25:30,660 --> 00:25:35,425 and going to a magenta node. 492 00:25:35,425 --> 00:25:38,520 A return edge corresponds to an edge going back up 493 00:25:38,520 --> 00:25:40,320 to the parent caller. 494 00:25:40,320 --> 00:25:44,970 So here, it's going into one of these blue nodes. 495 00:25:44,970 --> 00:25:49,020 And then finally, a continue edge is just the other edge 496 00:25:49,020 --> 00:25:50,140 when you spawn a function. 497 00:25:50,140 --> 00:25:52,170 So this is the edge that goes to the green node. 498 00:25:52,170 --> 00:25:55,020 It's representing the computation 499 00:25:55,020 --> 00:25:56,793 after you spawn something. 
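To tie these edge types back to the code, here is the Fibonacci sketch from earlier once more, annotated roughly with where the strands and the four edge types come from; the exact strand boundaries belong to the execution dag, so treat the comments as a guide rather than a precise mapping:

```c
#include <stdint.h>
#include <cilk/cilk.h>

int64_t fib(int64_t n) {
  if (n < 2) return n;               /* part of this call's initial strand          */
  int64_t x = cilk_spawn fib(n - 1); /* spawn edge to the child fib(n - 1), plus a
                                        continue edge to the strand just below      */
  int64_t y = fib(n - 2);            /* call edge to fib(n - 2); its return edge
                                        comes back when that call finishes          */
  cilk_sync;                         /* the spawned child's return edge joins here  */
  return x + y;                      /* this call's final strand                    */
}
```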
500 00:26:00,420 --> 00:26:03,420 And notice that in this computation dag, 501 00:26:03,420 --> 00:26:06,090 we never explicitly represented cilk_for, 502 00:26:06,090 --> 00:26:07,950 because as I said before, cilk_fors 503 00:26:07,950 --> 00:26:11,370 are converted to nested cilk_spawns 504 00:26:11,370 --> 00:26:12,510 and cilk_sync statements. 505 00:26:12,510 --> 00:26:15,780 So we don't actually need to explicitly represent cilk_fors 506 00:26:15,780 --> 00:26:16,920 in the computation DAG. 507 00:26:20,080 --> 00:26:22,638 Any questions on this definition? 508 00:26:22,638 --> 00:26:24,430 So we're going to be using this computation 509 00:26:24,430 --> 00:26:27,550 dag throughout this lecture to analyze how much parallelism 510 00:26:27,550 --> 00:26:28,775 there is in a program. 511 00:26:39,070 --> 00:26:44,463 So assuming that each of these strands executes in unit time-- 512 00:26:44,463 --> 00:26:46,380 this assumption isn't always true in practice. 513 00:26:46,380 --> 00:26:48,880 In practice, strands will take different amounts of time. 514 00:26:48,880 --> 00:26:50,470 But let's assume, for simplicity, 515 00:26:50,470 --> 00:26:53,740 that each strand here takes unit time. 516 00:26:53,740 --> 00:26:55,960 Does anyone want to guess what the parallelism 517 00:26:55,960 --> 00:26:57,100 of this computation is? 518 00:27:04,100 --> 00:27:06,170 So how parallel do you think this is? 519 00:27:06,170 --> 00:27:09,760 What's the maximum speedup you might get on this computation? 520 00:27:09,760 --> 00:27:10,935 AUDIENCE: 5. 521 00:27:10,935 --> 00:27:11,560 JULIAN SHUN: 5. 522 00:27:11,560 --> 00:27:12,880 Somebody said 5. 523 00:27:12,880 --> 00:27:14,920 Any other guesses? 524 00:27:14,920 --> 00:27:17,540 Who thinks this is going to be less than five? 525 00:27:20,490 --> 00:27:21,698 A couple people. 526 00:27:21,698 --> 00:27:23,490 Who thinks it's going to be more than five? 527 00:27:26,478 --> 00:27:28,383 A couple of people. 528 00:27:28,383 --> 00:27:29,800 Who thinks there's any parallelism 529 00:27:29,800 --> 00:27:31,485 at all in this computation? 530 00:27:36,040 --> 00:27:39,190 Yeah, seems like a lot of people think there is some parallelism 531 00:27:39,190 --> 00:27:40,078 here. 532 00:27:40,078 --> 00:27:42,370 So we're actually going to analyze how much parallelism 533 00:27:42,370 --> 00:27:43,897 is in this computation. 534 00:27:43,897 --> 00:27:45,730 So I'm not going to tell you the answer now, 535 00:27:45,730 --> 00:27:49,300 but I'll tell you in a couple of slides. 536 00:27:49,300 --> 00:27:53,170 First need to go over some terminology. 537 00:27:53,170 --> 00:27:55,930 So whenever you start talking about parallelism, 538 00:27:55,930 --> 00:28:00,250 somebody is almost always going to bring up Amdahl's Law. 539 00:28:00,250 --> 00:28:04,930 And Amdahl's Law says that if 50% of your application 540 00:28:04,930 --> 00:28:08,410 is parallel and the other 50% is serial, 541 00:28:08,410 --> 00:28:11,980 then you can't get more than a factor of 2 speedup, 542 00:28:11,980 --> 00:28:16,600 no matter how many processors you run the computation on. 543 00:28:16,600 --> 00:28:19,350 Does anyone know why this is the case? 544 00:28:22,320 --> 00:28:22,920 Yes? 545 00:28:22,920 --> 00:28:25,395 AUDIENCE: Because you need it to execute for at least 50% 546 00:28:25,395 --> 00:28:27,870 of the time in order to get through the serial portion. 547 00:28:27,870 --> 00:28:28,662 JULIAN SHUN: Right. 
548 00:28:28,662 --> 00:28:30,960 So you have to spend at least 50% 549 00:28:30,960 --> 00:28:33,000 of the time in the serial portion. 550 00:28:33,000 --> 00:28:35,820 So in the best case, if I gave you 551 00:28:35,820 --> 00:28:37,200 an infinite number of processors, 552 00:28:37,200 --> 00:28:40,560 and you can reduce the parallel portion of your code 553 00:28:40,560 --> 00:28:43,920 to 0 running time, you still have the 50% of the serial time 554 00:28:43,920 --> 00:28:45,540 that you have to execute. 555 00:28:45,540 --> 00:28:51,390 And therefore, the best speedup you can get is a factor of 2. 556 00:28:51,390 --> 00:28:55,260 And in general, if a fraction alpha of an application 557 00:28:55,260 --> 00:28:59,130 must be run serially, then the speedup can be at most 1 558 00:28:59,130 --> 00:28:59,950 over alpha. 559 00:28:59,950 --> 00:29:04,500 So if 1/3 of your program has to be executed sequentially, 560 00:29:04,500 --> 00:29:06,480 then the speedup can be, at most, 3. 561 00:29:06,480 --> 00:29:10,800 Because even if you reduce the parallel portion of your code 562 00:29:10,800 --> 00:29:13,620 to have a running time of 0, you still 563 00:29:13,620 --> 00:29:16,320 have the sequential part of your code that you have to wait for. 564 00:29:21,380 --> 00:29:25,790 So let's try to quantify the parallelism in this computation 565 00:29:25,790 --> 00:29:26,600 here. 566 00:29:26,600 --> 00:29:30,620 So how many of these nodes have to be executed sequentially? 567 00:29:40,710 --> 00:29:41,220 Yes? 568 00:29:41,220 --> 00:29:43,740 AUDIENCE: 9 of them. 569 00:29:43,740 --> 00:29:46,140 JULIAN SHUN: So it turns out to be less than 9. 570 00:29:53,288 --> 00:29:53,788 Yes? 571 00:29:53,788 --> 00:29:55,215 AUDIENCE: 7. 572 00:29:55,215 --> 00:29:55,840 JULIAN SHUN: 7. 573 00:29:55,840 --> 00:29:57,670 It turns out to be less than 7. 574 00:30:02,472 --> 00:30:02,972 Yes? 575 00:30:02,972 --> 00:30:03,752 AUDIENCE: 6. 576 00:30:03,752 --> 00:30:05,710 JULIAN SHUN: So it turns out to be less than 6. 577 00:30:09,407 --> 00:30:10,702 AUDIENCE: 4. 578 00:30:10,702 --> 00:30:12,410 JULIAN SHUN: Turns out to be less than 4. 579 00:30:12,410 --> 00:30:14,750 You're getting close. 580 00:30:14,750 --> 00:30:16,250 AUDIENCE: 2. 581 00:30:16,250 --> 00:30:17,660 JULIAN SHUN: 2. 582 00:30:17,660 --> 00:30:19,050 So turns out to be more than 2. 583 00:30:24,762 --> 00:30:26,298 AUDIENCE: 2.5. 584 00:30:26,298 --> 00:30:27,340 JULIAN SHUN: What's left? 585 00:30:27,340 --> 00:30:28,230 AUDIENCE: 3. 586 00:30:28,230 --> 00:30:28,970 JULIAN SHUN: 3. 587 00:30:28,970 --> 00:30:29,470 OK. 588 00:30:31,960 --> 00:30:36,250 So 3 of these nodes have to be executed sequentially. 589 00:30:36,250 --> 00:30:38,330 Because when you're executing these nodes, 590 00:30:38,330 --> 00:30:40,960 there's nothing else that can happen in parallel. 591 00:30:40,960 --> 00:30:43,900 For all of the remaining nodes, when you're executing them, 592 00:30:43,900 --> 00:30:46,510 you can potentially be executing some 593 00:30:46,510 --> 00:30:48,310 of the other nodes in parallel. 594 00:30:48,310 --> 00:30:52,060 But for these three nodes that I've colored in yellow, 595 00:30:52,060 --> 00:30:53,770 you have to execute those sequentially, 596 00:30:53,770 --> 00:30:57,940 because there's nothing else that's going on in parallel. 597 00:30:57,940 --> 00:31:00,790 So according to Amdahl's Law, this 598 00:31:00,790 --> 00:31:04,910 says that the serial fraction of the program is 3 over 18.
599 00:31:04,910 --> 00:31:08,590 So there are 18 nodes in this graph here. 600 00:31:08,590 --> 00:31:11,890 So therefore, the serial fraction is 1 over 6, 601 00:31:11,890 --> 00:31:17,170 and the speedup is upper-bounded by 1 over that, which is 6. 602 00:31:17,170 --> 00:31:20,920 So Amdahl's Law tells us that the maximum speedup we can get 603 00:31:20,920 --> 00:31:23,470 is 6. 604 00:31:23,470 --> 00:31:26,080 Any questions on how I got this number here? 605 00:31:31,450 --> 00:31:34,200 So it turns out that Amdahl's Law actually gives us 606 00:31:34,200 --> 00:31:38,190 a pretty loose upper bound on the parallelism, 607 00:31:38,190 --> 00:31:41,108 and it's not that useful in many practical cases. 608 00:31:41,108 --> 00:31:42,900 So we're actually going to look at a better 609 00:31:42,900 --> 00:31:45,270 definition of parallelism that will give us 610 00:31:45,270 --> 00:31:48,720 a better upper bound on the maximum speedup we can get. 611 00:31:52,060 --> 00:31:55,720 So we're going to define T sub P to be the execution time 612 00:31:55,720 --> 00:31:59,770 of the program on P processors. 613 00:31:59,770 --> 00:32:01,860 And T sub 1 is just the work. 614 00:32:01,860 --> 00:32:05,910 So T sub 1 is if you executed this program on one processor, 615 00:32:05,910 --> 00:32:07,380 how much stuff do you have to do? 616 00:32:07,380 --> 00:32:09,550 And we define that to be the work. 617 00:32:09,550 --> 00:32:12,690 Recall in lecture 2, we looked at many ways 618 00:32:12,690 --> 00:32:14,500 to optimize the work. 619 00:32:14,500 --> 00:32:15,420 This is the work term. 620 00:32:20,450 --> 00:32:23,140 So in this example, the number of nodes 621 00:32:23,140 --> 00:32:26,635 here is 18, so the work is just going to be 18. 622 00:32:31,360 --> 00:32:35,050 We also define T of infinity to be the span. 623 00:32:35,050 --> 00:32:37,420 The span is also called the critical path 624 00:32:37,420 --> 00:32:41,020 length, or the computational depth, of the graph. 625 00:32:41,020 --> 00:32:44,050 And this is equal to the length of the longest directed path 626 00:32:44,050 --> 00:32:48,750 you can find in this graph. 627 00:32:48,750 --> 00:32:51,600 So in this example, the longest path is 9. 628 00:32:51,600 --> 00:32:54,180 So one of the students answered 9 earlier, 629 00:32:54,180 --> 00:32:58,870 and this is actually the span of this graph. 630 00:32:58,870 --> 00:33:01,228 So there are 9 nodes along this path here, 631 00:33:01,228 --> 00:33:02,895 and that's the longest one you can find. 632 00:33:08,790 --> 00:33:12,180 And we call this T of infinity because that's actually 633 00:33:12,180 --> 00:33:14,700 the execution time of this program 634 00:33:14,700 --> 00:33:18,760 if you had an infinite number of processors. 635 00:33:18,760 --> 00:33:20,370 So there are two laws that are going 636 00:33:20,370 --> 00:33:22,320 to relate these quantities. 637 00:33:22,320 --> 00:33:26,520 So the work law says that T sub P 638 00:33:26,520 --> 00:33:30,030 is greater than or equal to T sub 1 divided by P. 639 00:33:30,030 --> 00:33:33,480 So this says that the execution time on P processors 640 00:33:33,480 --> 00:33:35,850 has to be greater than or equal to the work 641 00:33:35,850 --> 00:33:40,020 of the program divided by the number of processors you have. 642 00:33:40,020 --> 00:33:43,090 Does anyone see why the work law is true? 643 00:33:43,090 --> 00:33:47,280 So the answer is that if you have P processors, on each time 644 00:33:47,280 --> 00:33:49,980 step, you can do, at most, P work.
645 00:33:49,980 --> 00:33:53,020 So if you multiply both sides by P, 646 00:33:53,020 --> 00:33:57,480 you get P times T sub P is greater than or equal to T1. 647 00:33:57,480 --> 00:34:00,780 If P times T sub P was less than T1, then 648 00:34:00,780 --> 00:34:03,030 that means you're not done with the computation, 649 00:34:03,030 --> 00:34:05,340 because you haven't done all the work yet. 650 00:34:05,340 --> 00:34:07,560 So the work law says that T sub P 651 00:34:07,560 --> 00:34:12,510 has to be greater than or equal to T1 over P. 652 00:34:12,510 --> 00:34:13,770 Any questions on the work law? 653 00:34:16,900 --> 00:34:18,610 So let's look at another law. 654 00:34:18,610 --> 00:34:20,350 This is called the span law. 655 00:34:20,350 --> 00:34:24,909 It says that T sub P has to be greater than or equal to T sub 656 00:34:24,909 --> 00:34:25,449 infinity. 657 00:34:25,449 --> 00:34:27,460 So the execution time on P processors 658 00:34:27,460 --> 00:34:31,120 has to be at least execution time on an infinite number 659 00:34:31,120 --> 00:34:32,920 of processors. 660 00:34:32,920 --> 00:34:36,780 Anyone know why the span law has to be true? 661 00:34:36,780 --> 00:34:39,570 So another way to see this is that if you 662 00:34:39,570 --> 00:34:41,400 had an infinite number of processors, 663 00:34:41,400 --> 00:34:43,800 you can actually simulate a P processor system. 664 00:34:43,800 --> 00:34:46,320 You just use P of the processors and leave all 665 00:34:46,320 --> 00:34:48,630 the remaining processors idle. 666 00:34:48,630 --> 00:34:51,000 And that can't slow down your program. 667 00:34:51,000 --> 00:34:54,360 So therefore, you have that T sub P 668 00:34:54,360 --> 00:34:56,940 has to be greater than or equal to T sub infinity. 669 00:34:56,940 --> 00:34:58,740 If you add more processors to it, 670 00:34:58,740 --> 00:35:00,660 the running time can't go up. 671 00:35:03,570 --> 00:35:04,663 Any questions? 672 00:35:09,756 --> 00:35:12,040 So let's see how we can compose the work 673 00:35:12,040 --> 00:35:14,890 and the span quantities of different computations. 674 00:35:14,890 --> 00:35:18,100 So let's say I have two computations, A and B. 675 00:35:18,100 --> 00:35:22,780 And let's say that A has to execute before B. 676 00:35:22,780 --> 00:35:24,610 So everything in A has to be done 677 00:35:24,610 --> 00:35:28,120 before I start the computation in B. Let's say 678 00:35:28,120 --> 00:35:32,740 I know what the work of A and the work of B individually are. 679 00:35:32,740 --> 00:35:35,440 What would be the work of A union B? 680 00:35:44,720 --> 00:35:45,220 Yes? 681 00:35:45,220 --> 00:35:49,480 AUDIENCE: I guess it would be T1 A plus T1 B. 682 00:35:49,480 --> 00:35:50,230 JULIAN SHUN: Yeah. 683 00:35:50,230 --> 00:35:51,938 So why is that? 684 00:35:51,938 --> 00:35:54,408 AUDIENCE: Well, you have to execute sequentially. 685 00:35:54,408 --> 00:35:57,866 So then you just take the time and [INAUDIBLE] execute A, 686 00:35:57,866 --> 00:35:59,360 then it'll execute B after that. 687 00:35:59,360 --> 00:36:00,430 JULIAN SHUN: Yeah. 688 00:36:00,430 --> 00:36:03,460 So the work is just going to be the sum of the work of A 689 00:36:03,460 --> 00:36:07,090 and the work of B. Because you have to do all of the work of A 690 00:36:07,090 --> 00:36:09,280 and then do all of the work of B, 691 00:36:09,280 --> 00:36:12,960 so you just add them together. 692 00:36:12,960 --> 00:36:13,920 What about the span? 
693 00:36:13,920 --> 00:36:15,720 So let's say I know the span of A 694 00:36:15,720 --> 00:36:20,100 and I know the span of B. What's the span of A union B? 695 00:36:20,100 --> 00:36:25,230 So again, it's just the sum of the span of A and the span of B. 696 00:36:25,230 --> 00:36:27,240 This is because I have to execute everything 697 00:36:27,240 --> 00:36:33,840 in A before I start B. So I just sum together the spans. 698 00:36:33,840 --> 00:36:36,180 So this is series composition. 699 00:36:36,180 --> 00:36:38,110 What if I do parallel composition? 700 00:36:38,110 --> 00:36:41,070 So let's say here, I'm executing the two 701 00:36:41,070 --> 00:36:44,760 computations in parallel. 702 00:36:44,760 --> 00:36:46,620 What's the work of A union B? 703 00:36:54,305 --> 00:36:56,180 So it's not going to be the maximum. 704 00:36:59,170 --> 00:36:59,670 Yes? 705 00:36:59,670 --> 00:37:01,997 AUDIENCE: It should still be T1 of A plus T1 of B. 706 00:37:01,997 --> 00:37:03,580 JULIAN SHUN: Yeah, so it's still going 707 00:37:03,580 --> 00:37:06,640 to be the sum of T1 of A and T1 of B. 708 00:37:06,640 --> 00:37:08,890 Because you still have the same amount of work 709 00:37:08,890 --> 00:37:10,870 that you have to do. 710 00:37:10,870 --> 00:37:13,120 It's just that you're doing it in parallel. 711 00:37:13,120 --> 00:37:16,662 But the work is just the time if you had one processor. 712 00:37:16,662 --> 00:37:18,370 So if you had one processor, you wouldn't 713 00:37:18,370 --> 00:37:20,380 be executing these in parallel. 714 00:37:20,380 --> 00:37:21,430 What about the span? 715 00:37:21,430 --> 00:37:24,040 So if I know the span of A and the span of B, 716 00:37:24,040 --> 00:37:27,440 what's the span of the parallel composition of the two? 717 00:37:34,310 --> 00:37:34,810 Yes? 718 00:37:34,810 --> 00:37:37,330 AUDIENCE: [INAUDIBLE] 719 00:37:37,330 --> 00:37:41,410 JULIAN SHUN: Yeah, so the span of A union B 720 00:37:41,410 --> 00:37:44,590 is going to be the max of the span of A and the span of B, 721 00:37:44,590 --> 00:37:47,590 because I'm going to be bottlenecked 722 00:37:47,590 --> 00:37:50,140 by the slower of the two computations. 723 00:37:50,140 --> 00:37:52,960 So I just take the one that has the longer span, 724 00:37:52,960 --> 00:37:54,550 and that gives me the overall span. 725 00:37:57,903 --> 00:37:59,340 Any questions? 726 00:38:05,150 --> 00:38:07,160 So here's another definition. 727 00:38:07,160 --> 00:38:14,190 So T1 divided by TP is the speedup on P processors. 728 00:38:14,190 --> 00:38:18,060 If I have T1 divided by TP less than P, then 729 00:38:18,060 --> 00:38:20,010 this means that I have sub-linear speedup. 730 00:38:20,010 --> 00:38:22,290 I'm not making use of all the processors. 731 00:38:22,290 --> 00:38:24,210 Because I'm using P processors, but I'm not 732 00:38:24,210 --> 00:38:27,650 getting a speedup of P. 733 00:38:27,650 --> 00:38:31,050 If T1 over TP is equal to P, then I'm 734 00:38:31,050 --> 00:38:32,820 getting perfect linear speedup. 735 00:38:32,820 --> 00:38:35,370 I'm making use of all of my processors. 736 00:38:35,370 --> 00:38:38,880 I'm putting P times as many resources into my computation, 737 00:38:38,880 --> 00:38:40,900 and it becomes P times faster. 738 00:38:40,900 --> 00:38:42,930 So this is the good case. 739 00:38:42,930 --> 00:38:46,680 And finally, if T1 over TP is greater than P, 740 00:38:46,680 --> 00:38:49,740 we have something called superlinear speedup.
741 00:38:49,740 --> 00:38:51,660 In our simple performance model, this 742 00:38:51,660 --> 00:38:53,800 can't actually happen, because of the work law. 743 00:38:53,800 --> 00:38:58,848 The work law says that TP has to be at least T1 divided by P. 744 00:38:58,848 --> 00:39:00,390 So if you rearrange the terms, you'll 745 00:39:00,390 --> 00:39:03,630 see that we get a contradiction in our model. 746 00:39:03,630 --> 00:39:07,140 In practice, you might sometimes see that you have a superlinear 747 00:39:07,140 --> 00:39:10,410 speedup, because when you're using more processors, 748 00:39:10,410 --> 00:39:12,570 you might have access to more cache, 749 00:39:12,570 --> 00:39:15,420 and that could improve the performance of your program. 750 00:39:15,420 --> 00:39:18,330 But in general, you might see a little bit of superlinear 751 00:39:18,330 --> 00:39:20,260 speedup, but not that much. 752 00:39:20,260 --> 00:39:22,290 And in our simplified model, we're 753 00:39:22,290 --> 00:39:24,880 just going to assume that you can't have a superlinear 754 00:39:24,880 --> 00:39:25,380 speedup. 755 00:39:25,380 --> 00:39:27,990 And getting perfect linear speedup is already very good. 756 00:39:34,220 --> 00:39:40,010 So because the span law says that TP has to be at least T 757 00:39:40,010 --> 00:39:42,770 infinity, the maximum possible speedup 758 00:39:42,770 --> 00:39:45,830 is just going to be T1 divided by T infinity, 759 00:39:45,830 --> 00:39:50,090 and that's the parallelism of your computation. 760 00:39:50,090 --> 00:39:52,610 This is a maximum possible speedup you can get. 761 00:39:52,610 --> 00:39:56,030 Another way to view this is that it's 762 00:39:56,030 --> 00:39:58,100 equal to the average amount of work 763 00:39:58,100 --> 00:40:01,880 that you have to do per step along the span. 764 00:40:01,880 --> 00:40:03,980 So for every step along the span, 765 00:40:03,980 --> 00:40:05,450 you're doing this much work. 766 00:40:05,450 --> 00:40:08,240 And after all the steps, then you've done all of the work. 767 00:40:11,500 --> 00:40:15,580 So what's the parallelism of this computation dag here? 768 00:40:25,807 --> 00:40:26,790 AUDIENCE: 2. 769 00:40:26,790 --> 00:40:27,870 JULIAN SHUN: 2. 770 00:40:27,870 --> 00:40:28,985 Why is it 2? 771 00:40:28,985 --> 00:40:31,560 AUDIENCE: T1 is 18 and T infinity is 9. 772 00:40:31,560 --> 00:40:32,310 JULIAN SHUN: Yeah. 773 00:40:32,310 --> 00:40:33,750 So T1 is 18. 774 00:40:33,750 --> 00:40:36,040 There are 18 nodes in this graph. 775 00:40:36,040 --> 00:40:38,780 T infinity is 9. 776 00:40:38,780 --> 00:40:42,820 And the last time I checked, 18 divided by 9 is 2. 777 00:40:42,820 --> 00:40:45,000 So the parallelism here is 2. 778 00:40:47,680 --> 00:40:51,130 So now we can go back to our Fibonacci example, 779 00:40:51,130 --> 00:40:54,700 and we can also analyze the work and the span of this 780 00:40:54,700 --> 00:40:58,730 and compute the maximum parallelism. 781 00:40:58,730 --> 00:41:01,300 So again, for simplicity, let's assume 782 00:41:01,300 --> 00:41:03,800 that each of these strands takes unit time to execute. 783 00:41:03,800 --> 00:41:05,800 Again, in practice, that's not necessarily true. 784 00:41:05,800 --> 00:41:10,570 But for simplicity, let's just assume that. 785 00:41:10,570 --> 00:41:13,660 So what's the work of this computation? 786 00:41:20,282 --> 00:41:22,190 AUDIENCE: 17. 787 00:41:22,190 --> 00:41:23,270 JULIAN SHUN: 17. 788 00:41:23,270 --> 00:41:24,290 Right. 
789 00:41:24,290 --> 00:41:26,510 So the work is just the number of nodes 790 00:41:26,510 --> 00:41:27,710 you have in this graph. 791 00:41:27,710 --> 00:41:31,580 And you can just count that up, and you get 17. 792 00:41:31,580 --> 00:41:32,450 What about the span? 793 00:41:37,150 --> 00:41:39,590 Somebody said 8. 794 00:41:39,590 --> 00:41:41,570 Yeah, so the span is 8. 795 00:41:41,570 --> 00:41:44,950 And here's the longest path. 796 00:41:44,950 --> 00:41:47,780 So this is the path that has 8 nodes in it, 797 00:41:47,780 --> 00:41:50,570 and that's the longest one you can find here. 798 00:41:50,570 --> 00:41:52,690 So therefore, the parallelism is just 17 799 00:41:52,690 --> 00:41:58,300 divided by 8, which is 2.125. 800 00:41:58,300 --> 00:42:01,000 And so for all of you who guessed that the parallelism 801 00:42:01,000 --> 00:42:04,900 was 2, you were very close. 802 00:42:04,900 --> 00:42:08,710 This tells us that using many more than two processors 803 00:42:08,710 --> 00:42:12,490 can only yield us marginal performance gains. 804 00:42:12,490 --> 00:42:16,040 Because the maximum speedup we can get is 2.125. 805 00:42:16,040 --> 00:42:18,370 So we throw eight processors at this computation, 806 00:42:18,370 --> 00:42:27,530 we're not going to get a speedup beyond 2.125. 807 00:42:27,530 --> 00:42:30,200 So to figure out how much parallelism 808 00:42:30,200 --> 00:42:33,080 is in your computation, you need to analyze 809 00:42:33,080 --> 00:42:36,770 the work of your computation and the span of your computation 810 00:42:36,770 --> 00:42:39,820 and then take the ratio between the two quantities. 811 00:42:39,820 --> 00:42:42,560 But for large computations, it's actually pretty tedious 812 00:42:42,560 --> 00:42:43,730 to analyze this by hand. 813 00:42:43,730 --> 00:42:45,590 You don't want to draw these things out 814 00:42:45,590 --> 00:42:47,960 by hand for a very large computation. 815 00:42:47,960 --> 00:42:51,440 And fortunately, Cilk has a tool called the Cilkscale 816 00:42:51,440 --> 00:42:53,750 Scalability Analyzer. 817 00:42:53,750 --> 00:42:57,140 So this is integrated into the Tapir/LLVM compiler 818 00:42:57,140 --> 00:43:00,420 that you'll be using for this course. 819 00:43:00,420 --> 00:43:04,670 And Cilkscale uses compiler instrumentation 820 00:43:04,670 --> 00:43:07,040 to analyze a serial execution of a program, 821 00:43:07,040 --> 00:43:10,010 and it's going to generate the work and the span quantities 822 00:43:10,010 --> 00:43:12,050 and then use those quantities to derive 823 00:43:12,050 --> 00:43:16,737 upper bounds on the parallel speedup of your program. 824 00:43:16,737 --> 00:43:18,320 So you'll have a chance to play around 825 00:43:18,320 --> 00:43:20,750 with Cilkscale in homework 4. 826 00:43:23,640 --> 00:43:28,800 So let's try to analyze the parallelism of quicksort. 827 00:43:28,800 --> 00:43:32,810 And here, we're using a parallel quicksort algorithm. 828 00:43:32,810 --> 00:43:35,670 The function quicksort here takes two inputs. 829 00:43:35,670 --> 00:43:37,200 These are two pointers. 830 00:43:37,200 --> 00:43:40,750 Left points to the beginning of the array that we want to sort. 831 00:43:40,750 --> 00:43:45,750 Right points to one element after the end of the array. 832 00:43:45,750 --> 00:43:50,880 And what we do is we first check if left is equal to right. 833 00:43:50,880 --> 00:43:53,400 If so, then we just return, because there are no elements 834 00:43:53,400 --> 00:43:54,900 to sort. 
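For reference, here is a minimal Cilk sketch of the parallel quicksort being described here. The partition helper shown below (random pivot, sequential in-place rearrangement, returning a pointer to the pivot's final position) and the exact recursion boundaries are assumptions of this sketch, not necessarily the exact code on the slide.

    #include <stdint.h>
    #include <stdlib.h>
    #include <cilk/cilk.h>

    // Assumed sequential partition: picks a random pivot, moves elements less
    // than the pivot before it and elements greater than or equal to it after
    // it, and returns a pointer to the pivot's final position.
    static int64_t *partition(int64_t *left, int64_t *right) {
      int64_t *pivot_ptr = left + rand() % (right - left);
      int64_t pivot = *pivot_ptr;
      *pivot_ptr = *left;                   // move the pivot to the front
      *left = pivot;
      int64_t *store = left + 1;
      for (int64_t *p = left + 1; p < right; ++p) {
        if (*p < pivot) {                   // smaller elements go to the left side
          int64_t tmp = *p; *p = *store; *store = tmp;
          ++store;
        }
      }
      --store;                              // put the pivot into its final position
      *left = *store;
      *store = pivot;
      return store;
    }

    // left points to the first element; right points one past the last element.
    void quicksort(int64_t *left, int64_t *right) {
      if (left == right) return;            // no elements to sort
      int64_t *pivot = partition(left, right);
      cilk_spawn quicksort(left, pivot);    // sort the elements less than the pivot
      quicksort(pivot + 1, right);          // sort the rest, in parallel with the spawn
      cilk_sync;                            // wait for the spawned call to finish
    }

Note that the partition here is sequential, which is exactly the bottleneck analyzed a little later in the lecture.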
835 00:43:54,900 --> 00:43:57,750 Otherwise, we're going to call this partition function. 836 00:43:57,750 --> 00:44:02,310 The partition function is going to pick a random pivot-- 837 00:44:02,310 --> 00:44:04,830 so this is a randomized quicksort algorithm-- 838 00:44:04,830 --> 00:44:08,610 and then it's going to move everything that's 839 00:44:08,610 --> 00:44:11,190 less than the pivot to the left part of the array 840 00:44:11,190 --> 00:44:13,980 and everything that's greater than 841 00:44:13,980 --> 00:44:16,370 or equal to the pivot to the right part of the array. 842 00:44:16,370 --> 00:44:19,530 It's also going to return us a pointer to the pivot. 843 00:44:19,530 --> 00:44:22,890 And then now we can execute two recursive calls. 844 00:44:22,890 --> 00:44:25,530 So we do quicksort on the left side and quicksort 845 00:44:25,530 --> 00:44:26,280 on the right side. 846 00:44:26,280 --> 00:44:28,450 And this can happen in parallel. 847 00:44:28,450 --> 00:44:31,320 So we use the cilk_spawn here to spawn off one of these calls 848 00:44:31,320 --> 00:44:32,790 to quicksort in parallel. 849 00:44:32,790 --> 00:44:36,030 And therefore, the two recursive calls are parallel. 850 00:44:36,030 --> 00:44:38,160 And then finally, we sync up before we 851 00:44:38,160 --> 00:44:39,300 return from the function. 852 00:44:44,640 --> 00:44:49,080 So let's say we wanted to sort 1 million numbers 853 00:44:49,080 --> 00:44:51,600 with this quicksort algorithm. 854 00:44:51,600 --> 00:44:54,570 And let's also assume that the partition function here 855 00:44:54,570 --> 00:44:56,910 is written sequentially, so you have 856 00:44:56,910 --> 00:45:00,030 to go through all of the elements, one by one. 857 00:45:00,030 --> 00:45:01,890 Can anyone guess what the parallelism 858 00:45:01,890 --> 00:45:05,406 is in this computation? 859 00:45:05,406 --> 00:45:08,400 AUDIENCE: 1 million. 860 00:45:08,400 --> 00:45:10,590 JULIAN SHUN: So the guess was 1 million. 861 00:45:10,590 --> 00:45:11,564 Any other guesses? 862 00:45:19,468 --> 00:45:20,460 AUDIENCE: 50,000. 863 00:45:20,460 --> 00:45:23,620 JULIAN SHUN: 50,000. 864 00:45:23,620 --> 00:45:24,970 Any other guesses? 865 00:45:24,970 --> 00:45:25,656 Yes? 866 00:45:25,656 --> 00:45:26,490 AUDIENCE: 2. 867 00:45:26,490 --> 00:45:28,020 JULIAN SHUN: 2. 868 00:45:28,020 --> 00:45:31,255 It's a good guess. 869 00:45:31,255 --> 00:45:32,740 AUDIENCE: Log 2 of a million. 870 00:45:32,740 --> 00:45:34,660 JULIAN SHUN: Log base 2 of a million. 871 00:45:37,500 --> 00:45:38,820 Any other guesses? 872 00:45:38,820 --> 00:45:45,270 So log base 2 of a million, 2, 50,000, and 1 million. 873 00:45:45,270 --> 00:45:48,520 Anyone think it's more than 1 million? 874 00:45:48,520 --> 00:45:49,020 No. 875 00:45:49,020 --> 00:45:51,000 So no takers on more than 1 million. 876 00:45:54,400 --> 00:45:57,820 So if you run this program using Cilkscale, 877 00:45:57,820 --> 00:46:01,540 it will generate a plot that looks like this. 878 00:46:01,540 --> 00:46:03,260 And there are several lines on this plot. 879 00:46:03,260 --> 00:46:06,970 So let's talk about what each of these lines mean. 880 00:46:06,970 --> 00:46:11,470 So this purple line here is the speedup 881 00:46:11,470 --> 00:46:13,750 that you observe in your computation 882 00:46:13,750 --> 00:46:15,250 when you're running it. 
883 00:46:15,250 --> 00:46:18,910 And you can get that by taking the single processor 884 00:46:18,910 --> 00:46:21,220 running time and dividing it by the running 885 00:46:21,220 --> 00:46:22,540 time on P processors. 886 00:46:22,540 --> 00:46:24,160 So this is the observed speedup. 887 00:46:24,160 --> 00:46:27,280 That's the purple line. 888 00:46:27,280 --> 00:46:32,860 The blue line here is the line that you get from the span law. 889 00:46:32,860 --> 00:46:36,070 So this is T1 over T infinity. 890 00:46:36,070 --> 00:46:41,950 And here, this gives us a bound of about 6 for the parallelism. 891 00:46:41,950 --> 00:46:44,950 The green line is the bound from the work law. 892 00:46:44,950 --> 00:46:50,800 So this is just a linear line with a slope of 1. 893 00:46:50,800 --> 00:46:52,600 It says that on P processors, you 894 00:46:52,600 --> 00:46:55,750 can't get more than a factor of P speedup. 895 00:46:55,750 --> 00:46:58,450 So therefore, the maximum speedup you can get 896 00:46:58,450 --> 00:47:02,840 has to be below the green line and below the blue line. 897 00:47:02,840 --> 00:47:07,780 So you're in this lower right quadrant of the plot. 898 00:47:07,780 --> 00:47:09,340 There's also this orange line, which 899 00:47:09,340 --> 00:47:12,910 is the speedup you would get if you used a greedy scheduler. 900 00:47:12,910 --> 00:47:15,340 We'll talk more about the greedy scheduler 901 00:47:15,340 --> 00:47:18,140 later on in this lecture. 902 00:47:18,140 --> 00:47:21,610 So this is the plot that you would get. 903 00:47:21,610 --> 00:47:27,190 And we see here that the maximum speedup is about 5. 904 00:47:27,190 --> 00:47:31,160 So for those of you who guessed 2 and log base 2 of a million, 905 00:47:31,160 --> 00:47:32,035 you were the closest. 906 00:47:35,500 --> 00:47:38,380 You can also generate a plot that 907 00:47:38,380 --> 00:47:40,630 just tells you the execution time versus the number 908 00:47:40,630 --> 00:47:42,820 of processors. 909 00:47:42,820 --> 00:47:45,550 And you can get this quite easily 910 00:47:45,550 --> 00:47:47,260 just by doing a simple transformation 911 00:47:47,260 --> 00:47:50,050 from the previous plot. 912 00:47:50,050 --> 00:47:52,750 So Cilkscale is going to give you these useful plots that you 913 00:47:52,750 --> 00:47:58,090 can use to figure out how much parallelism is in your program. 914 00:47:58,090 --> 00:48:06,130 And let's see why the parallelism here is so low. 915 00:48:06,130 --> 00:48:09,490 So I said that we were going to execute this partition 916 00:48:09,490 --> 00:48:11,980 function sequentially, and it turns out 917 00:48:11,980 --> 00:48:14,758 that that's actually the bottleneck to the parallelism. 918 00:48:18,610 --> 00:48:22,600 So the expected work of quicksort is order n log n. 919 00:48:22,600 --> 00:48:24,580 So some of you might have seen this 920 00:48:24,580 --> 00:48:27,130 in your previous algorithms courses. 921 00:48:27,130 --> 00:48:29,140 If you haven't seen this yet, then you 922 00:48:29,140 --> 00:48:31,540 can take a look at your favorite textbook, Introduction 923 00:48:31,540 --> 00:48:34,690 to Algorithms. 924 00:48:34,690 --> 00:48:37,240 It turns out that the parallel version of quicksort 925 00:48:37,240 --> 00:48:40,330 also has an expected work bound of order n log n, 926 00:48:40,330 --> 00:48:41,980 if you pick a random pivot. 927 00:48:41,980 --> 00:48:43,120 So the analysis is similar. 928 00:48:45,730 --> 00:48:50,530 The expected span bound turns out to be at least n. 
929 00:48:50,530 --> 00:48:53,170 And this is because on the first level of recursion, 930 00:48:53,170 --> 00:48:56,050 we have to call this partition function, which 931 00:48:56,050 --> 00:48:58,630 is going to go through the elements one by one. 932 00:48:58,630 --> 00:49:01,580 So that already has a linear span. 933 00:49:01,580 --> 00:49:05,920 And it turns out that the overall span is also order n, 934 00:49:05,920 --> 00:49:07,690 because the span actually works out 935 00:49:07,690 --> 00:49:13,980 to be a geometrically decreasing sequence and sums to order n. 936 00:49:13,980 --> 00:49:17,140 And therefore, the maximum parallelism you can get 937 00:49:17,140 --> 00:49:19,210 is order log n. 938 00:49:19,210 --> 00:49:22,540 So you just take the work divided by the span. 939 00:49:22,540 --> 00:49:25,390 So for the student who guessed that the parallelism is log 940 00:49:25,390 --> 00:49:28,540 base 2 of n, that's very good. 941 00:49:28,540 --> 00:49:30,728 Turns out that it's not exactly log base 942 00:49:30,728 --> 00:49:32,770 2 of n, because there are constants in these work 943 00:49:32,770 --> 00:49:37,330 and span bounds, so it's on the order of log of n. 944 00:49:37,330 --> 00:49:38,890 That's the parallelism. 945 00:49:38,890 --> 00:49:42,898 And it turns out that order log n parallelism is not very high. 946 00:49:42,898 --> 00:49:45,190 In general, you want the parallelism to be much higher, 947 00:49:45,190 --> 00:49:49,600 something polynomial in n. 948 00:49:49,600 --> 00:49:52,030 And in order to get more parallelism 949 00:49:52,030 --> 00:49:58,060 in this algorithm, what you have to do 950 00:49:58,060 --> 00:50:00,310 is you have to parallelize this partition 951 00:50:00,310 --> 00:50:02,320 function, because right now I'm 952 00:50:02,320 --> 00:50:04,540 just executing this sequentially. 953 00:50:04,540 --> 00:50:07,630 But you can actually indeed write a parallel partition 954 00:50:07,630 --> 00:50:12,520 function that takes linear work and order log n span. 955 00:50:12,520 --> 00:50:15,100 And then this would give you an overall span bound of log 956 00:50:15,100 --> 00:50:16,090 squared n. 957 00:50:16,090 --> 00:50:18,340 And then if you take n log n divided by log squared n, 958 00:50:18,340 --> 00:50:20,090 that gives you an overall parallelism of n 959 00:50:20,090 --> 00:50:24,532 over log n, which is much higher than order log n here. 960 00:50:24,532 --> 00:50:26,740 And similarly, if you were to implement a merge sort, 961 00:50:26,740 --> 00:50:29,830 you would also need to make sure that the merging routine is 962 00:50:29,830 --> 00:50:31,330 implemented in parallel, if you want 963 00:50:31,330 --> 00:50:32,590 to see significant speedup. 964 00:50:32,590 --> 00:50:35,165 So not only do you have to execute the two recursive calls 965 00:50:35,165 --> 00:50:36,790 in parallel, you also need to make sure 966 00:50:36,790 --> 00:50:41,790 that the merging portion of the code is done in parallel. 967 00:50:41,790 --> 00:50:43,040 Any questions on this example? 968 00:50:49,019 --> 00:50:50,936 AUDIENCE: In the graph that you had, sometimes 969 00:50:50,936 --> 00:50:55,610 when you got to higher processor numbers, it got jagged, 970 00:50:55,610 --> 00:50:59,040 and so sometimes adding a processor was making it slower. 971 00:50:59,040 --> 00:51:00,960 What are some reasons [INAUDIBLE]??
972 00:51:00,960 --> 00:51:04,555 JULIAN SHUN: Yeah, so I believe that's just due to noise, 973 00:51:04,555 --> 00:51:06,680 because there's some noise going on in the machine. 974 00:51:06,680 --> 00:51:08,720 So if you ran it enough times and took 975 00:51:08,720 --> 00:51:12,110 the average or the median, it should be always going up, 976 00:51:12,110 --> 00:51:14,000 or it shouldn't be decreasing, at least. 977 00:51:17,380 --> 00:51:17,880 Yes? 978 00:51:17,880 --> 00:51:22,740 AUDIENCE: So [INAUDIBLE] is also [INAUDIBLE]?? 979 00:51:27,600 --> 00:51:29,650 JULIAN SHUN: So at one level of recursion, 980 00:51:29,650 --> 00:51:33,060 the partition function takes order log n span. 981 00:51:33,060 --> 00:51:35,580 You can show that there are log n levels of recursion 982 00:51:35,580 --> 00:51:37,660 in this quicksort algorithm. 983 00:51:37,660 --> 00:51:40,360 I didn't go over the details of this analysis, 984 00:51:40,360 --> 00:51:42,690 but you can show that. 985 00:51:42,690 --> 00:51:44,190 And then therefore, the overall span 986 00:51:44,190 --> 00:51:45,930 is going to be order log squared n. 987 00:51:45,930 --> 00:51:47,820 And I can show you on the board after class, 988 00:51:47,820 --> 00:51:50,010 if you're interested, or I can give you a reference. 989 00:51:53,090 --> 00:51:54,020 Other questions? 990 00:51:59,640 --> 00:52:04,020 So it turns out that in addition to quicksort, 991 00:52:04,020 --> 00:52:06,540 there are also many other interesting practical parallel 992 00:52:06,540 --> 00:52:07,570 algorithms out there. 993 00:52:07,570 --> 00:52:09,270 So here, I've listed a few of them. 994 00:52:09,270 --> 00:52:12,480 And by practical, I mean that the Cilk program running 995 00:52:12,480 --> 00:52:14,820 on one processor is competitive with the best 996 00:52:14,820 --> 00:52:17,640 sequential program for that problem. 997 00:52:17,640 --> 00:52:22,500 And so you can see that I've listed the work and the span 998 00:52:22,500 --> 00:52:23,880 of merge sort here. 999 00:52:23,880 --> 00:52:26,580 And if you implement the merge in parallel, 1000 00:52:26,580 --> 00:52:28,350 the span of the overall computation 1001 00:52:28,350 --> 00:52:29,370 would be log cubed n. 1002 00:52:29,370 --> 00:52:32,905 And n log n divided by log cubed n is n over log squared n. 1003 00:52:32,905 --> 00:52:34,780 That's the parallelism, which is pretty high. 1004 00:52:34,780 --> 00:52:36,930 And in general, all of these computations 1005 00:52:36,930 --> 00:52:39,030 have pretty high parallelism. 1006 00:52:39,030 --> 00:52:42,060 Another thing to note is that these algorithms are practical, 1007 00:52:42,060 --> 00:52:45,120 because their work bound is asymptotically 1008 00:52:45,120 --> 00:52:48,360 equal to the work of the corresponding sequential 1009 00:52:48,360 --> 00:52:49,530 algorithm. 1010 00:52:49,530 --> 00:52:52,040 That's known as a work-efficient parallel algorithm. 1011 00:52:52,040 --> 00:52:54,540 It's actually one of the goals of parallel algorithm design, 1012 00:52:54,540 --> 00:52:57,300 to come up with work-efficient parallel algorithms. 1013 00:52:57,300 --> 00:52:58,830 Because this means that even if you 1014 00:52:58,830 --> 00:53:00,420 have a small number of processors, 1015 00:53:00,420 --> 00:53:04,140 you can still be competitive with a sequential algorithm 1016 00:53:04,140 --> 00:53:06,410 running on one processor.
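Summarizing the sorting bounds discussed above (the quicksort bounds are expected bounds, since the pivot is chosen at random):

\[ \text{Quicksort, sequential partition:} \quad T_1 = \Theta(n \lg n), \quad T_\infty = \Theta(n), \quad \text{parallelism} = \Theta(\lg n) \]
\[ \text{Quicksort, parallel partition:} \quad T_1 = \Theta(n \lg n), \quad T_\infty = \Theta(\lg^2 n), \quad \text{parallelism} = \Theta(n / \lg n) \]
\[ \text{Merge sort, parallel merge:} \quad T_1 = \Theta(n \lg n), \quad T_\infty = \Theta(\lg^3 n), \quad \text{parallelism} = \Theta(n / \lg^2 n) \]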
1017 00:53:06,410 --> 00:53:12,330 And in the next lecture, we'll actually 1018 00:53:12,330 --> 00:53:15,450 see some examples of these other algorithms, 1019 00:53:15,450 --> 00:53:17,550 and possibly even ones not listed on this slide, 1020 00:53:17,550 --> 00:53:20,430 and we'll go over the work and span analysis 1021 00:53:20,430 --> 00:53:22,091 and figure out the parallelism. 1022 00:53:26,020 --> 00:53:29,290 So now I want to move on to talk about some scheduling theory. 1023 00:53:29,290 --> 00:53:32,675 So I talked about these computation dags earlier, 1024 00:53:32,675 --> 00:53:34,300 analyzed the work and the span of them, 1025 00:53:34,300 --> 00:53:37,630 but I never talked about how these different strands are 1026 00:53:37,630 --> 00:53:41,140 actually mapped to processors at runtime. 1027 00:53:41,140 --> 00:53:43,275 So let's talk a little bit about scheduling theory. 1028 00:53:43,275 --> 00:53:45,400 And it turns out that scheduling theory is actually 1029 00:53:45,400 --> 00:53:46,090 very general. 1030 00:53:46,090 --> 00:53:49,900 It's not just limited to parallel programming. 1031 00:53:49,900 --> 00:53:54,280 It's used all over the place in computer science, operations 1032 00:53:54,280 --> 00:53:58,060 research, and math. 1033 00:53:58,060 --> 00:54:00,010 So as a reminder, Cilk allows the programmer 1034 00:54:00,010 --> 00:54:03,460 to express potential parallelism in an application. 1035 00:54:03,460 --> 00:54:05,980 And a Cilk scheduler is going to map these strands 1036 00:54:05,980 --> 00:54:10,750 onto the processors that you have available dynamically 1037 00:54:10,750 --> 00:54:13,690 at runtime. 1038 00:54:13,690 --> 00:54:16,900 Cilk actually uses a distributed scheduler. 1039 00:54:16,900 --> 00:54:19,180 But since the theory of distributed schedulers 1040 00:54:19,180 --> 00:54:21,040 is a little bit complicated, we'll 1041 00:54:21,040 --> 00:54:23,590 actually explore the ideas of scheduling first 1042 00:54:23,590 --> 00:54:25,390 using a centralized scheduler. 1043 00:54:25,390 --> 00:54:29,230 And a centralized scheduler knows everything 1044 00:54:29,230 --> 00:54:31,660 about what's going on in the computation, 1045 00:54:31,660 --> 00:54:34,490 and it can use that to make a good decision. 1046 00:54:34,490 --> 00:54:37,540 So let's first look at what a centralized scheduler does, 1047 00:54:37,540 --> 00:54:39,580 and then I'll talk a little bit about the Cilk 1048 00:54:39,580 --> 00:54:40,570 distributed scheduler. 1049 00:54:40,570 --> 00:54:43,120 And we'll learn more about that in a future lecture as well. 1050 00:54:47,240 --> 00:54:49,710 So we're going to look at a greedy scheduler. 1051 00:54:49,710 --> 00:54:51,490 And the idea of a greedy scheduler 1052 00:54:51,490 --> 00:54:53,770 is to just do as much as possible 1053 00:54:53,770 --> 00:54:56,170 in every step of the computation. 1054 00:54:56,170 --> 00:54:59,480 So has anyone seen greedy algorithms before? 1055 00:54:59,480 --> 00:54:59,980 Right. 1056 00:54:59,980 --> 00:55:02,110 So many of you have seen greedy algorithms before. 1057 00:55:02,110 --> 00:55:03,220 So the idea is similar here. 1058 00:55:03,220 --> 00:55:04,970 We're just going to do as much as possible 1059 00:55:04,970 --> 00:55:06,100 at the current time step. 1060 00:55:06,100 --> 00:55:08,225 We're not going to think too much about the future.
1061 00:55:11,820 --> 00:55:14,190 So we're going to define a ready strand 1062 00:55:14,190 --> 00:55:17,490 to be a strand where all of its predecessors in the computation 1063 00:55:17,490 --> 00:55:20,710 dag have already executed. 1064 00:55:20,710 --> 00:55:22,560 So in this example here, let's say 1065 00:55:22,560 --> 00:55:26,320 I already executed all of these blue strands. 1066 00:55:26,320 --> 00:55:28,740 Then the ones shaded in yellow are 1067 00:55:28,740 --> 00:55:31,170 going to be my ready strands, because they 1068 00:55:31,170 --> 00:55:35,540 have all of their predecessors executed already. 1069 00:55:35,540 --> 00:55:39,600 And there are two types of steps in a greedy scheduler. 1070 00:55:39,600 --> 00:55:44,160 The first kind of step is called a complete step. 1071 00:55:44,160 --> 00:55:50,250 And in a complete step, we have at least P strands ready. 1072 00:55:50,250 --> 00:55:54,600 So if we had P equal to 3, then we have a complete step now, 1073 00:55:54,600 --> 00:55:58,410 because we have 5 strands ready, which is greater than 3. 1074 00:55:58,410 --> 00:56:00,480 So what are we going to do in a complete step? 1075 00:56:00,480 --> 00:56:02,010 What would a greedy scheduler do? 1076 00:56:04,520 --> 00:56:05,020 Yes? 1077 00:56:05,020 --> 00:56:07,995 AUDIENCE: [INAUDIBLE] 1078 00:56:07,995 --> 00:56:10,120 JULIAN SHUN: Yeah, so a greedy scheduler would just 1079 00:56:10,120 --> 00:56:11,880 do as much as it can. 1080 00:56:11,880 --> 00:56:16,190 So it would just run any 3 of these, or any P in general. 1081 00:56:16,190 --> 00:56:20,680 So let's say I picked these 3 to run. 1082 00:56:20,680 --> 00:56:23,920 So it turns out that these are actually the worst 3 to run, 1083 00:56:23,920 --> 00:56:26,920 because they don't enable any new strands to be ready. 1084 00:56:26,920 --> 00:56:30,040 But I can pick those 3. 1085 00:56:30,040 --> 00:56:32,200 And then the incomplete step is one 1086 00:56:32,200 --> 00:56:34,660 where I have fewer than P strands ready. 1087 00:56:34,660 --> 00:56:39,070 So here, I have 2 strands ready, and I have 3 processors. 1088 00:56:39,070 --> 00:56:42,010 So what would I do in an incomplete step? 1089 00:56:42,010 --> 00:56:46,010 AUDIENCE: Just run through the strands that are ready. 1090 00:56:46,010 --> 00:56:48,435 JULIAN SHUN: Yeah, so just run all of them. 1091 00:56:48,435 --> 00:56:50,435 So here, I'm going to execute these two strands. 1092 00:56:52,980 --> 00:56:54,730 And then we're going to use complete steps 1093 00:56:54,730 --> 00:56:57,580 and incomplete steps to analyze the performance 1094 00:56:57,580 --> 00:56:59,350 of the greedy scheduler. 1095 00:56:59,350 --> 00:57:03,130 There's a famous theorem which was first 1096 00:57:03,130 --> 00:57:06,010 shown by Ron Graham in 1968 that says 1097 00:57:06,010 --> 00:57:07,660 that any greedy scheduler achieves 1098 00:57:07,660 --> 00:57:09,610 the following time bound-- 1099 00:57:09,610 --> 00:57:15,610 T sub P is less than or equal to T1 over P plus T infinity. 1100 00:57:15,610 --> 00:57:18,580 And you might recognize the terms on the right hand side-- 1101 00:57:18,580 --> 00:57:22,930 T1 is the work, and T infinity is the span 1102 00:57:22,930 --> 00:57:26,130 that we saw earlier. 1103 00:57:26,130 --> 00:57:29,755 And here's a simple proof for why this time bound holds. 1104 00:57:33,030 --> 00:57:35,810 So we can upper bound the number of complete steps 1105 00:57:35,810 --> 00:57:40,010 in the computation by T1 over P. 
And this 1106 00:57:40,010 --> 00:57:43,060 is because each complete step is going to perform P work. 1107 00:57:43,060 --> 00:57:45,830 So after T1 over P complete steps, 1108 00:57:45,830 --> 00:57:49,010 we'll have done all the work in our computation. 1109 00:57:49,010 --> 00:57:51,710 So that means that the number of complete steps 1110 00:57:51,710 --> 00:57:54,620 can be at most T1 over P. 1111 00:57:54,620 --> 00:57:55,900 So any questions on this? 1112 00:58:02,750 --> 00:58:06,130 So now, let's look at the number of incomplete steps 1113 00:58:06,130 --> 00:58:08,620 we can have. 1114 00:58:08,620 --> 00:58:11,320 So the number of incomplete steps we can have 1115 00:58:11,320 --> 00:58:15,890 is upper bounded by the span, or T infinity. 1116 00:58:15,890 --> 00:58:21,910 And the reason why is that if you look at the unexecuted dag 1117 00:58:21,910 --> 00:58:25,480 right before you execute an incomplete step, 1118 00:58:25,480 --> 00:58:28,090 and you measure the span of that unexecuted dag, 1119 00:58:28,090 --> 00:58:30,880 you'll see that once you execute an incomplete step, 1120 00:58:30,880 --> 00:58:34,240 it's going to reduce the span of that dag by 1. 1121 00:58:34,240 --> 00:58:39,310 So here, this is the span of our unexecuted dag 1122 00:58:39,310 --> 00:58:41,230 that contains just these seven nodes. 1123 00:58:41,230 --> 00:58:43,270 The span of this is 5. 1124 00:58:43,270 --> 00:58:45,070 And when we execute an incomplete step, 1125 00:58:45,070 --> 00:58:48,730 we're going to process all the roots of this unexecuted dag, 1126 00:58:48,730 --> 00:58:51,370 delete them from the dag, and therefore, we're 1127 00:58:51,370 --> 00:58:54,070 going to reduce the length of the longest path by 1. 1128 00:58:54,070 --> 00:58:56,200 So when we execute an incomplete step, 1129 00:58:56,200 --> 00:58:58,480 it decreases the span from 5 to 4. 1130 00:59:01,690 --> 00:59:05,040 And then the time bound up here, T sub P, 1131 00:59:05,040 --> 00:59:09,760 is just upper bounded by the sum of these two types of steps. 1132 00:59:09,760 --> 00:59:13,370 Because after you execute T1 over P complete steps 1133 00:59:13,370 --> 00:59:15,460 and T infinity incomplete steps, you 1134 00:59:15,460 --> 00:59:19,705 must have finished the entire computation. 1135 00:59:19,705 --> 00:59:20,970 So any questions? 1136 00:59:28,590 --> 00:59:31,860 A corollary of this theorem is that any greedy scheduler 1137 00:59:31,860 --> 00:59:35,250 achieves within a factor of 2 of the optimal running time. 1138 00:59:35,250 --> 00:59:38,370 So this is the optimal running time of a scheduler 1139 00:59:38,370 --> 00:59:43,680 that knows everything and can predict the future and so on. 1140 00:59:43,680 --> 00:59:48,330 So let's let TP star be the execution time produced 1141 00:59:48,330 --> 00:59:51,780 by an optimal scheduler. 1142 00:59:51,780 --> 00:59:55,620 We know that TP star has to be at least the max of T1 1143 00:59:55,620 --> 00:59:57,690 over P and T infinity. 1144 00:59:57,690 --> 01:00:01,060 This is due to the work and span laws. 1145 01:00:01,060 --> 01:00:04,530 So it has to be at least the max of these two terms. 1146 01:00:04,530 --> 01:00:08,850 Otherwise, we wouldn't have finished the computation. 1147 01:00:08,850 --> 01:00:12,270 So now we can take the inequality 1148 01:00:12,270 --> 01:00:16,270 we had before for the greedy scheduler bound-- 1149 01:00:16,270 --> 01:00:20,500 so TP is less than or equal to T1 over P plus T infinity.
1150 01:00:20,500 --> 01:00:23,430 And this is upper bounded by 2 times the max of these two 1151 01:00:23,430 --> 01:00:24,280 terms. 1152 01:00:24,280 --> 01:00:30,150 So A plus B is upper bounded by 2 times the max of A and B. 1153 01:00:30,150 --> 01:00:32,580 And then now, the max of T1 over P and T 1154 01:00:32,580 --> 01:00:36,960 infinity is just upper bounded by TP star. 1155 01:00:36,960 --> 01:00:39,420 So we can substitute that in, and we 1156 01:00:39,420 --> 01:00:42,810 get that TP is upper bounded by 2 times 1157 01:00:42,810 --> 01:00:46,440 TP star, which is the running time of the optimal scheduler. 1158 01:00:46,440 --> 01:00:49,230 So the greedy scheduler achieves within a factor 1159 01:00:49,230 --> 01:00:51,555 of 2 of the optimal scheduler. 1160 01:00:57,000 --> 01:00:59,368 Here's another corollary. 1161 01:00:59,368 --> 01:01:00,910 This is a more interesting corollary. 1162 01:01:00,910 --> 01:01:02,850 It says that any greedy scheduler achieves 1163 01:01:02,850 --> 01:01:06,720 near-perfect linear speedup whenever T1 divided by T 1164 01:01:06,720 --> 01:01:12,850 infinity is much greater than P. 1165 01:01:12,850 --> 01:01:14,830 To see why this is true-- 1166 01:01:14,830 --> 01:01:17,350 if we have that T1 over T infinity 1167 01:01:17,350 --> 01:01:20,350 is much greater than P-- 1168 01:01:20,350 --> 01:01:25,612 so the double arrows here mean that the left hand 1169 01:01:25,612 --> 01:01:27,570 side is much greater than the right hand side-- 1170 01:01:27,570 --> 01:01:32,500 then this means that the span is much less than T1 over P. 1171 01:01:32,500 --> 01:01:35,230 And the greedy scheduling theorem gives us 1172 01:01:35,230 --> 01:01:40,630 that TP is less than or equal to T1 over P plus T infinity, 1173 01:01:40,630 --> 01:01:43,150 but T infinity is much less than T1 over P, 1174 01:01:43,150 --> 01:01:45,250 so the first term dominates, and we have 1175 01:01:45,250 --> 01:01:48,940 that TP is approximately equal to T1 over P. 1176 01:01:48,940 --> 01:01:54,760 And therefore, the speedup you get is T1 over TP, which is approximately P. 1177 01:01:54,760 --> 01:01:57,180 And this is linear speedup. 1178 01:02:02,040 --> 01:02:04,910 The quantity T1 divided by the product of P and T 1179 01:02:04,910 --> 01:02:08,270 infinity is known as the parallel slackness. 1180 01:02:08,270 --> 01:02:11,030 So this is basically measuring how much more 1181 01:02:11,030 --> 01:02:13,550 parallelism you have in a computation than the number 1182 01:02:13,550 --> 01:02:15,440 of processors you have. 1183 01:02:15,440 --> 01:02:18,320 And if parallel slackness is very high, 1184 01:02:18,320 --> 01:02:20,000 then this corollary is going to hold, 1185 01:02:20,000 --> 01:02:23,660 and you're going to see near-linear speedup. 1186 01:02:23,660 --> 01:02:26,270 As a rule of thumb, you usually want the parallel slackness 1187 01:02:26,270 --> 01:02:29,600 of your program to be at least 10. 1188 01:02:29,600 --> 01:02:33,590 Because if you have a parallel slackness of just 1, 1189 01:02:33,590 --> 01:02:37,160 you can't actually amortize the overheads of the scheduling 1190 01:02:37,160 --> 01:02:38,030 mechanism. 1191 01:02:38,030 --> 01:02:40,130 So therefore, you want the parallel slackness 1192 01:02:40,130 --> 01:02:43,010 to be at least 10 when you're programming in Cilk. 1193 01:02:50,990 --> 01:02:53,750 So that was the greedy scheduler. 1194 01:02:53,750 --> 01:02:56,650 Let's talk a little bit about the Cilk scheduler.
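For reference, the greedy scheduling results just covered can be written compactly as:

\[ T_P \le \frac{T_1}{P} + T_\infty \quad \text{(greedy scheduling theorem, Graham 1968)} \]
\[ T_P \le 2\, T_P^{\ast}, \quad \text{since } T_P^{\ast} \ge \max\left\{ \frac{T_1}{P},\, T_\infty \right\} \quad \text{(within a factor of 2 of optimal)} \]
\[ \frac{T_1}{T_\infty} \gg P \ \Rightarrow\ T_P \approx \frac{T_1}{P} \ \Rightarrow\ \frac{T_1}{T_P} \approx P \quad \text{(near-perfect linear speedup)} \]
\[ \text{parallel slackness} = \frac{T_1}{P\, T_\infty} \quad \text{(rule of thumb: keep it at least 10)} \]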
1195 01:02:56,650 --> 01:02:59,600 So Cilk uses a work-stealing scheduler, 1196 01:02:59,600 --> 01:03:02,630 and it achieves an expected running time 1197 01:03:02,630 --> 01:03:08,150 of TP equal to T1 over P plus order T infinity. 1198 01:03:08,150 --> 01:03:10,100 So instead of just summing the two terms, 1199 01:03:10,100 --> 01:03:12,720 we actually have a big O in front of the T infinity, 1200 01:03:12,720 --> 01:03:16,820 and this is used to account for the overheads of scheduling. 1201 01:03:16,820 --> 01:03:18,770 The greedy scheduler I presented earlier-- 1202 01:03:18,770 --> 01:03:21,170 I didn't account for any of the overheads of scheduling. 1203 01:03:21,170 --> 01:03:23,720 I just assumed that it could figure out which of the tasks 1204 01:03:23,720 --> 01:03:26,250 to execute. 1205 01:03:26,250 --> 01:03:28,220 So this Cilk work-stealing scheduler 1206 01:03:28,220 --> 01:03:31,730 has this expected time provably, so you 1207 01:03:31,730 --> 01:03:35,990 can prove this using random variables and tail 1208 01:03:35,990 --> 01:03:37,050 bounds of distributions. 1209 01:03:37,050 --> 01:03:39,470 So Charles Leiserson has a paper that 1210 01:03:39,470 --> 01:03:42,140 talks about how to prove this. 1211 01:03:42,140 --> 01:03:46,730 And empirically, we usually see that TP is more like T1 1212 01:03:46,730 --> 01:03:48,830 over P plus T infinity. 1213 01:03:48,830 --> 01:03:52,760 So we usually don't see any big constant in front of the T 1214 01:03:52,760 --> 01:03:56,090 infinity term in practice. 1215 01:03:56,090 --> 01:03:59,780 And therefore, we can get near-perfect linear speedup, 1216 01:03:59,780 --> 01:04:04,250 as long as the number of processors is much less than T1 1217 01:04:04,250 --> 01:04:08,690 over T infinity, the maximum parallelism. 1218 01:04:08,690 --> 01:04:11,780 And as I said earlier, the instrumentation in Cilkscale 1219 01:04:11,780 --> 01:04:14,150 will allow you to measure the work and span 1220 01:04:14,150 --> 01:04:17,060 terms so that you can figure out how much parallelism 1221 01:04:17,060 --> 01:04:20,552 is in your program. 1222 01:04:20,552 --> 01:04:22,034 Any questions? 1223 01:04:28,730 --> 01:04:32,360 So let's talk a little bit about how the Cilk runtime 1224 01:04:32,360 --> 01:04:33,065 system works. 1225 01:04:36,140 --> 01:04:39,950 So in the Cilk runtime system, each worker or processor 1226 01:04:39,950 --> 01:04:42,350 maintains a work deque. 1227 01:04:42,350 --> 01:04:44,180 Deque stands for double-ended queue, 1228 01:04:44,180 --> 01:04:46,160 so it's just short for that. 1229 01:04:46,160 --> 01:04:49,280 It maintains a work deque of ready strands, 1230 01:04:49,280 --> 01:04:51,860 and it manipulates the bottom of the deque, 1231 01:04:51,860 --> 01:04:56,060 just like you would with the stack of a sequential program. 1232 01:04:56,060 --> 01:04:58,490 So here, I have four processors, and each one of them 1233 01:04:58,490 --> 01:05:03,900 has its own deque, and they have these things on the stack: 1234 01:05:03,900 --> 01:05:06,650 function call frames, saved return addresses, 1235 01:05:06,650 --> 01:05:09,860 local variables, and so on. 1236 01:05:09,860 --> 01:05:11,660 So a processor can call a function, 1237 01:05:11,660 --> 01:05:13,700 and when it calls a function, it just 1238 01:05:13,700 --> 01:05:19,790 places that function's frame at the bottom of its stack.
1239 01:05:19,790 --> 01:05:23,360 You can also spawn things, so then it places a spawn frame 1240 01:05:23,360 --> 01:05:25,575 at the bottom of its stack. 1241 01:05:25,575 --> 01:05:27,450 And then these things can happen in parallel, 1242 01:05:27,450 --> 01:05:29,918 so multiple processors can be spawning and calling 1243 01:05:29,918 --> 01:05:30,710 things in parallel. 1244 01:05:34,220 --> 01:05:38,330 And you can also return from a spawn or a call. 1245 01:05:38,330 --> 01:05:40,970 So here, I'm going to return from a call. 1246 01:05:40,970 --> 01:05:43,330 Then I return from a spawn. 1247 01:05:43,330 --> 01:05:44,870 And at this point, I don't actually 1248 01:05:44,870 --> 01:05:48,440 have anything left to do for the second processor. 1249 01:05:48,440 --> 01:05:52,340 So what do I do now, when I'm left with nothing to do? 1250 01:05:55,060 --> 01:05:55,951 Yes? 1251 01:05:55,951 --> 01:05:59,720 AUDIENCE: Take a [INAUDIBLE]. 1252 01:05:59,720 --> 01:06:01,200 JULIAN SHUN: Yeah, so the idea here 1253 01:06:01,200 --> 01:06:05,640 is to steal some work from another processor. 1254 01:06:05,640 --> 01:06:08,080 So when a worker runs out of work to do, 1255 01:06:08,080 --> 01:06:11,640 it's going to steal from the top of a random victim's deque. 1256 01:06:11,640 --> 01:06:13,990 So it's going to pick one of these processors at random. 1257 01:06:13,990 --> 01:06:19,140 It's going to roll some dice to determine who to steal from. 1258 01:06:19,140 --> 01:06:23,670 And let's say that it picked the third processor. 1259 01:06:23,670 --> 01:06:26,370 Now it's going to take all of the stuff 1260 01:06:26,370 --> 01:06:29,010 at the top of the deque up until the next spawn 1261 01:06:29,010 --> 01:06:32,160 and place it into its own deque. 1262 01:06:32,160 --> 01:06:33,960 And then now it has stuff to do again. 1263 01:06:33,960 --> 01:06:36,900 So now it can continue executing this code. 1264 01:06:36,900 --> 01:06:42,190 It can spawn stuff, call stuff, and so on. 1265 01:06:42,190 --> 01:06:45,600 So the idea is that whenever a worker runs out of work to do, 1266 01:06:45,600 --> 01:06:47,430 it's going to start stealing some work 1267 01:06:47,430 --> 01:06:48,960 from other processors. 1268 01:06:48,960 --> 01:06:52,710 But if it always has enough work to do, then it's happy, 1269 01:06:52,710 --> 01:06:56,760 and it doesn't need to steal things from other processors. 1270 01:06:56,760 --> 01:06:59,310 And this is why MIT gives us so much work to do, 1271 01:06:59,310 --> 01:07:01,440 so we don't have to steal work from other people. 1272 01:07:04,090 --> 01:07:08,010 So a famous theorem says that with sufficient parallelism, 1273 01:07:08,010 --> 01:07:11,910 workers steal very infrequently, and this gives us 1274 01:07:11,910 --> 01:07:13,200 near-linear speedup. 1275 01:07:13,200 --> 01:07:16,230 So with sufficient parallelism, the first term 1276 01:07:16,230 --> 01:07:19,540 in our running time bound, T1 over P, is going to dominate the T infinity term, 1277 01:07:19,540 --> 01:07:21,430 and that gives us near-linear speedup. 1278 01:07:26,430 --> 01:07:32,070 Let me actually show you a pseudoproof of this theorem. 1279 01:07:32,070 --> 01:07:34,127 And I'm allowed to do a pseudoproof. 1280 01:07:34,127 --> 01:07:36,210 It's not actually a real proof, but a pseudoproof. 1281 01:07:36,210 --> 01:07:37,998 So I'm allowed to do this, because I'm not 1282 01:07:37,998 --> 01:07:39,540 the author of an algorithms textbook. 1283 01:07:42,060 --> 01:07:43,753 So here's a pseudoproof.
1284 01:07:43,753 --> 01:07:44,500 AUDIENCE: Yet. 1285 01:07:44,500 --> 01:07:45,208 JULIAN SHUN: Yet. 1286 01:07:48,330 --> 01:07:53,170 So a processor is either working or stealing at every time step. 1287 01:07:53,170 --> 01:07:56,310 And the total time that all processors spend working 1288 01:07:56,310 --> 01:08:01,240 is just T1, because that's the total work that you have to do. 1289 01:08:01,240 --> 01:08:03,940 And then when it's not doing work, it's stealing. 1290 01:08:03,940 --> 01:08:06,870 And each steal has a 1 over P chance 1291 01:08:06,870 --> 01:08:09,780 of reducing the span by 1, because one of the processors 1292 01:08:09,780 --> 01:08:14,187 is contributing to the longest path in the computation dag. 1293 01:08:14,187 --> 01:08:15,770 And there's a 1 over P chance that I'm 1294 01:08:15,770 --> 01:08:17,609 going to pick that processor and steal 1295 01:08:17,609 --> 01:08:19,590 some work from that processor and reduce 1296 01:08:19,590 --> 01:08:23,550 the span of my remaining computation by 1. 1297 01:08:23,550 --> 01:08:26,040 And therefore, the expected cost of all steals 1298 01:08:26,040 --> 01:08:28,439 is going to be order P times T infinity, 1299 01:08:28,439 --> 01:08:31,260 because I have to steal P things in expectation before I 1300 01:08:31,260 --> 01:08:37,740 get to the processor that has the critical path. 1301 01:08:37,740 --> 01:08:42,840 And therefore, my overall cost for stealing is order P times T 1302 01:08:42,840 --> 01:08:46,370 infinity, because I'm going to do this T infinity times. 1303 01:08:46,370 --> 01:08:48,810 And since there are P processors, 1304 01:08:48,810 --> 01:08:52,200 I'm going to divide the expected time by P, 1305 01:08:52,200 --> 01:08:57,915 so T1 plus O of P times T infinity divided by P, 1306 01:08:57,915 --> 01:08:59,540 and that's going to give me the bound-- 1307 01:08:59,540 --> 01:09:03,670 T1 over P plus order T infinity. 1308 01:09:03,670 --> 01:09:08,490 So this pseudoproof here ignores issues with independence, 1309 01:09:08,490 --> 01:09:10,140 but it still gives you an intuition 1310 01:09:10,140 --> 01:09:14,490 of why we get this expected running time. 1311 01:09:14,490 --> 01:09:16,407 If you want to actually see the full proof, 1312 01:09:16,407 --> 01:09:17,740 it's actually quite interesting. 1313 01:09:17,740 --> 01:09:21,910 It uses random variables and tail bounds of distributions. 1314 01:09:21,910 --> 01:09:24,805 And this is the paper that has this. 1315 01:09:24,805 --> 01:09:28,115 This is by Blumofe and Charles Leiserson. 1316 01:09:34,189 --> 01:09:36,859 So another thing I want to talk about 1317 01:09:36,859 --> 01:09:40,540 is that Cilk supports C's rules for pointers. 1318 01:09:40,540 --> 01:09:43,970 So a pointer to a stack space can be passed from a parent 1319 01:09:43,970 --> 01:09:47,450 to a child, but not from a child to a parent. 1320 01:09:47,450 --> 01:09:51,590 And this is the same as the stack rule for sequential C 1321 01:09:51,590 --> 01:09:53,170 programs. 1322 01:09:53,170 --> 01:09:56,910 So let's say I have this computation on the left here. 1323 01:09:56,910 --> 01:10:00,440 So A is going to spawn off B, and then it's 1324 01:10:00,440 --> 01:10:03,170 going to continue executing C. And then C 1325 01:10:03,170 --> 01:10:07,160 is going to spawn off D and execute E. 1326 01:10:07,160 --> 01:10:10,400 So we see on the right hand side the views of the stacks 1327 01:10:10,400 --> 01:10:12,800 for each of the tasks here.
1328 01:10:12,800 --> 01:10:15,110 So A sees its own stack. 1329 01:10:15,110 --> 01:10:17,780 B sees its own stack, but it also 1330 01:10:17,780 --> 01:10:20,990 sees A's stack, because A is its parent. 1331 01:10:20,990 --> 01:10:23,450 C will see its own stack, but again, it 1332 01:10:23,450 --> 01:10:25,810 sees A's stack, because A is its parent. 1333 01:10:25,810 --> 01:10:28,940 And then finally, D and E, they see the stack of C, 1334 01:10:28,940 --> 01:10:30,380 and they also see the stack of A. 1335 01:10:30,380 --> 01:10:33,380 So in general, a task can see the stack 1336 01:10:33,380 --> 01:10:36,770 of all of its ancestors in this computation graph. 1337 01:10:40,190 --> 01:10:43,010 And we call this a cactus stack, because it 1338 01:10:43,010 --> 01:10:47,630 sort of looks like a cactus, if you draw this upside down. 1339 01:10:47,630 --> 01:10:50,180 And Cilk's cactus stack supports multiple views 1340 01:10:50,180 --> 01:10:51,800 of the stacks in parallel, and this 1341 01:10:51,800 --> 01:10:59,010 is what makes parallel function calls work in Cilk. 1342 01:10:59,010 --> 01:11:04,200 We can also bound the stack space used by a Cilk program. 1343 01:11:04,200 --> 01:11:07,410 So let's let S sub 1 be the stack space required 1344 01:11:07,410 --> 01:11:11,760 by the serial execution of a Cilk program. 1345 01:11:11,760 --> 01:11:15,420 Then the stack space required by a P-processor execution 1346 01:11:15,420 --> 01:11:19,050 is going to be bounded by P times S1. 1347 01:11:19,050 --> 01:11:21,060 So SP is the stack space required 1348 01:11:21,060 --> 01:11:23,370 by a P-processor execution. 1349 01:11:23,370 --> 01:11:27,900 That's less than or equal to P times S1. 1350 01:11:27,900 --> 01:11:30,900 Here's a high-level proof of why this is true. 1351 01:11:30,900 --> 01:11:33,480 So it turns out that the work-stealing algorithm in Cilk 1352 01:11:33,480 --> 01:11:36,990 maintains what's called the busy leaves property. 1353 01:11:36,990 --> 01:11:41,670 And this says that each of the existing leaves that are still 1354 01:11:41,670 --> 01:11:44,780 active in the computation dag has a worker 1355 01:11:44,780 --> 01:11:47,280 executing on it. 1356 01:11:47,280 --> 01:11:50,910 So in this example here, the vertices 1357 01:11:50,910 --> 01:11:52,330 shaded in blue and purple-- 1358 01:11:52,330 --> 01:11:55,830 these are the ones that are in my remaining computation dag. 1359 01:11:55,830 --> 01:11:59,650 And all of the gray nodes have already been finished. 1360 01:11:59,650 --> 01:12:01,380 And here-- for each of the leaves 1361 01:12:01,380 --> 01:12:05,130 here, I have one processor on that leaf executing 1362 01:12:05,130 --> 01:12:06,450 the task associated with it. 1363 01:12:06,450 --> 01:12:08,970 So Cilk guarantees this busy leaves property. 1364 01:12:11,650 --> 01:12:14,040 And now, for each of these processors, 1365 01:12:14,040 --> 01:12:15,840 the amount of stack space it needs 1366 01:12:15,840 --> 01:12:18,420 is the stack space for its own task 1367 01:12:18,420 --> 01:12:22,420 plus everything above it in this computation dag. 1368 01:12:22,420 --> 01:12:25,170 And we can actually bound that by the stack space needed 1369 01:12:25,170 --> 01:12:30,360 by a single processor execution of the Cilk program, S1, 1370 01:12:30,360 --> 01:12:33,690 because S1 is just the maximum stack space we need, 1371 01:12:33,690 --> 01:12:39,900 which is basically the longest path in this graph.
1372 01:12:39,900 --> 01:12:41,640 And we do this for every processor. 1373 01:12:41,640 --> 01:12:45,000 So therefore, the upper bound on the stack space 1374 01:12:45,000 --> 01:12:49,560 required by a P-processor execution is just P times S1. 1375 01:12:49,560 --> 01:12:51,960 And in general, this is quite a loose upper bound, 1376 01:12:51,960 --> 01:12:54,420 because you're not necessarily going 1377 01:12:54,420 --> 01:12:58,380 all the way down in this computation dag 1378 01:12:58,380 --> 01:13:01,140 every time. 1379 01:13:01,140 --> 01:13:05,320 Usually you'll be much higher in this computation dag. 1380 01:13:05,320 --> 01:13:06,060 So any questions? 1381 01:13:06,060 --> 01:13:06,560 Yes? 1382 01:13:06,560 --> 01:13:09,810 AUDIENCE: In practice, how much work is stolen? 1383 01:13:09,810 --> 01:13:13,643 JULIAN SHUN: In practice, if you have enough parallelism, then 1384 01:13:13,643 --> 01:13:15,060 you're not actually going to steal 1385 01:13:15,060 --> 01:13:17,560 that much in your algorithm. 1386 01:13:17,560 --> 01:13:20,520 So if you guarantee that there's a lot of parallelism, 1387 01:13:20,520 --> 01:13:24,690 then each processor is going to have a lot of its own work 1388 01:13:24,690 --> 01:13:28,650 to do, and it doesn't need to steal very frequently. 1389 01:13:28,650 --> 01:13:31,597 But if your parallelism is very low 1390 01:13:31,597 --> 01:13:33,180 compared to the number of processors-- 1391 01:13:33,180 --> 01:13:34,980 if it's equal to the number of processors, 1392 01:13:34,980 --> 01:13:37,590 then you're going to spend a significant amount of time 1393 01:13:37,590 --> 01:13:41,750 stealing, and the overheads of the work-stealing algorithm 1394 01:13:41,750 --> 01:13:43,500 are going to show up in your running time. 1395 01:13:43,500 --> 01:13:45,690 AUDIENCE: So I meant in one steal-- 1396 01:13:45,690 --> 01:13:48,250 like do you take half of the deque, 1397 01:13:48,250 --> 01:13:50,035 or do you take one element of the deque? 1398 01:13:50,035 --> 01:13:52,410 JULIAN SHUN: So the standard Cilk work-stealing scheduler 1399 01:13:52,410 --> 01:13:55,800 takes everything at the top of the deque up 1400 01:13:55,800 --> 01:13:57,120 until the next spawn. 1401 01:13:57,120 --> 01:13:58,950 So basically that's a strand. 1402 01:13:58,950 --> 01:13:59,847 So it takes that. 1403 01:13:59,847 --> 01:14:01,680 There are variants that take more than that, 1404 01:14:01,680 --> 01:14:03,310 but the Cilk work-stealing scheduler 1405 01:14:03,310 --> 01:14:04,770 that we'll be using in this class 1406 01:14:04,770 --> 01:14:06,510 just takes the top strand. 1407 01:14:09,510 --> 01:14:11,010 Any other questions? 1408 01:14:13,720 --> 01:14:16,508 So that's actually all I have for today. 1409 01:14:16,508 --> 01:14:18,050 If you have any additional questions, 1410 01:14:18,050 --> 01:14:20,470 you can come talk to us after class. 1411 01:14:20,470 --> 01:14:25,170 And remember to meet with your MITPOSSE mentors soon.