The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So it's my pleasure to introduce Professor Saman Amarasinghe as our guest lecturer today. Saman Amarasinghe is a professor in the EECS Department at MIT, and he's also the associate department head. He's an expert in compilers, domain-specific languages, and autotuning. In fact, he was the designer of the OpenTuner framework that you've been using for your homework assignments. So today Saman is going to tell us about some of his recent work on domain-specific languages and also on autotuning. So let's give Saman Amarasinghe a round of applause.

[APPLAUSE]

SAMAN AMARASINGHE: Thank you. OK, so I used to teach this class for many, many years. Unfortunately, now I am an administrator, so I don't-- Julian and Charles get to have the fun of teaching the class.
So hopefully you guys enjoyed your-- now you're starting-- you are done with projects one, two, and three and going into project four? Yeah, project four is really fun and--

[LAUGHTER]

It will look big and daunting, but at the end you'll enjoy it, especially with all the time people spend working on it. So I think I'm making you scared here, more than anything. OK, so let's get into the talk today. I will talk to you about domain-specific languages and a little bit about autotuning, and how this leads to autotuning. So why domain-specific languages? We are all used to the general-purpose languages that we use all day. Those languages are set up to capture a very large part of what people might want to do in programming. However, a lot of times there are specific areas, specific domains-- either some area you want to implement, or certain patterns of code you want to implement-- that have a lot of interesting properties that are very hard to describe in a general-purpose language.
And a lot of times it's very hard, especially from the compiler's point of view, to take advantage of those properties, because the compiler has to work for everybody. So domain-specific languages have a lot of built-in benefits. If you know that what you're building has a certain shape, a certain set of properties, and the language captures that, then building on top of it can be much easier. It should have a lot of clarity. It's very easy to maintain that kind of thing. It's very easy to test. And also, it's very easy to understand, because the domain is very clearly described. You can build a library, but somebody can go and do weird things in a library. If it is built into the language, it's set in stone. You can't go and say, oh yeah, I'm going to change something here, let me do some weird thing here. It's built into the language, so it stays there. It makes it much easier for programmers [INAUDIBLE].
But from my point of view, the domain-specific languages I really like are the ones where I know I can take advantage of the knowledge of domain experts to get really good performance. A lot of times, a domain expert says, aha, in this domain I can do-- OK, there's some linear algebra, but I know a kind of algebra I can use to simplify the expression. That algebra might only work in that domain. It's very hard to put that kind of complex algebra into C++ or C. But in that domain, I can say, ha, I can call on it, so you can write any expression and I can simplify it. And also, there are a lot of idioms in each domain. Some domain might say, OK, look, I am going to represent a graph that I'm going to talk about. In normal C++, you create a bunch of classes, you do these very complicated things, and the idiom is hidden in there. First of all, C++ doesn't know that it has to look for graphs. But even if it did look for graphs, you can write graphs in hundreds of millions of ways. But if graphs have first-class support in the language, I don't have to work heroically to extract that. It's there. I can easily see it.
So most of my compiler can be doing useful things in there. And most of the time, the other thing is, if you build a domain-specific language right, you can leave the complex, lower-level decisions to the compiler. In C++, you might be tempted to say, eh, I know some optimization, let me do something here, let me do some of the optimizations here. I have been working on optimization all my life. And a lot of times, when you write a compiler optimization pass, you spend half, or more than half, of your time undoing the crazy optimizations the programmer did, like you guys are learning. You think you know better, so you go do something. And that might work well then, but believe me, that code survives 20 years later. And 20 years later, it looks like a really stupid thing to do. And then you look at it and say, OK, now I have to undo everything in the compiler to do the right thing on the current architecture. Because of that, if you capture the right level, I will let the compiler do the work here.
And then as architectures keep maturing, as the problems keep changing, I don't have to worry. I don't have to undo these parts. So again, I'm coming to the performance engineering class and telling you guys, leave the performance to the compiler. But that's the nice thing: if the compiler can do most of your work, it's a much nicer job. So don't doubt the compiler.

I'm going to talk about three parts here. There are three different systems: two domain-specific languages, GraphIt and Halide, and then OpenTuner, which is not just a language but a framework. And between GraphIt and Halide, you will see some patterns. Then we'll see whether you found the pattern that we are working on here.

So GraphIt. This is a project that I worked on with Julian. So if you have any questions about GraphIt after today, you can definitely go ask Julian. He knows probably more about graphs and GraphIt-- more about graphs than probably anybody on this planet. So he's a good resource to talk to about graphs. So talking about graphs: graphs are everywhere.
So if you go to something like Google and do a search, Google has represented the entire knowledge on the internet as a big graph. They have done a huge amount of graph processing behind the scenes; that is what guides your search. Or if you go to maps, or something like Uber, it will find your directions. The entire road network is a graph, and it's trying to find things like the shortest path in this graph to give you the route. And if you go to a recommendation engine to get a recommendation for a movie-- if you get a really cool movie you like, that's because there's a huge graph connecting everybody to the movies they've watched and their likings of those movies. And they are looking at that, comparing you to them, and recommending. That all can be viewed as graphs. And even if you go to an ATM and try to do a transaction, there's a very fast graph analysis in the back to decide: is this a fraudulent transaction or not?
Most of the transactions people have done, all the connectivity, is back there. Before the money actually pops out of the ATM machine, it has done a bunch of graph processing to understand, OK, this seems like a good transaction, so I will actually give you the money. Sometimes you get that other message; that means the graph processing decided there might be some weird thing going on. So some of these things, like maps and these transactions, have very tight latency requirements. You have to get this done right. You have to get good directions. Especially if you take a wrong turn, you need to get the next set of directions very fast, before you go hit some bad, weird Boston traffic. So these things have to work fast. And other things, like recommendations and Google Search, use a huge graph. They build the entire web, and then all the recommendations have to do a huge amount of processing. So performance matters a lot in these applications.
So let me dive down a little bit deeper to show what graphs mean, what graph processing means. One of the very well-known graph algorithms is called PageRank. Anybody know [INAUDIBLE] Page? How many have heard of PageRank? OK, what does Page stand for in PageRank?

AUDIENCE: Larry Page.

SAMAN AMARASINGHE: Larry Page. So the first algorithm Google did-- I don't think this is anywhere near Google at this point-- was this algorithm, PageRank. It ranked these pages, but it was developed by Larry Page. So it depends-- either page means web pages, or it's Larry Page; we don't know. But people think PageRank is Larry Page. So you have a graph here. What this graph algorithm does is run some number of iterations, either up to max_iter or to some convergence. What it first does is go around, look at all its neighbors, and calculate, basically, a new rank out of all my neighbors. That means: how good are my neighbors? What's their rank? And what's their contribution to me?
So being known to a good person, having a connection to something very well known-- in this case, a super web page-- means I am highly ranked. I am more influential, because I'm closer to something important. So what it does is, basically, each node calculates some value and propagates it to all the neighbors, aggregating-- the entire graph participates in that. And then each node goes about calculating its new rank: looking at the old rank, it gets modified a little bit toward a new rank. And then they swap old ranks and new ranks. So these are the two computations, and you iterate over them. And you have to do it for the entire graph. So, of course, you can run this, but it will run very, very slowly. So if you want to get performance, you write this piece of code. This piece of code, basically, is huge. And it runs 23 times faster than the simple version on the previous slide, on a 12-core machine. It's basically multithreaded, so we get parallel performance.
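The two-step PageRank iteration just described-- each node gathers contributions from its neighbors to form a new rank, then old and new ranks are swapped-- can be sketched in plain Python. This is a minimal serial sketch, not the optimized code from the lecture; the damping factor of 0.85 and the convergence tolerance are conventional, illustrative choices.

```python
# A minimal, serial sketch of the PageRank iteration described above.
# Assumptions: the graph is a dict {node: [out-neighbors]}; damping=0.85
# and the tolerance are conventional choices, not values from the lecture.

def pagerank(graph, max_iter=100, damping=0.85, tol=1e-6):
    nodes = list(graph)
    n = len(nodes)
    old_rank = {v: 1.0 / n for v in nodes}          # start from a uniform rank
    out_degree = {v: len(graph[v]) for v in nodes}

    for _ in range(max_iter):
        # Step 1: each node's contribution (rank / out-degree) is
        # aggregated into the new rank of each of its neighbors.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            if out_degree[u]:
                share = old_rank[u] / out_degree[u]
                for v in graph[u]:
                    new_rank[v] += damping * share

        # Step 2: swap old and new ranks; stop early once converged.
        done = sum(abs(new_rank[v] - old_rank[v]) for v in nodes) < tol
        old_rank = new_rank
        if done:
            break
    return old_rank
```

On a symmetric three-node cycle, every node ends up with rank 1/3. The hand-optimized code the lecture refers to layers multithreading, load balancing, and cache optimizations on top of this same two-step loop.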
It is load balanced because, as you know, graphs are very unbalanced, so you get load balancing. If you have non-uniform memory access machines, things like multiple-socket machines, it will take advantage of that. It takes advantage of caches-- a lot of things are happening in this piece of code. But, of course, you know it is hard to write this piece of code. And worse, you might not know what to do, what the right optimization is-- you might only find out by iterating. You might try many things. And this is very hard: every time you change something, if you say, ah, I want to do something a little bit different, I have to write a very complicated piece of code, get it all right, get everything working, before I can test it. So this is why we can use a DSL for this one.

So let me talk a little bit about graph algorithms-- this seems like a new set of [INAUDIBLE]. What do people do with graphs? When people say graph algorithms, I'm going to go a little bit deeper to show you what type of things these graph computations represent. There's one class of graph algorithms called topology-driven algorithms.
That means the entire graph participates in the computation. For example, Google Search: before you do a Google search, it will do the entire basic collection of all the web links. It will build this huge graph and do a huge amount of processing to basically be able to do the search. Or a recommendation engine: every few weeks, or whatever it is, it will collect everybody's recommendations, have this huge data set, and process it to drive the recommendation engine. So this applies to the entire graph, and sometimes billions or trillions of nodes have to go into this computation.

Another set of algorithms is called data-driven algorithms. What that means is you start with certain nodes, and then you keep going to their neighbors, and their neighbors' neighbors, processing data as you go. The kinds of algorithms that fit in this category are things like maps: if I have to find the shortest path between, probably, two points-- if I want directions from here to Boston, I don't have to go through the nodes in New York.
I just have to go through my neighbors, the nodes connected to me. So I am basically operating on a certain area, with some connections, and processing that. These are data-driven algorithms. I might have a huge graph, but my computation might only work on a small region or a small part of the graph.

So when you are traversing through a graph, there are multiple ways of doing graph traversals. And this is why optimization is hard: there are many different ways of doing things, and each has a different set of outcomes. In a lot of graph algorithms, I need to get something from my neighbors. One way to get something from my neighbors is I can calculate what the neighbors-- all my neighbors-- might want, and give it to all the neighbors. Or I can go change all the neighbors to update my value. So why do you think-- OK, you have done some programming here. What do you think about this? Is this a good way? So if I want to update everybody, I will calculate what I will do, and I'll go change everybody, all my neighbors.
AUDIENCE: This is not as parallel as it could be.

SAMAN AMARASINGHE: Not as parallel as it could be. I think you are getting to a point. But why is it not that parallel?

AUDIENCE: Well, if you're doing the same thing with the neighbors, you might as well just tell your neighbors to do the work for you.

SAMAN AMARASINGHE: Yeah, but-- that's a very good point. So if I'm the only one doing it, in a data-driven way, that's not good. But if everybody is doing that to their neighbors, then I have parallelism. Everybody is updating their neighbors. So now there's another problem showing up. What's the problem if everybody tries to update their neighbors? Back there.

AUDIENCE: There's a determinacy race.

SAMAN AMARASINGHE: There's a race in there, because everybody's writing in there. So if you want to get this actually right, you have a bunch of issues. You basically want to do atomic updates, because you need to lock that thing.
So it has to get atomically updated. And this is nice because I don't have to traverse anything-- everybody I need to update, I actually go and update. That's a nice way to do it, especially if it is not a global thing: if I'm propagating, I will update my neighbors, and I can propagate that down. This is the push schedule.

Another way to do it is a pull schedule. That means everybody asks their neighbors: OK, what do you have? Give it to me. And I collect everything from my neighbors, and I update myself. So is there a race condition now? How many people say there is a race? How many people think there's no race? What happens is I'm reading from all the neighbors. Everybody is reading from their neighbors, but I am only updating myself. Because of that, I'm the only one writing me, so I don't have a race. It is really nice that you don't have a race. But if I'm doing a data-driven computation, I might not know that I need to get updated, because the update comes from the other node.
And that means I might be asking you, do you have anything to send? And you might say no. So in that sense, I might basically be doing a lot of extra computation beyond what's necessary, because I might not know that there's data I need to get-- I have to ask you whether I should do this. But I don't have any need to do any synchronization.

Another interesting thing is I can take this graph and basically partition it. And once I partition the graph, I can say, OK, this core gets this part of the graph, this core gets that part. Or this processor node gets this part. What's the advantage of partitioning a graph? Why do I want to partition a large graph into small pieces? Of course, you have to do a good partition; you can't do an arbitrary partition. So what happens if I do a good partitioning? I won't say the word, because then the answer comes right out. OK, let me see if anybody else-- you have answered-- anybody else want to answer? Come on. You have to-- [INAUDIBLE].
What happens if I take the graph, find two different groups, separate them, and give this one to one processor and that one to another? What do I get?

AUDIENCE: You get some parallelism.

SAMAN AMARASINGHE: I get parallelism, also. But the other thing: if I have a lot of connected things going to-- these connected things going to that processor-- what else can I get? Locality. Have you heard? Did you do locality in the class? So the partition means the thing I'm working on, I am only working on a small amount. And that might, if I'm lucky, fit in my cache. That would be very nice, rather than everybody having to go to every node. So if I partition this properly, I will get good locality. It's actually written there-- whoops, my answer was in there: improved locality. But, of course, now I might have a little bit of extra overhead, because I might have to replicate some nodes, stuff like that, because they're on both sides.
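The push and pull traversal schedules from the discussion above can be sketched side by side. This is a toy sketch under stated assumptions: the graph is stored as adjacency-list dicts (out-edges for push, in-edges for pull), the per-edge work is simply "add the source's value into the destination", and a per-destination lock stands in for an atomic add. The function names are illustrative, not GraphIt's API.

```python
# Toy sketch of push vs. pull schedules over one round of "add my value
# into my neighbors". Assumptions: adjacency-list dicts, and locks as a
# stand-in for atomic adds; names here are illustrative only.
import threading
from concurrent.futures import ThreadPoolExecutor

def push_step(out_neighbors, values):
    """Push: every node writes its contribution into each out-neighbor.
    Different nodes can race on the same destination, so each write is
    guarded by that destination's lock."""
    new_values = dict(values)
    locks = {v: threading.Lock() for v in values}

    def push_from(u):
        for v in out_neighbors[u]:
            with locks[v]:                     # serialize racing writers
                new_values[v] += values[u]

    with ThreadPoolExecutor() as pool:         # all nodes push in parallel
        list(pool.map(push_from, out_neighbors))
    return new_values

def pull_step(in_neighbors, values):
    """Pull: every node reads from its in-neighbors and writes only its
    own entry, so there are no racing writes and no locks needed -- at
    the cost of polling neighbors that may have nothing to send."""
    return {v: values[v] + sum(values[u] for u in in_neighbors[v])
            for v in values}
```

On the same graph, one push step and one pull step produce identical results. Push needs the synchronization because many nodes may write the same destination; pull is race-free but must ask every in-neighbor, even ones with nothing to send.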
428 00:18:27,980 --> 00:18:31,250 So another interesting property of graphs 429 00:18:31,250 --> 00:18:33,620 is, when you look at the data structures until now, 430 00:18:33,620 --> 00:18:35,795 things like arrays, the size matters. 431 00:18:35,795 --> 00:18:37,670 The size determines whether the array fits in the cache, 432 00:18:37,670 --> 00:18:38,700 and stuff like that. 433 00:18:38,700 --> 00:18:40,940 Graphs, there are some other properties 434 00:18:40,940 --> 00:18:44,270 of the graphs in here. 435 00:18:44,270 --> 00:18:48,030 So if you go to social networks-- 436 00:18:48,030 --> 00:18:50,030 a social network is a graph-- 437 00:18:50,030 --> 00:18:52,700 what's the interesting property in social networks 438 00:18:52,700 --> 00:18:53,510 you have observed? 439 00:18:58,180 --> 00:18:59,180 AUDIENCE: Connectedness. 440 00:18:59,180 --> 00:19:00,240 SAMAN AMARASIGNHE: Connectedness-- 441 00:19:00,240 --> 00:19:02,010 there are people like me that probably 442 00:19:02,010 --> 00:19:04,350 have 20 friends in there and have a very 443 00:19:04,350 --> 00:19:05,590 small number of connections. 444 00:19:05,590 --> 00:19:07,590 And then there are celebrities who have millions 445 00:19:07,590 --> 00:19:09,120 of connections in here. 446 00:19:09,120 --> 00:19:11,790 So the interesting thing is, if you look at a social network 447 00:19:11,790 --> 00:19:16,260 graph, you have this relationship called a power law 448 00:19:16,260 --> 00:19:17,520 relationship. 449 00:19:17,520 --> 00:19:19,880 That means there's an exponential curve. 450 00:19:19,880 --> 00:19:24,960 There are some people here, like very well-known celebrities, 451 00:19:24,960 --> 00:19:28,410 that might have millions and millions of users in here-- 452 00:19:28,410 --> 00:19:30,390 connections and neighbors, or likes, 453 00:19:30,390 --> 00:19:32,252 or whatever it is in that node. 
454 00:19:32,252 --> 00:19:33,960 And there are people like me sitting here 455 00:19:33,960 --> 00:19:36,420 that have very few people connected 456 00:19:36,420 --> 00:19:37,540 to the rest of the world. 457 00:19:37,540 --> 00:19:40,200 So this is normally-- people have observed that in these big social 458 00:19:40,200 --> 00:19:41,208 network type graphs, 459 00:19:41,208 --> 00:19:43,125 you have this kind of exponential relationship 460 00:19:43,125 --> 00:19:44,490 in here. 461 00:19:44,490 --> 00:19:47,430 So the web has an exponential relationship. 462 00:19:47,430 --> 00:19:50,220 A social network has this kind of relationship in there. 463 00:19:50,220 --> 00:19:52,950 So for those things, you have to do very interesting things 464 00:19:52,950 --> 00:19:54,407 when you process these graphs. 465 00:19:54,407 --> 00:19:56,490 Because there are certain connections that matter, 466 00:19:56,490 --> 00:19:58,560 certain nodes that matter a lot, or have 467 00:19:58,560 --> 00:19:59,850 a bigger impact than other nodes. 468 00:20:02,930 --> 00:20:05,040 Then there are other graphs that have 469 00:20:05,040 --> 00:20:06,290 a bounded-degree distribution. 470 00:20:06,290 --> 00:20:10,070 If you have a road network, the maximum connection-- 471 00:20:10,070 --> 00:20:12,080 probably you might have an intersection 472 00:20:12,080 --> 00:20:14,490 that has six roads coming together in there. 473 00:20:14,490 --> 00:20:16,920 You don't have a million roads connecting into one place 474 00:20:16,920 --> 00:20:17,670 anywhere in there. 475 00:20:17,670 --> 00:20:18,750 So that doesn't happen. 476 00:20:18,750 --> 00:20:21,320 So these are a lot flatter, a lot more 477 00:20:21,320 --> 00:20:24,107 bounded-degree distribution graphs in here. 478 00:20:24,107 --> 00:20:25,940 They have lots of excellent locality in here 479 00:20:25,940 --> 00:20:29,450 because, of course, all the roads in Cambridge 480 00:20:29,450 --> 00:20:30,290 might be connected. 
481 00:20:30,290 --> 00:20:32,420 But roads in Cambridge can be separated 482 00:20:32,420 --> 00:20:34,173 from roads in New York City. 483 00:20:34,173 --> 00:20:35,340 So there they are separated. 484 00:20:35,340 --> 00:20:36,760 There is nice locality 485 00:20:36,760 --> 00:20:37,890 in these kinds of graphs. 486 00:20:37,890 --> 00:20:41,120 So even if the graphs are 487 00:20:41,120 --> 00:20:43,040 the same size, the shape of the graph 488 00:20:43,040 --> 00:20:44,720 matters in the computation, a lot of times. 489 00:20:47,820 --> 00:20:50,360 So what happens is, now, when you want 490 00:20:50,360 --> 00:20:51,880 to operate on these graphs, you have 491 00:20:51,880 --> 00:20:55,500 to look at three interesting properties. 492 00:20:55,500 --> 00:20:58,010 One property is, OK, how much parallelism 493 00:20:58,010 --> 00:21:00,080 is my algorithm-- what I'm trying to do to this graph-- 494 00:21:00,080 --> 00:21:02,270 going to get? 495 00:21:02,270 --> 00:21:03,760 It's like a Goldilocks type thing. 496 00:21:03,760 --> 00:21:05,540 You don't want too much parallelism. 497 00:21:05,540 --> 00:21:07,530 If you say, I have an algorithm with a huge amount 498 00:21:07,530 --> 00:21:10,517 of parallelism, but I can't take advantage of it, it's not useful. 499 00:21:10,517 --> 00:21:12,350 So you need to get parallelism good enough 500 00:21:12,350 --> 00:21:14,570 that I can actually use it. 501 00:21:14,570 --> 00:21:17,240 Then I really like to have locality. 502 00:21:17,240 --> 00:21:21,260 Because if I have locality, my caches will work. 503 00:21:21,260 --> 00:21:22,460 Everything will be nearby. 504 00:21:22,460 --> 00:21:23,772 I can run things fast. 505 00:21:23,772 --> 00:21:26,230 If, every time, I have to get something from main memory, 506 00:21:26,230 --> 00:21:27,460 it can be very, very slow. 507 00:21:27,460 --> 00:21:29,630 So I want to get locality. 
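Going back to the two graph shapes above, the power-law versus bounded-degree distinction is easy to see by just computing degrees. A toy Python sketch with two made-up graphs (a star standing in for a celebrity-centered social network, a path standing in for a road network):

```python
from collections import Counter

def degrees(edges, num_nodes):
    """Degree of every node in an undirected edge list."""
    d = Counter()
    for u, v in edges:
        d[u] += 1
        d[v] += 1
    return [d[v] for v in range(num_nodes)]

# "Social network"-like star: one celebrity node connected to everyone.
star = [(0, v) for v in range(1, 8)]
# "Road network"-like path: degree is bounded (at most 2 here).
path = [(v, v + 1) for v in range(7)]

star_deg = degrees(star, 8)  # node 0 has degree 7, everyone else degree 1
path_deg = degrees(path, 8)  # every node has degree at most 2
```

In the star, one node dominates and must be handled specially; in the path, the flat degree distribution means every node costs about the same, which is why the two shapes want different schedules.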
508 00:21:29,630 --> 00:21:31,250 But the interesting thing about graphs 509 00:21:31,250 --> 00:21:34,550 is, to get locality and get some of these, 510 00:21:34,550 --> 00:21:37,050 you might have to do some extra work. 511 00:21:37,050 --> 00:21:39,350 So if you saw, when that graph got divided 512 00:21:39,350 --> 00:21:42,590 into two different graphs, I had to add extra nodes in here. 513 00:21:42,590 --> 00:21:44,340 I might create some extra data structures, 514 00:21:44,340 --> 00:21:45,800 do some extra computation. 515 00:21:45,800 --> 00:21:48,380 So I might have to do some extra work in here. 516 00:21:48,380 --> 00:21:52,190 So in certain things, I might not be that work efficient. 517 00:21:52,190 --> 00:21:54,870 So I might get really good parallelism and locality, 518 00:21:54,870 --> 00:21:56,280 but I am doing too much work. 519 00:21:56,280 --> 00:21:58,880 So, for example, 520 00:21:58,880 --> 00:22:04,375 assume I want to find one node's neighbors. 521 00:22:04,375 --> 00:22:05,750 A very easy way to get good parallelism: 522 00:22:05,750 --> 00:22:08,730 everybody finds their neighbors. 523 00:22:08,730 --> 00:22:09,995 OK, but that's not efficient. 524 00:22:09,995 --> 00:22:11,870 I mean, most of the computation's not useful. 525 00:22:11,870 --> 00:22:13,880 So there, you can do things where you are doing 526 00:22:13,880 --> 00:22:15,590 more work than necessary. 527 00:22:15,590 --> 00:22:17,930 That can make other things much faster. 528 00:22:17,930 --> 00:22:20,870 But you have to be careful in doing that. 529 00:22:20,870 --> 00:22:22,850 So you have this balance in there. 530 00:22:22,850 --> 00:22:26,720 So certain algorithms will fit in different places 531 00:22:26,720 --> 00:22:27,690 in this tradeoff space. 532 00:22:27,690 --> 00:22:29,565 So a push algorithm will fit in here. 
533 00:22:29,565 --> 00:22:31,190 So, for example, if you go to something 534 00:22:31,190 --> 00:22:33,470 like a pull algorithm, what you might find 535 00:22:33,470 --> 00:22:36,560 is that you are less work efficient. 536 00:22:36,560 --> 00:22:38,550 Because you might do a little bit more work. 537 00:22:38,550 --> 00:22:41,760 But it might be better in locality and parallelism, 538 00:22:41,760 --> 00:22:44,030 because you don't have to do locks in here. 539 00:22:44,030 --> 00:22:46,310 And then you do something like partitioning. 540 00:22:46,310 --> 00:22:48,620 You get really good locality with partitioning. 541 00:22:48,620 --> 00:22:50,000 But you are doing extra work. 542 00:22:50,000 --> 00:22:51,500 And also, because of your partition, 543 00:22:51,500 --> 00:22:53,820 you might limit your parallelism in here. 544 00:22:53,820 --> 00:22:55,460 So you might have less parallelism, but you 545 00:22:55,460 --> 00:22:56,780 get really good locality. 546 00:22:56,780 --> 00:23:00,080 So all this is basically a large tradeoff space in here. 547 00:23:00,080 --> 00:23:02,510 And then when you keep adding more and more things 548 00:23:02,510 --> 00:23:06,870 you can do, it fits into this big tradeoff space. 549 00:23:06,870 --> 00:23:10,250 So how you decide where to go in the tradeoff space is a very 550 00:23:10,250 --> 00:23:11,930 important decision. 551 00:23:11,930 --> 00:23:13,450 So it depends on the graphs. 552 00:23:13,450 --> 00:23:16,220 If you have power law graphs, you might want to do something. 553 00:23:16,220 --> 00:23:21,170 If you have a more bounded-degree type of graph, 554 00:23:21,170 --> 00:23:22,742 you want to do something else. 555 00:23:22,742 --> 00:23:24,200 And with power law graphs, sometimes 556 00:23:24,200 --> 00:23:27,560 you might do something different for the highly connected nodes 557 00:23:27,560 --> 00:23:28,190 versus others. 558 00:23:28,190 --> 00:23:30,750 Or you might not even differentiate between them. 
559 00:23:30,750 --> 00:23:32,520 It depends on the algorithm. 560 00:23:32,520 --> 00:23:34,222 So if you are doing-- 561 00:23:34,222 --> 00:23:36,680 visiting all the nodes, whereas as a data-driven algorithm, 562 00:23:36,680 --> 00:23:38,490 you might do something different. 563 00:23:38,490 --> 00:23:41,370 It also depends on the hardware you're running. 564 00:23:41,370 --> 00:23:45,380 So, for example, if you are doing a Google search, 565 00:23:45,380 --> 00:23:46,970 basically indexing, you're running 566 00:23:46,970 --> 00:23:50,450 an algorithm that has to operate on the entire graph in here. 567 00:23:50,450 --> 00:23:52,910 And the graph is a power law graph in that. 568 00:23:52,910 --> 00:23:54,840 And you're running on a cluster. 569 00:23:54,840 --> 00:23:56,510 So the right thing might be something 570 00:23:56,510 --> 00:23:59,330 like a pull schedule with some partitioning and something 571 00:23:59,330 --> 00:24:01,010 like a vertex parallel, or some kind 572 00:24:01,010 --> 00:24:02,930 of a parallelism scheme in here might give you 573 00:24:02,930 --> 00:24:05,330 the best performance. 574 00:24:05,330 --> 00:24:06,830 But in the other side of the Google, 575 00:24:06,830 --> 00:24:08,300 if you're trying to do a map, and if you're 576 00:24:08,300 --> 00:24:10,010 trying to give you directions, you 577 00:24:10,010 --> 00:24:12,620 have a very different type of a graph. 578 00:24:12,620 --> 00:24:14,840 You are doing a data-driven algorithm in that graph. 579 00:24:14,840 --> 00:24:16,640 And you might be running on a single machine. 580 00:24:16,640 --> 00:24:18,057 Because you need to give direction 581 00:24:18,057 --> 00:24:20,148 fast for each individual time. 
582 00:24:20,148 --> 00:24:22,190 You might have a very different type of algorithm 583 00:24:22,190 --> 00:24:23,690 you want to run on this graph, the push 584 00:24:23,690 --> 00:24:26,660 algorithm in vertex parallel, perhaps, 585 00:24:26,660 --> 00:24:28,550 some combination in there. 586 00:24:28,550 --> 00:24:31,910 And, of course, if you get a bad algorithm or a bad 587 00:24:31,910 --> 00:24:34,610 way of doing it, you can be very bad. 588 00:24:34,610 --> 00:24:36,770 You can get hundreds or thousands of times slower 589 00:24:36,770 --> 00:24:38,370 than the best you can achieve. 590 00:24:38,370 --> 00:24:40,700 So it matters to find the right thing, 591 00:24:40,700 --> 00:24:42,230 the right way of doing things. 592 00:24:42,230 --> 00:24:44,520 So this is where GraphIt came in. 593 00:24:44,520 --> 00:24:46,760 GraphIt is a domain specific language, basically, 594 00:24:46,760 --> 00:24:47,810 that we developed. 595 00:24:47,810 --> 00:24:51,950 And one thing GraphIt did was we said, OK, look, 596 00:24:51,950 --> 00:24:54,380 the algorithm is mostly constant. 597 00:24:54,380 --> 00:24:57,410 But how you process it-- how you go about it-- 598 00:24:57,410 --> 00:24:58,800 is very different. 599 00:24:58,800 --> 00:25:01,070 So we want to separate these things. 600 00:25:01,070 --> 00:25:03,780 So the first thing we did was come up 601 00:25:03,780 --> 00:25:06,490 with the algorithm, which is what you want to compute. 602 00:25:06,490 --> 00:25:07,870 It's very high level. 603 00:25:07,870 --> 00:25:10,740 It doesn't tell you how we are computing that-- it just says, this 604 00:25:10,740 --> 00:25:11,860 is my algorithm. 605 00:25:11,860 --> 00:25:13,370 I aim to process these nodes. 606 00:25:13,370 --> 00:25:17,210 And this is the computation I want to do in there. 607 00:25:17,210 --> 00:25:20,760 And you separate it from an optimization schedule-- 608 00:25:20,760 --> 00:25:22,030 how to compute it. 
609 00:25:22,030 --> 00:25:24,340 So we'd say, OK, to do this algorithm, 610 00:25:24,340 --> 00:25:27,680 you have to do a push schedule, do this type of parallelism-- 611 00:25:27,680 --> 00:25:28,500 each separately. 612 00:25:28,500 --> 00:25:31,940 And the nice thing is that now, if the graph changed 613 00:25:31,940 --> 00:25:34,140 or if the machine changed, I can give you 614 00:25:34,140 --> 00:25:36,480 a different schedule in here. 615 00:25:36,480 --> 00:25:38,670 So let me show you some examples. 616 00:25:38,670 --> 00:25:41,130 First, look at the algorithm in here. 617 00:25:41,130 --> 00:25:45,190 So we show three different types of things you want to do. 618 00:25:45,190 --> 00:25:47,180 So you want to do the entire graph in here, 619 00:25:47,180 --> 00:25:48,180 or have it data-driven. 620 00:25:48,180 --> 00:25:53,040 Or I might want to just operate on the vertices in here. 621 00:25:53,040 --> 00:25:56,070 So for this one-- 622 00:25:56,070 --> 00:25:58,610 the language provides a very simple way of doing that. 623 00:25:58,610 --> 00:26:00,090 The language has this function saying, 624 00:26:00,090 --> 00:26:03,180 for the edges-- all the edges of the graph-- apply: 625 00:26:03,180 --> 00:26:04,440 you can give a function. 626 00:26:04,440 --> 00:26:08,530 The function takes, basically, the nodes and the edges 627 00:26:08,530 --> 00:26:10,080 to basically carry out 628 00:26:10,080 --> 00:26:12,163 this computation, a very simple way of doing that. 629 00:26:12,163 --> 00:26:13,530 So this is the representation. 630 00:26:13,530 --> 00:26:16,290 So the nice thing is the simplicity of programming now. 631 00:26:16,290 --> 00:26:20,670 If I write it in C, it will look like a big blob of ugly code. 632 00:26:20,670 --> 00:26:23,550 In the domain specific language, all you have to write is this-- 633 00:26:23,550 --> 00:26:25,350 it makes life very simple. 
634 00:26:25,350 --> 00:26:27,060 Or for a data-driven algorithm, 635 00:26:27,060 --> 00:26:30,060 I have to say, OK, I start with this set of vertices 636 00:26:30,060 --> 00:26:32,240 to compute in here. 637 00:26:32,240 --> 00:26:34,600 And here are the vertices I am going to in here, 638 00:26:34,600 --> 00:26:36,472 the vertex set here. 639 00:26:36,472 --> 00:26:37,680 And then I do some filtering. 640 00:26:37,680 --> 00:26:39,360 Because I might not go visit everybody. 641 00:26:39,360 --> 00:26:43,072 There's some filtering of what you can do. 642 00:26:43,072 --> 00:26:45,030 And then once you figure out exactly the things 643 00:26:45,030 --> 00:26:46,130 you are computing, here's a function 644 00:26:46,130 --> 00:26:47,460 to go and apply to that. 645 00:26:47,460 --> 00:26:50,640 So I can give you some very nice way of basically subsetting 646 00:26:50,640 --> 00:26:52,510 my graph with certain properties, 647 00:26:52,510 --> 00:26:54,770 selecting those things, and now going to compute there. 648 00:26:54,770 --> 00:26:56,800 And if you're only doing vertices, 649 00:26:56,800 --> 00:26:58,860 say, OK, for each vertex, again, I 650 00:26:58,860 --> 00:27:01,050 can filter, saying this subset or something goes 651 00:27:01,050 --> 00:27:02,250 to that computation. 652 00:27:02,250 --> 00:27:04,378 So language-wise, it's very simple. 653 00:27:04,378 --> 00:27:05,670 This is all you have to do. 654 00:27:05,670 --> 00:27:10,770 Now if you look at PageRank, PageRank 655 00:27:10,770 --> 00:27:12,670 has two interesting update functions. 656 00:27:12,670 --> 00:27:15,570 One is an update 657 00:27:15,570 --> 00:27:16,390 looking at edges. 658 00:27:16,390 --> 00:27:19,770 So what it says is, the new rank-- I get the destination vertex. 659 00:27:19,770 --> 00:27:22,830 And it gets updated using all the source vertices in here. 660 00:27:22,830 --> 00:27:25,710 This is the update function, a very simple update function. 
661 00:27:25,710 --> 00:27:29,220 And then once you do that for each, basically, vertex, 662 00:27:29,220 --> 00:27:30,870 I go do internal update. 663 00:27:30,870 --> 00:27:34,620 I give these two functions and put them together into driver. 664 00:27:34,620 --> 00:27:37,620 And the driver says run this function, run this function, 665 00:27:37,620 --> 00:27:39,420 and I'm done. 666 00:27:39,420 --> 00:27:42,720 OK, so I can write this code at higher level, much 667 00:27:42,720 --> 00:27:45,490 simpler, much nicer, much more elegant way. 668 00:27:45,490 --> 00:27:46,990 It's much easier to understand. 669 00:27:46,990 --> 00:27:49,620 It's easier than even the simple C++ code to understand 670 00:27:49,620 --> 00:27:52,520 what's going on if you write it in this way. 671 00:27:52,520 --> 00:27:55,830 So this is the first advantage of a domain specific language. 672 00:27:55,830 --> 00:27:57,030 I can do this. 673 00:27:57,030 --> 00:27:58,740 Then the next thing you can do is now 674 00:27:58,740 --> 00:28:01,130 I can come up with the schedule. 675 00:28:01,130 --> 00:28:02,903 So schedules should be easy to use. 676 00:28:02,903 --> 00:28:04,320 And it should be powerful enough I 677 00:28:04,320 --> 00:28:06,720 should be able to get the best speed possible. 678 00:28:06,720 --> 00:28:08,640 Because I can tell you all the crazy things 679 00:28:08,640 --> 00:28:10,540 I can do to the code. 680 00:28:10,540 --> 00:28:15,660 So here's my program here for PageRank. 681 00:28:15,660 --> 00:28:18,600 And so what I can do is, for this algorithm, 682 00:28:18,600 --> 00:28:21,110 I can provide this schedule in here. 683 00:28:21,110 --> 00:28:25,020 And this schedule basically says, OK, look at this guy, s1. 684 00:28:25,020 --> 00:28:27,030 I marked it in there. 685 00:28:27,030 --> 00:28:31,150 For s1, I want to do SparsePush type computation. 686 00:28:31,150 --> 00:28:33,780 This is how I want to process this one. 
687 00:28:33,780 --> 00:28:36,960 And then, by looking at that, I can generate pseudo code 688 00:28:36,960 --> 00:28:39,420 that looks like this, that basically first goes 689 00:28:39,420 --> 00:28:41,850 through a source node, because I'm doing push 690 00:28:41,850 --> 00:28:43,173 from source to destination. 691 00:28:43,173 --> 00:28:45,090 And then I'm going through all the destination 692 00:28:45,090 --> 00:28:46,630 nodes of that source. 693 00:28:46,630 --> 00:28:48,790 And I'm going to actually go and update them. 694 00:28:48,790 --> 00:28:52,077 So I can do this very simple updating here. 695 00:28:52,077 --> 00:28:53,910 But this might not get you the performance. 696 00:28:53,910 --> 00:28:57,330 I say, ah ha, I want to do this with parallelism. 697 00:28:57,330 --> 00:28:59,130 I want to run this parallel. 698 00:28:59,130 --> 00:29:01,470 And then when I do that, it will automatically generate-- 699 00:29:01,470 --> 00:29:04,283 say, ah ha, now I will make this loop parallel. 700 00:29:04,283 --> 00:29:05,700 And now I can't do simple updates. 701 00:29:05,700 --> 00:29:06,950 I have to do an atomic add. 702 00:29:06,950 --> 00:29:08,790 So here's my atomic add operation 703 00:29:08,790 --> 00:29:11,010 for the graph in here. 704 00:29:11,010 --> 00:29:13,860 Then you might think, and say, mm, do I want to do the push? 705 00:29:13,860 --> 00:29:15,420 Can I do a pull? 706 00:29:15,420 --> 00:29:19,210 So if I do a pull schedule, it will basically switch these-- 707 00:29:19,210 --> 00:29:19,710 in here. 708 00:29:19,710 --> 00:29:21,690 Now I am going from destination to source. 709 00:29:21,690 --> 00:29:23,163 I changed the order in there. 710 00:29:23,163 --> 00:29:25,080 And now I don't have to do that atomic update. 711 00:29:25,080 --> 00:29:30,270 Because I am pulling everything to my node and updating here. 
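The push-versus-pull switch described here can be sketched in plain Python. This is a hypothetical single rank-propagation step on a made-up three-node graph (not GraphIt's generated code): both directions compute the same numbers, but push writes into other nodes' entries, which is why the parallel push version needs the atomic add, while pull only ever writes the entry it owns.

```python
# Toy directed graph: 0 -> 1, 0 -> 2, 1 -> 2, stored both ways.
graph_out = {0: [1, 2], 1: [2]}  # out-neighbors, used by push
graph_in = {1: [0], 2: [0, 1]}   # in-neighbors, used by pull
out_degree = {0: 2, 1: 1}
rank = [1 / 3, 1 / 3, 1 / 3]
n = 3

def push_step():
    contrib = [0.0] * n
    for src, dsts in graph_out.items():
        for dst in dsts:
            # Writes into dst's slot: with sources running in
            # parallel, this update would have to be an atomic add.
            contrib[dst] += rank[src] / out_degree[src]
    return contrib

def pull_step():
    contrib = [0.0] * n
    for dst, srcs in graph_in.items():
        for src in srcs:
            # Only dst's own slot is written: no atomics needed
            # even if the destinations run in parallel.
            contrib[dst] += rank[src] / out_degree[src]
    return contrib

pushed, pulled = push_step(), pull_step()  # same numbers either way
```

The pull version pays for this by needing the in-edge representation of the graph, which is part of the extra work and data-layout tradeoff discussed earlier.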
712 00:29:30,270 --> 00:29:32,280 And then, of course, if you want to do some kind 713 00:29:32,280 --> 00:29:34,480 of partitioning, I can also say partitioning, 714 00:29:34,480 --> 00:29:36,603 it's-- now we created a sub-graph in here. 715 00:29:36,603 --> 00:29:38,770 And for the sub-graph, I am doing this partitioning. 716 00:29:38,770 --> 00:29:40,620 So I can keep changing all these things. 717 00:29:40,620 --> 00:29:42,270 Look, I didn't touch this. 718 00:29:42,270 --> 00:29:44,430 My algorithm still stays same. 719 00:29:44,430 --> 00:29:46,350 I'm changing my scheduling. 720 00:29:46,350 --> 00:29:49,710 I can play with this schedule. 721 00:29:49,710 --> 00:29:51,270 Nice thing about that is now if you 722 00:29:51,270 --> 00:29:53,200 keep playing with the schedule, here's 723 00:29:53,200 --> 00:29:54,450 the kind of performance I get. 724 00:29:54,450 --> 00:29:57,420 The first guy was sequential, pretty bad performance. 725 00:29:57,420 --> 00:29:59,400 The next guy, I just parallelized in here. 726 00:29:59,400 --> 00:30:01,760 I got some performance in here. 727 00:30:01,760 --> 00:30:04,160 But it had all the synchronization. 728 00:30:04,160 --> 00:30:06,210 So I changed the order of execution. 729 00:30:06,210 --> 00:30:08,180 And I got an even better performance. 730 00:30:08,180 --> 00:30:10,430 And now I partitioned, got [INAUDIBLE] performance. 731 00:30:10,430 --> 00:30:12,020 So this is the order of doing that. 732 00:30:12,020 --> 00:30:14,270 But, of course, you can play with many, many different 733 00:30:14,270 --> 00:30:15,600 combinations. 734 00:30:15,600 --> 00:30:18,890 And what GraphIt has is huge number 735 00:30:18,890 --> 00:30:22,260 of different combinations you can play with. 736 00:30:22,260 --> 00:30:24,380 So there are a lot of different optimizations. 
737 00:30:24,380 --> 00:30:26,150 You can do direction optimizations, 738 00:30:26,150 --> 00:30:29,130 push, pull, doing a sparse, dense, different 739 00:30:29,130 --> 00:30:34,610 parallelization, cache, NUMA optimization, and also 740 00:30:34,610 --> 00:30:37,340 data layout, things like structures of arrays, 741 00:30:37,340 --> 00:30:41,450 array of structure layout, additional data structures that 742 00:30:41,450 --> 00:30:42,350 simplify computation. 743 00:30:42,350 --> 00:30:44,820 All these things I can specify in here. 744 00:30:44,820 --> 00:30:46,070 And then you can play with it. 745 00:30:46,070 --> 00:30:48,480 It's not clear which one wins. 746 00:30:48,480 --> 00:30:51,170 It depends on the algorithm, depending on the graph shape, 747 00:30:51,170 --> 00:30:53,520 graph size, depending on the machine you run. 748 00:30:53,520 --> 00:30:56,833 So most of the time, if you are a performance engineer, 749 00:30:56,833 --> 00:30:58,250 you'll be trying different things, 750 00:30:58,250 --> 00:31:01,087 and looking at the performance, and say, this 751 00:31:01,087 --> 00:31:02,420 doesn't get good cache behavior. 752 00:31:02,420 --> 00:31:03,800 OK, let me try different things. 753 00:31:03,800 --> 00:31:04,960 So you want to iterate. 754 00:31:04,960 --> 00:31:06,710 And these iterations, you want to do fast. 755 00:31:06,710 --> 00:31:08,600 And this will do that. 756 00:31:08,600 --> 00:31:10,970 So let me tell you a little bit of results. 757 00:31:10,970 --> 00:31:11,740 This is a-- 758 00:31:11,740 --> 00:31:12,590 I have to explain. 759 00:31:12,590 --> 00:31:15,330 This a little bit of a complicated graph. 760 00:31:15,330 --> 00:31:17,840 So what we looked at was shown against bunch 761 00:31:17,840 --> 00:31:21,500 of different benchmarks, a bunch of different frameworks 762 00:31:21,500 --> 00:31:22,760 that do graphs. 
763 00:31:22,760 --> 00:31:27,290 So what this says is, here is a program, PageRank, run 764 00:31:27,290 --> 00:31:29,730 on a graph-- a general graph in here. 765 00:31:29,730 --> 00:31:31,370 One means it ran the fastest. 766 00:31:31,370 --> 00:31:33,520 This ran about 8% slower. 767 00:31:33,520 --> 00:31:35,550 This ran 50% slower. 768 00:31:35,550 --> 00:31:39,860 This ran 3x slower, and 8x slower for that graph. 769 00:31:39,860 --> 00:31:42,410 The interesting thing is, as you add more different graphs, 770 00:31:42,410 --> 00:31:43,890 the performance changes. 771 00:31:43,890 --> 00:31:46,430 So, in fact, even though we ran fastest 772 00:31:46,430 --> 00:31:50,450 there, for this road graph, which is a very different type of graph, 773 00:31:50,450 --> 00:31:52,910 this framework 774 00:31:52,910 --> 00:31:54,470 got the fastest result. 775 00:31:54,470 --> 00:31:55,850 Because the graph is different. 776 00:31:55,850 --> 00:31:58,855 So it might be doing something that's better. 777 00:31:58,855 --> 00:32:00,230 The interesting thing is-- 778 00:32:00,230 --> 00:32:02,420 because most of the other frameworks 779 00:32:02,420 --> 00:32:05,030 will have a couple of built-in things they try. 780 00:32:05,030 --> 00:32:07,010 They don't give you all this ability 781 00:32:07,010 --> 00:32:08,250 to try all these optimizations. 782 00:32:08,250 --> 00:32:09,530 They say, ah ha, I know this. 783 00:32:09,530 --> 00:32:10,070 This is really good. 784 00:32:10,070 --> 00:32:10,880 I will do that. 785 00:32:10,880 --> 00:32:13,280 It works for certain things, not for everybody. 786 00:32:13,280 --> 00:32:18,530 And so if you look at the different benchmarks-- breadth-first search, 787 00:32:18,530 --> 00:32:22,990 connected components, shortest path algorithms-- what you find 788 00:32:22,990 --> 00:32:27,900 is some frameworks are good sometimes. 
789 00:32:27,900 --> 00:32:29,820 They might be really bad at other times-- 790 00:32:29,820 --> 00:32:32,090 for some algorithms, some types of data, 791 00:32:32,090 --> 00:32:33,350 they can be really bad. 792 00:32:33,350 --> 00:32:36,890 So this framework was really good on this data set, 793 00:32:36,890 --> 00:32:39,530 but really bad on this data, and really 794 00:32:39,530 --> 00:32:41,100 not good on this algorithm. 795 00:32:41,100 --> 00:32:43,020 We are good most of the time. 796 00:32:43,020 --> 00:32:46,850 The reason is we don't fix a few decisions in advance. 797 00:32:46,850 --> 00:32:49,520 In GraphIt, what it will do is it will give you this ability 798 00:32:49,520 --> 00:32:51,860 to try different things. 799 00:32:51,860 --> 00:32:55,430 And depending on the graph, depending on the algorithm, 800 00:32:55,430 --> 00:32:58,680 some optimizations might work better than others. 801 00:32:58,680 --> 00:33:01,250 This is exactly what you guys have been doing in the class. 802 00:33:01,250 --> 00:33:04,803 You are trying different optimizations by hand. 803 00:33:04,803 --> 00:33:07,220 The difference is, every time you thought about optimizing, 804 00:33:07,220 --> 00:33:10,370 you had to go change the entire program to make that work. 805 00:33:10,370 --> 00:33:12,780 Here you just change the scheduling language one way, 806 00:33:12,780 --> 00:33:16,960 recompile, run, measure, and you can do this fast. 807 00:33:16,960 --> 00:33:19,820 Any questions so far before I switch gears? 808 00:33:24,790 --> 00:33:26,290 AUDIENCE: [INAUDIBLE] 809 00:33:26,290 --> 00:33:28,130 SAMAN AMARASIGNHE: OK. 810 00:33:28,130 --> 00:33:31,690 So I'm going to switch to another domain 811 00:33:31,690 --> 00:33:34,560 specific language. 812 00:33:34,560 --> 00:33:37,040 You will find a lot of similarities, a lot of parallels 813 00:33:37,040 --> 00:33:37,540 in here. 814 00:33:37,540 --> 00:33:38,707 This was intentional. 
815 00:33:38,707 --> 00:33:40,540 I could have talked about many different domain 816 00:33:40,540 --> 00:33:41,420 specific languages. 817 00:33:41,420 --> 00:33:44,770 But I took another one that almost 818 00:33:44,770 --> 00:33:47,470 has kind of mirror similarities to what's 819 00:33:47,470 --> 00:33:48,280 going on. 820 00:33:48,280 --> 00:33:51,120 And you will see a pattern in here, hopefully. 821 00:33:51,120 --> 00:33:54,370 And after this, I will ask you what the patterns are. 822 00:33:54,370 --> 00:33:56,600 This language is Halide. 823 00:33:56,600 --> 00:33:59,890 It was originally developed for image processing. 824 00:33:59,890 --> 00:34:03,580 And its focus is-- GraphIt focused on sparse graph data 825 00:34:03,580 --> 00:34:04,400 structures. 826 00:34:04,400 --> 00:34:07,210 Halide's focus is-- because images 827 00:34:07,210 --> 00:34:09,219 are dense, regular structures. 828 00:34:09,219 --> 00:34:11,139 You do regular computation on the images. 829 00:34:11,139 --> 00:34:12,520 And you process this thing. 830 00:34:12,520 --> 00:34:14,080 And you have a very complex pipeline. 831 00:34:14,080 --> 00:34:15,497 Like, for example, a camera pipeline 832 00:34:15,497 --> 00:34:18,460 does many very complex algorithms to the image 833 00:34:18,460 --> 00:34:22,719 before you get from the bits coming out of your CCD 834 00:34:22,719 --> 00:34:25,840 to the beautiful picture you see in Facebook. 835 00:34:30,650 --> 00:34:33,000 And the primary goal of Halide was 836 00:34:33,000 --> 00:34:35,610 we wanted to match and exceed hand-optimized performance, 837 00:34:35,610 --> 00:34:36,110 basically. 838 00:34:36,110 --> 00:34:38,600 This was the property we wanted. 839 00:34:38,600 --> 00:34:41,400 And we want to reduce the rote amount of programming 840 00:34:41,400 --> 00:34:43,650 that normally a performance engineer has 841 00:34:43,650 --> 00:34:44,965 to do to achieve this thing. 
842 00:34:44,965 --> 00:34:47,340 And we want to also increase the portability, the ability 843 00:34:47,340 --> 00:34:48,840 to take that program from one machine 844 00:34:48,840 --> 00:34:49,500 to a different one. 845 00:34:49,500 --> 00:34:51,889 So let me give you an example. 846 00:34:51,889 --> 00:34:56,800 Here is a three by three blur example. 847 00:34:56,800 --> 00:34:59,760 So what this does is this [INAUDIBLE]-- two loops go 848 00:34:59,760 --> 00:35:05,190 in the x direction and do a blur in the x direction-- 849 00:35:05,190 --> 00:35:07,380 average the three values next to each other. 850 00:35:07,380 --> 00:35:08,760 And then it will take 851 00:35:08,760 --> 00:35:12,660 the result of that, do it in the y direction, and average that. 852 00:35:12,660 --> 00:35:18,840 OK, a very simple filter that you might want to do for an image, 853 00:35:18,840 --> 00:35:19,860 you can run this. 854 00:35:19,860 --> 00:35:21,940 This is valid C code. 855 00:35:21,940 --> 00:35:23,820 But if you want to get performance, 856 00:35:23,820 --> 00:35:26,730 you want to generate this guy. 857 00:35:26,730 --> 00:35:29,970 This thing, on the other hand, ran about 11 times 858 00:35:29,970 --> 00:35:31,890 faster than this one. 859 00:35:31,890 --> 00:35:33,720 This has been tiled. 860 00:35:33,720 --> 00:35:35,370 It has fused multiple loops. 861 00:35:35,370 --> 00:35:36,360 It has vectorized. 862 00:35:36,360 --> 00:35:37,650 It has multi-threaded. 863 00:35:37,650 --> 00:35:39,300 It has to do some redundant computation 864 00:35:39,300 --> 00:35:40,840 I'll get to a little bit later. 865 00:35:40,840 --> 00:35:44,020 And it basically gives near roof-line optimum performance. 866 00:35:44,020 --> 00:35:47,760 That means it's using the machine resources to the max. 867 00:35:47,760 --> 00:35:50,410 Because this has a bunch of floating point operations. 
868 00:35:50,410 --> 00:35:52,620 So basically, the floating point unit is 869 00:35:52,620 --> 00:35:53,970 running at max performance. 870 00:35:53,970 --> 00:35:57,450 So there's nothing much else you could do to this one. 871 00:35:57,450 --> 00:35:59,270 But you write this thing. 872 00:35:59,270 --> 00:36:01,890 And this is not that easy. 873 00:36:01,890 --> 00:36:09,270 So this project started some time ago with one of my-- 874 00:36:09,270 --> 00:36:12,400 the person who did it-- going to Adobe. 875 00:36:12,400 --> 00:36:14,010 He went to Adobe. 876 00:36:14,010 --> 00:36:17,670 And they had this thing called a local Laplacian 877 00:36:17,670 --> 00:36:20,190 filter in the Camera Raw, Lightroom, 878 00:36:20,190 --> 00:36:22,160 and Photoshop projects in here. 879 00:36:22,160 --> 00:36:27,180 The reference implementation was about 300 lines of code. 880 00:36:27,180 --> 00:36:29,460 But the implementation that they used 881 00:36:29,460 --> 00:36:31,440 was about 1,500 lines of code. 882 00:36:31,440 --> 00:36:34,020 It took one of their best engineers three months 883 00:36:34,020 --> 00:36:35,190 to get to that performance. 884 00:36:35,190 --> 00:36:36,570 But it made sense. 885 00:36:36,570 --> 00:36:39,330 Because that engineer was able to get 886 00:36:39,330 --> 00:36:44,560 10x faster by trial and error for this piece of code. 887 00:36:44,560 --> 00:36:49,180 It's a non-trivial piece of coding here to go do that. 888 00:36:49,180 --> 00:36:53,610 So the student, Jonathan, who's now a professor at Berkeley, 889 00:36:53,610 --> 00:36:58,290 basically, in one day, in 60 lines of Halide, 890 00:36:58,290 --> 00:37:04,640 was able to beat the Adobe code by 2x, in some sense. 891 00:37:04,640 --> 00:37:07,440 And then Adobe, in those days, didn't 892 00:37:07,440 --> 00:37:11,220 generate any code for GPUs. 893 00:37:11,220 --> 00:37:14,370 Because they decided GPUs change too fast. 
894 00:37:14,370 --> 00:37:17,570 And they can't keep up updating for GPUs in every generation. 895 00:37:17,570 --> 00:37:20,160 Because of that, they were not-- 896 00:37:20,160 --> 00:37:22,112 the Adobe applications were not using GPUs. 897 00:37:22,112 --> 00:37:24,320 So if you ran Photoshop, it's not going to use a GPU, 898 00:37:24,320 --> 00:37:26,000 even if your machine has a GPU. 899 00:37:26,000 --> 00:37:28,440 So Jonathan still had some time left in the day. 900 00:37:28,440 --> 00:37:30,750 So he said, OK, let me try to write it on GPUs. 901 00:37:30,750 --> 00:37:32,690 So with basically the same code, he 902 00:37:32,690 --> 00:37:38,940 changed the schedule for GPUs and got 9x faster than the fastest Adobe 903 00:37:38,940 --> 00:37:41,370 had ever had for this piece of code. 904 00:37:41,370 --> 00:37:43,500 So how did he do it? 905 00:37:43,500 --> 00:37:48,060 Again, the key principle here is decoupling the algorithm 906 00:37:48,060 --> 00:37:49,390 from the schedule. 907 00:37:49,390 --> 00:37:52,230 The algorithm, again, is what is computed. 908 00:37:52,230 --> 00:37:54,230 And the algorithm defines the pipeline 909 00:37:54,230 --> 00:37:57,380 of very simple pure functions operating in there. 910 00:37:57,380 --> 00:38:00,930 And execution order, parallelism, all those things 911 00:38:00,930 --> 00:38:02,670 are left to the schedule. 912 00:38:02,670 --> 00:38:04,590 The pipeline in Halide just looks 913 00:38:04,590 --> 00:38:06,680 like this for the blur filter. 914 00:38:06,680 --> 00:38:10,890 It says, OK, blur the image in the x dimension. 915 00:38:10,890 --> 00:38:12,680 And then do a blur in the y dimension. 916 00:38:12,680 --> 00:38:13,180 That's all. 917 00:38:13,180 --> 00:38:14,408 And the image size is-- 918 00:38:14,408 --> 00:38:16,200 because it's operating on the entire image, 919 00:38:16,200 --> 00:38:17,980 you don't have loops in here. 920 00:38:17,980 --> 00:38:20,460 That's all you have to say there. 
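For reference, Halide's published blur example expresses exactly this split. The snippet below is a sketch in Halide's C++-embedded syntax, with tile sizes and vector widths taken as illustrative; it assumes an input `Func in` and the Halide headers, so it is not compilable on its own.

```cpp
// Algorithm: what is computed -- a pipeline of pure functions, no loops.
Func blur_x, blur_y;
Var x, y, xi, yi;
blur_x(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3;
blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

// Schedule: when and where it is computed -- tiling, vectorization,
// multithreading, and where blur_x is materialized relative to blur_y.
blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
blur_x.compute_at(blur_y, x).vectorize(x, 8);
```

Changing only the last two lines retargets the same algorithm to a different machine, which is how the GPU version came from "basically the same code."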
921 00:38:20,460 --> 00:38:22,440 Then you have to come up with a schedule. 922 00:38:22,440 --> 00:38:27,720 Again, the schedule says when and where it's computed. 923 00:38:27,720 --> 00:38:30,420 It needs to be simple, so that you can actually tell it that. 924 00:38:30,420 --> 00:38:31,380 And it has to be powerful. 925 00:38:31,380 --> 00:38:33,755 You need to be able to get the hand-optimized performance 926 00:38:33,755 --> 00:38:34,930 or better by doing this. 927 00:38:38,507 --> 00:38:40,090 This should look a little bit familiar. 928 00:38:40,090 --> 00:38:43,350 Because all these things, a lot of the work 929 00:38:43,350 --> 00:38:46,410 you do in performance, kind of fit into this genre. 930 00:38:46,410 --> 00:38:50,010 You need to do a trade-off between locality, parallelism, 931 00:38:50,010 --> 00:38:51,045 and redundant work. 932 00:38:51,045 --> 00:38:52,420 That's what you look for in here. 933 00:38:55,990 --> 00:39:02,300 So let's look at the three things you need to do. 934 00:39:02,300 --> 00:39:04,010 First, you need to get parallelism. 935 00:39:04,010 --> 00:39:06,610 Parallelism is you need to keep the multi-cores and vector 936 00:39:06,610 --> 00:39:09,768 units happy and probably the GPU busy. 937 00:39:09,768 --> 00:39:11,310 But if you have too much parallelism, 938 00:39:11,310 --> 00:39:12,532 it's not going to help you. 939 00:39:12,532 --> 00:39:14,740 I mean, nobody is going to take advantage [INAUDIBLE] 940 00:39:14,740 --> 00:39:15,500 parallelism. 941 00:39:15,500 --> 00:39:17,208 So let's look at a piece of code in here. 942 00:39:20,340 --> 00:39:22,120 So assume I am going to say, I'm going 943 00:39:22,120 --> 00:39:24,000 to run all these things parallel and all these things 944 00:39:24,000 --> 00:39:24,840 parallel afterwards. 945 00:39:24,840 --> 00:39:27,150 If you have three cores-- 946 00:39:27,150 --> 00:39:29,260 great, I got a lot more parallelism. 947 00:39:29,260 --> 00:39:31,100 I got six times parallelism. 
948 00:39:31,100 --> 00:39:33,330 Hurrah, nobody's going to use that. 949 00:39:33,330 --> 00:39:36,900 It's not that useful to get six times parallelism in here. 950 00:39:36,900 --> 00:39:41,070 On the other hand, if you run like this, one at a time, 951 00:39:41,070 --> 00:39:42,505 you have parallelism of one. 952 00:39:42,505 --> 00:39:43,380 That's not that good. 953 00:39:43,380 --> 00:39:45,850 Because you're going to not use the machine. 954 00:39:45,850 --> 00:39:49,590 So what you really want is something basically-- 955 00:39:49,590 --> 00:39:51,360 OK, wait till it's done-- that actually 956 00:39:51,360 --> 00:39:53,430 do parallelisms of three might be 957 00:39:53,430 --> 00:39:55,110 the best way of running that machine 958 00:39:55,110 --> 00:39:56,730 to get best performance. 959 00:39:56,730 --> 00:39:57,730 You don't want too much. 960 00:39:57,730 --> 00:39:58,680 You don't want too little. 961 00:39:58,680 --> 00:40:00,263 You want to get the exact right thing. 962 00:40:03,520 --> 00:40:07,960 The next interesting thing you need to get is locality. 963 00:40:07,960 --> 00:40:10,630 Normally, when you do image processing, what you do 964 00:40:10,630 --> 00:40:13,978 is you change everything in the image in one filter. 965 00:40:13,978 --> 00:40:16,270 Then the next filter has to go in and change everything 966 00:40:16,270 --> 00:40:17,450 in the image. 967 00:40:17,450 --> 00:40:19,440 So what happens if one filter ran 968 00:40:19,440 --> 00:40:21,710 through the entire image and the next 969 00:40:21,710 --> 00:40:23,710 come and start running through the entire image? 970 00:40:23,710 --> 00:40:26,500 What happens, basically? 971 00:40:26,500 --> 00:40:27,970 Is that good? 972 00:40:27,970 --> 00:40:29,710 I give the entire image, say you, 973 00:40:29,710 --> 00:40:31,840 do my first color correction. 974 00:40:31,840 --> 00:40:35,150 And I will do some kind of aberration correction 975 00:40:35,150 --> 00:40:35,650 afterwards. 
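The point about the "exact right" amount of parallelism can be sketched as follows: give each core one strip of rows, rather than spawning maximal parallelism. The helper name and the row-strip scheme are illustrative assumptions, not anything from the lecture slides.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Split the rows of a W x H image across exactly as many threads as the
// machine has cores: enough parallelism to keep every core busy, but not
// more, since parallelism beyond the core count only adds overhead.
void parallel_rows(std::vector<float>& img, int W, int H,
                   const std::function<void(float*, int)>& body) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        int y0 = (int)((long long)H * t / n);
        int y1 = (int)((long long)H * (t + 1) / n);
        workers.emplace_back([&img, &body, W, y0, y1] {
            for (int y = y0; y < y1; ++y)   // each thread owns a disjoint strip
                body(&img[(size_t)y * W], W);
        });
    }
    for (auto& w : workers) w.join();
}
```

A per-pixel filter then runs with parallelism matched to the machine, the "parallelism of three on three cores" situation, rather than one task per pixel.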
976 00:40:35,650 --> 00:40:39,370 So what happens if you do something like that? 977 00:40:39,370 --> 00:40:41,020 Entire image, process one filter, 978 00:40:41,020 --> 00:40:43,480 then the next filter takes the image and process the entire 979 00:40:43,480 --> 00:40:46,510 [INAUDIBLE] or whatever multi-megapixel image-- 980 00:40:48,642 --> 00:40:50,100 oh, you-- you're on a [INAUDIBLE].. 981 00:40:50,100 --> 00:40:51,262 OK, back there. 982 00:40:51,262 --> 00:40:53,318 AUDIENCE: You end up kicking [INAUDIBLE].. 983 00:40:53,318 --> 00:40:54,860 SAMAN AMARASIGNHE: [INAUDIBLE] cache. 984 00:40:54,860 --> 00:40:57,350 Because if the image is large, it doesn't fit in the cache. 985 00:40:57,350 --> 00:40:59,000 It's not that great to do this. 986 00:40:59,000 --> 00:41:01,610 You won't get things in the cache in here. 987 00:41:01,610 --> 00:41:07,370 So assume I go like this, processing the entire first row 988 00:41:07,370 --> 00:41:09,240 before you go to the second row. 989 00:41:09,240 --> 00:41:13,880 So what happens now here is we need to start touch this one-- 990 00:41:13,880 --> 00:41:16,100 I need to read these two values. 991 00:41:16,100 --> 00:41:17,570 And those two are-- the last time 992 00:41:17,570 --> 00:41:21,200 I read them was way before I started. 993 00:41:21,200 --> 00:41:22,220 So I [INAUDIBLE] them. 994 00:41:22,220 --> 00:41:23,620 I went through all the image. 995 00:41:23,620 --> 00:41:24,620 And I come back to that. 996 00:41:24,620 --> 00:41:28,613 And this distance-- by the time I reach here, 997 00:41:28,613 --> 00:41:30,530 I might-- these two might be out of the cache. 998 00:41:30,530 --> 00:41:33,650 And when I go back there, oops, it's not in the cache. 999 00:41:33,650 --> 00:41:36,183 I have a problem in that. 1000 00:41:36,183 --> 00:41:37,850 So the other way, a right way to do that 1001 00:41:37,850 --> 00:41:40,540 might be trying it this way. 
1002 00:41:40,540 --> 00:41:42,800 If I run it like this, basically, what 1003 00:41:42,800 --> 00:41:46,180 happens is as you run-- 1004 00:41:46,180 --> 00:41:47,990 when I touch this, I need-- 1005 00:41:47,990 --> 00:41:50,060 I need to get these three to run this thing. 1006 00:41:50,060 --> 00:41:53,210 The last time I read this one was just before, 1007 00:41:53,210 --> 00:41:55,880 in the previous iteration. 1008 00:41:55,880 --> 00:41:57,020 So to get to that-- 1009 00:41:57,020 --> 00:41:58,087 I just touched it. 1010 00:41:58,087 --> 00:42:00,420 So the next guy uses it, the next guy, and after, after. 1011 00:42:00,420 --> 00:42:01,500 I go through my window. 1012 00:42:01,500 --> 00:42:02,750 I never touch that again. 1013 00:42:02,750 --> 00:42:04,790 I have really good locality in here. 1014 00:42:04,790 --> 00:42:06,610 So I want to operate it that way. 1015 00:42:06,610 --> 00:42:09,000 I will get good locality in here. 1016 00:42:09,000 --> 00:42:11,090 So redundant work is a very interesting thing. 1017 00:42:11,090 --> 00:42:15,050 Sometimes, if you want to get both locality and parallelism, 1018 00:42:15,050 --> 00:42:17,110 you might have to do some extra work, 1019 00:42:17,110 --> 00:42:19,740 a little bit of extra work. 1020 00:42:19,740 --> 00:42:25,520 So assume in this one I had to process these elements in parallel 1021 00:42:25,520 --> 00:42:27,710 if I want to run these three. 1022 00:42:27,710 --> 00:42:32,250 Because these three need all these four elements in there. 1023 00:42:32,250 --> 00:42:34,040 These three need these four. 1024 00:42:34,040 --> 00:42:35,840 If I want to run these two parallel in two 1025 00:42:35,840 --> 00:42:39,410 different cores, it might be better if both 1026 00:42:39,410 --> 00:42:41,063 calculate these two values. 1027 00:42:41,063 --> 00:42:42,980 Because then I don't have to synchronize and stuff. 1028 00:42:42,980 --> 00:42:45,410 I can say, the left guy, calculate four values. 
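The "I just touched it" traversal is the sliding-window point in the schedule space discussed later: produce each row of the first stage just before the second stage consumes it, so only three rows of intermediate storage stay live (and hot in cache). A sketch of that idea for the 3x3 blur, with illustrative names and a zero border:

```cpp
#include <cassert>
#include <vector>

// Sliding-window 3x3 blur: the blur-in-y pass at row y needs blur-in-x rows
// y-1, y, y+1, so a rolling three-row buffer replaces the whole-image
// intermediate. Each intermediate row is consumed right after it is produced.
std::vector<float> blur_sliding(const std::vector<float>& in, int W, int H) {
    std::vector<float> out(W * H, 0.0f);
    std::vector<float> ring(3 * W, 0.0f);      // three live rows of blur-in-x
    auto produce_row = [&](int y, float* dst) {
        dst[0] = dst[W - 1] = 0.0f;            // zero border, as in the sketch
        for (int x = 1; x < W - 1; ++x)
            dst[x] = (in[y * W + x - 1] + in[y * W + x] + in[y * W + x + 1]) / 3.0f;
    };
    produce_row(0, &ring[0]);                  // prime the window
    produce_row(1, &ring[W]);
    for (int y = 1; y < H - 1; ++y) {
        produce_row(y + 1, &ring[((y + 1) % 3) * W]);  // just-in-time producer
        const float* up   = &ring[((y - 1) % 3) * W];
        const float* mid  = &ring[(y % 3) * W];
        const float* down = &ring[((y + 1) % 3) * W];
        for (int x = 1; x < W - 1; ++x)        // consumer reuses rows just touched
            out[y * W + x] = (up[x] + mid[x] + down[x]) / 3.0f;
    }
    return out;
}
```

The values being averaged were written moments before, so they are still in cache, regardless of image size.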
1029 00:42:45,410 --> 00:42:46,620 And then I can do the three. 1030 00:42:46,620 --> 00:42:48,328 The right guy, calculate the four values. 1031 00:42:48,328 --> 00:42:49,495 And then I can do the three. 1032 00:42:49,495 --> 00:42:50,570 I can do that in parallel. 1033 00:42:50,570 --> 00:42:54,470 But now the middle two guys, these two get calculated twice. 1034 00:42:54,470 --> 00:42:56,930 Because both need it. 1035 00:42:56,930 --> 00:43:00,590 And so what that means is-- oops, you can keep that. 1036 00:43:00,590 --> 00:43:03,348 So sometimes, to do everything, I 1037 00:43:03,348 --> 00:43:04,890 might have to do some redundant work. 1038 00:43:04,890 --> 00:43:08,120 So the way to look at that is I can put this 1039 00:43:08,120 --> 00:43:11,870 into this scheduling framework. 1040 00:43:11,870 --> 00:43:14,150 One axis is my computation granularity. 1041 00:43:14,150 --> 00:43:16,700 Coarse interleaving means low locality. 1042 00:43:16,700 --> 00:43:18,870 That means I finish everything before I go back 1043 00:43:18,870 --> 00:43:21,020 in here between two things. 1044 00:43:21,020 --> 00:43:22,700 If I run two things, I finish this one 1045 00:43:22,700 --> 00:43:24,410 before I go to the next one. 1046 00:43:24,410 --> 00:43:27,120 Fine interleaving means I process one element, 1047 00:43:27,120 --> 00:43:29,900 duh duh duh duh, go back and forth in here. 1048 00:43:29,900 --> 00:43:32,340 Those are my two options here. 1049 00:43:32,340 --> 00:43:34,700 The other axis is storage granularity. 1050 00:43:34,700 --> 00:43:37,760 What that means is-- 1051 00:43:37,760 --> 00:43:41,360 storage granularity very low means I calculate something, 1052 00:43:41,360 --> 00:43:42,320 and I don't remember it. 1053 00:43:42,320 --> 00:43:46,760 Next time I want it, I recalculate it again. 1054 00:43:46,760 --> 00:43:49,910 Very high storage granularity means once I calculate it, 1055 00:43:49,910 --> 00:43:51,200 I will remember it forever. 
1056 00:43:51,200 --> 00:43:53,630 Anytime you need that value, I have it back for you. 1057 00:43:53,630 --> 00:43:56,210 So that means I have to get it to you from anywhere 1058 00:43:56,210 --> 00:43:57,230 I calculated. 1059 00:43:57,230 --> 00:43:59,960 Storage granularity low means my process, I calculate, I use, 1060 00:43:59,960 --> 00:44:00,710 I throw it out. 1061 00:44:00,710 --> 00:44:04,860 If anybody else want it, they'll recalculate again. 1062 00:44:04,860 --> 00:44:07,310 So now you can have many different computations 1063 00:44:07,310 --> 00:44:09,710 in different places of this space in here. 1064 00:44:09,710 --> 00:44:11,630 So if you want to compute something here, 1065 00:44:11,630 --> 00:44:13,850 this is the scheduling language. 1066 00:44:13,850 --> 00:44:16,370 That means I run this one, and I run this one. 1067 00:44:16,370 --> 00:44:18,680 I have no redundant computation, very 1068 00:44:18,680 --> 00:44:20,990 coarse grained interleaving. 1069 00:44:20,990 --> 00:44:22,970 That means I run the entire thing, and then 1070 00:44:22,970 --> 00:44:25,310 the next entire thing. 1071 00:44:25,310 --> 00:44:27,440 You can go very fine [INAUDIBLE] in here. 1072 00:44:27,440 --> 00:44:28,910 I'll calculate this one. 1073 00:44:28,910 --> 00:44:31,400 And I'll calculate these three again, these three again. 1074 00:44:31,400 --> 00:44:34,430 So everything is calculated multiple times. 1075 00:44:34,430 --> 00:44:37,370 When you need it, I recalculate every time I need something. 1076 00:44:37,370 --> 00:44:39,140 I don't store anything in here. 1077 00:44:39,140 --> 00:44:39,770 So it's good. 1078 00:44:39,770 --> 00:44:41,420 I have a lot of locality. 1079 00:44:41,420 --> 00:44:44,695 But I'm doing a lot of recomputation. 1080 00:44:44,695 --> 00:44:46,070 And then here, you have something 1081 00:44:46,070 --> 00:44:47,470 like a sliding window. 1082 00:44:47,470 --> 00:44:50,832 Basically, you are not recalculating anything. 
1083 00:44:50,832 --> 00:44:52,040 But you are sliding in there. 1084 00:44:52,040 --> 00:44:53,990 You have a little bit less parallelism. 1085 00:44:53,990 --> 00:44:56,810 And then you could capture this entire spectrum 1086 00:44:56,810 --> 00:44:59,780 in between in here. 1087 00:44:59,780 --> 00:45:03,800 And you can get different levels of fusion of these tiles. 1088 00:45:03,800 --> 00:45:06,230 And you can calculate-- 1089 00:45:06,230 --> 00:45:08,060 so I don't recalculate everything. 1090 00:45:08,060 --> 00:45:09,800 I recalculate a few things in here. 1091 00:45:09,800 --> 00:45:11,480 These two get recalculated. 1092 00:45:14,120 --> 00:45:14,870 And then you can-- 1093 00:45:14,870 --> 00:45:16,675 I'll go through this fast [INAUDIBLE].. 1094 00:45:16,675 --> 00:45:18,050 You can use all these operations. 1095 00:45:18,050 --> 00:45:19,700 So here is the interesting thing. 1096 00:45:19,700 --> 00:45:23,360 So here I am showing you different schedules 1097 00:45:23,360 --> 00:45:25,470 at different points in here. 1098 00:45:25,470 --> 00:45:26,720 So I'm going to run this. 1099 00:45:26,720 --> 00:45:28,400 This is showing time. 1100 00:45:28,400 --> 00:45:30,268 So what it says is this is doing-- 1101 00:45:30,268 --> 00:45:31,810 you're going through the first input, 1102 00:45:31,810 --> 00:45:34,640 [INAUDIBLE] the middle one, [INAUDIBLE] the output in here. 1103 00:45:34,640 --> 00:45:37,570 So this one has all the locality, but a lot of redundant work. 1104 00:45:37,570 --> 00:45:39,680 Some patterns have good locality. 1105 00:45:39,680 --> 00:45:41,180 Other patterns have not-so-good locality. 1106 00:45:41,180 --> 00:45:43,190 In here is some kind of intermediate thing. 1107 00:45:43,190 --> 00:45:45,410 So what it shows is these are no good. 1108 00:45:45,410 --> 00:45:47,710 A good balance between locality, parallelism, 1109 00:45:47,710 --> 00:45:49,920 and some redundant work seems to do really well. 
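That balance can be made concrete with a strip-tiled version of the 3x3 blur: each strip is independent (parallelism), its intermediate rows are small (locality), and the strip-boundary rows of the first pass are computed twice (the redundant work). Names and the strip scheme are illustrative sketches, not code from the lecture.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Strip-tiled 3x3 blur: the image is split into horizontal strips that can
// run independently, and each strip recomputes the halo rows of the first
// pass that it shares with its neighbours instead of synchronizing with them.
std::vector<float> blur_tiled(const std::vector<float>& in, int W, int H, int T) {
    std::vector<float> out(W * H, 0.0f);
    for (int y0 = 1; y0 < H - 1; y0 += T) {
        int y1 = std::min(y0 + T, H - 1);
        // Local blur-in-x for rows y0-1 .. y1. The boundary rows are also
        // computed by the adjacent strips: that is the redundant work.
        std::vector<float> bx((y1 - y0 + 2) * W, 0.0f);
        for (int y = y0 - 1; y <= y1; ++y)
            for (int x = 1; x < W - 1; ++x)
                bx[(y - y0 + 1) * W + x] =
                    (in[y * W + x - 1] + in[y * W + x] + in[y * W + x + 1]) / 3.0f;
        for (int y = y0; y < y1; ++y)          // blur bx in y within the strip
            for (int x = 1; x < W - 1; ++x)
                out[y * W + x] = (bx[(y - y0) * W + x] + bx[(y - y0 + 1) * W + x] +
                                  bx[(y - y0 + 2) * W + x]) / 3.0f;
    }
    return out;
}

// Plain two-pass version, used only to check the two schedules agree.
std::vector<float> blur_ref(const std::vector<float>& in, int W, int H) {
    std::vector<float> bx(W * H, 0.0f), out(W * H, 0.0f);
    for (int y = 0; y < H; ++y)
        for (int x = 1; x < W - 1; ++x)
            bx[y * W + x] = (in[y * W + x - 1] + in[y * W + x] + in[y * W + x + 1]) / 3.0f;
    for (int y = 1; y < H - 1; ++y)
        for (int x = 1; x < W - 1; ++x)
            out[y * W + x] = (bx[(y - 1) * W + x] + bx[y * W + x] + bx[(y + 1) * W + x]) / 3.0f;
    return out;
}
```

The strip height T is the scheduling knob: smaller strips mean better locality and more parallelism, but a larger fraction of recomputed halo rows, which is exactly the trade-off being plotted.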
1110 00:45:49,920 --> 00:45:52,052 This guy finished the fastest. 1111 00:45:52,052 --> 00:45:54,010 So what you do is you write different schedules 1112 00:45:54,010 --> 00:45:54,718 for these things. 1113 00:45:54,718 --> 00:45:56,040 And you keep running. 1114 00:45:56,040 --> 00:45:58,490 And we figured out what schedule works. 1115 00:45:58,490 --> 00:46:01,250 So this is kind of trial and error part 1116 00:46:01,250 --> 00:46:02,329 you have to do in here. 1117 00:46:09,670 --> 00:46:14,400 So if you look at what's going on 1118 00:46:14,400 --> 00:46:18,220 in here, what you see here is-- 1119 00:46:18,220 --> 00:46:21,300 there's some example-- is bilateral filter computation 1120 00:46:21,300 --> 00:46:21,960 here. 1121 00:46:21,960 --> 00:46:26,370 What it says is the original is about 122 lines of C++ code. 1122 00:46:26,370 --> 00:46:31,320 And you found something with a good parallelism in here. 1123 00:46:31,320 --> 00:46:35,580 But we could write it in 32 lines of Halide in here. 1124 00:46:35,580 --> 00:46:40,530 And we were able to get about 6x faster than CPU. 1125 00:46:40,530 --> 00:46:43,860 But the best algorithm was somebody hand 1126 00:46:43,860 --> 00:46:46,320 wrote for the paper on GPUs. 1127 00:46:46,320 --> 00:46:48,810 And what it did it was it gave up some parallelism 1128 00:46:48,810 --> 00:46:50,470 for much better locality. 1129 00:46:50,470 --> 00:46:52,470 And if you give up some parallelism, much better 1130 00:46:52,470 --> 00:46:54,510 locality, because we can optimize in that, 1131 00:46:54,510 --> 00:46:57,060 we got faster than their handwritten algorithm. 1132 00:46:57,060 --> 00:47:00,420 So we can change something. 1133 00:47:00,420 --> 00:47:03,030 Here's, again, another algorithm that 1134 00:47:03,030 --> 00:47:06,260 is doing segmenting in here. 1135 00:47:06,260 --> 00:47:08,070 And it was written in MATLAB. 1136 00:47:08,070 --> 00:47:11,220 And MATLAB is a lot less lines of code, of course. 
1137 00:47:11,220 --> 00:47:13,080 But in Halide, it's a few more lines, 1138 00:47:13,080 --> 00:47:16,080 because you're not just calling library functions. 1139 00:47:16,080 --> 00:47:19,330 And Halide was 70 times faster. 1140 00:47:19,330 --> 00:47:23,220 And if you run it on the GPU versus MATLAB, it's about 100-- 1141 00:47:23,220 --> 00:47:26,250 1,000 times faster. 1142 00:47:26,250 --> 00:47:30,540 It's not because you're running bad MATLAB loops. 1143 00:47:30,540 --> 00:47:34,260 In fact, what MATLAB did was call very well hand-optimized 1144 00:47:34,260 --> 00:47:35,500 libraries. 1145 00:47:35,500 --> 00:47:38,000 But the problem with calling libraries is there's no locality. 1146 00:47:38,000 --> 00:47:40,780 I call a really fast library for the first routine. 1147 00:47:40,780 --> 00:47:42,533 It runs really fast. 1148 00:47:42,533 --> 00:47:43,950 And then you call the next routine 1149 00:47:43,950 --> 00:47:45,908 that has to [INAUDIBLE] the entire image again. 1150 00:47:45,908 --> 00:47:47,940 And now my image is completely out of the cache. 1151 00:47:47,940 --> 00:47:50,370 So what happens is, between these very fast libraries, 1152 00:47:50,370 --> 00:47:52,210 you're bringing the image back into cache. 1153 00:47:52,210 --> 00:47:54,330 And when you have something like a library, 1154 00:47:54,330 --> 00:47:57,300 you can't fuse library functions together. 1155 00:47:57,300 --> 00:47:59,010 In Halide, we can fuse them together 1156 00:47:59,010 --> 00:48:01,322 and say, oh, I take this line of the image, 1157 00:48:01,322 --> 00:48:03,030 and I will do everything on that before I 1158 00:48:03,030 --> 00:48:04,150 move to the next thing. 1159 00:48:04,150 --> 00:48:06,030 So I can do it much faster. 1160 00:48:06,030 --> 00:48:08,010 My feeling is each function 1161 00:48:08,010 --> 00:48:11,490 in MATLAB probably was faster, because they have a handwritten, really 1162 00:48:11,490 --> 00:48:12,330 fast thing. 
1163 00:48:12,330 --> 00:48:14,200 But the copying of data from-- 1164 00:48:14,200 --> 00:48:17,040 the moving from cache, the cache effects, 1165 00:48:17,040 --> 00:48:20,580 were really slowing it down. 1166 00:48:20,580 --> 00:48:24,150 So here's the thing that we showed before. 1167 00:48:24,150 --> 00:48:26,280 This is a very complicated algorithm. 1168 00:48:26,280 --> 00:48:28,530 It's what we call a pyramidal algorithm. 1169 00:48:28,530 --> 00:48:30,810 So what it does is you take a [INAUDIBLE] in here. 1170 00:48:30,810 --> 00:48:33,300 And you divide it into a bunch of blocks in here 1171 00:48:33,300 --> 00:48:35,400 in each level of the pyramid. 1172 00:48:35,400 --> 00:48:40,110 And you do some computation, do some lookups, and do some 1173 00:48:40,110 --> 00:48:41,370 upsampling in here. 1174 00:48:41,370 --> 00:48:44,400 You do some additional computation and compute that. 1175 00:48:44,400 --> 00:48:47,680 And then you create more and more, smaller and smaller images 1176 00:48:47,680 --> 00:48:48,180 in here. 1177 00:48:48,180 --> 00:48:50,180 You do-- you basically [INAUDIBLE] image pyramid 1178 00:48:50,180 --> 00:48:50,860 in here. 1179 00:48:50,860 --> 00:48:55,380 And so to do this right, it's not that simple. 1180 00:48:55,380 --> 00:48:57,290 What that means is, in each of these levels, 1181 00:48:57,290 --> 00:48:59,850 there are different balances you want to be at. 1182 00:48:59,850 --> 00:49:02,722 If you have a lot of data, parallelism is not 1183 00:49:02,722 --> 00:49:03,930 that important at that point. 1184 00:49:03,930 --> 00:49:05,160 Because you have parallelism anyway. 1185 00:49:05,160 --> 00:49:07,410 You probably have to focus a lot more on locality. 1186 00:49:07,410 --> 00:49:09,150 But when you get to the smaller amounts, 1187 00:49:09,150 --> 00:49:10,380 I think parallelism matters. 1188 00:49:10,380 --> 00:49:12,870 So you have to come up with very interesting balances 1189 00:49:12,870 --> 00:49:13,680 between those. 
1190 00:49:13,680 --> 00:49:16,390 So many, many things to tune at every level. 1191 00:49:16,390 --> 00:49:17,550 There are not three things. 1192 00:49:17,550 --> 00:49:19,140 There are hundreds of different levels. 1193 00:49:19,140 --> 00:49:21,100 So the nice thing about Halide is 1194 00:49:21,100 --> 00:49:23,370 you can play with all these things. 1195 00:49:23,370 --> 00:49:25,462 You can play with all these different concepts 1196 00:49:25,462 --> 00:49:27,420 and figure out which actually gives the fastest 1197 00:49:27,420 --> 00:49:28,253 performance in that. 1198 00:49:31,910 --> 00:49:36,790 So a little bit of, I would say, bragging rights 1199 00:49:36,790 --> 00:49:41,800 in here for Halide. Halide left MIT about, I think, 1200 00:49:41,800 --> 00:49:43,510 six years ago. 1201 00:49:43,510 --> 00:49:46,660 And right now, it's everywhere in Google. 1202 00:49:46,660 --> 00:49:49,056 So it's on Android phones. 1203 00:49:49,056 --> 00:49:50,845 It started with Google Glass, 1204 00:49:50,845 --> 00:49:54,850 which doesn't exist anymore. 1205 00:49:54,850 --> 00:49:57,550 And in fact, 1206 00:49:57,550 --> 00:50:01,030 all the images, all the videos uploaded to YouTube right now, 1207 00:50:01,030 --> 00:50:02,620 they do front-end processing. 1208 00:50:02,620 --> 00:50:05,600 And that processing pipeline is written in Halide. 1209 00:50:05,600 --> 00:50:11,110 And they switched to Halide because the Halide code was, 1210 00:50:11,110 --> 00:50:14,650 I think, 4-5% faster than the previous version. 1211 00:50:14,650 --> 00:50:19,480 And 4-5% faster for Google was multi-million dollars 1212 00:50:19,480 --> 00:50:22,420 saved for them, because there are so many videos 1213 00:50:22,420 --> 00:50:23,680 getting uploaded to that. 
1214 00:50:26,290 --> 00:50:28,840 So recently, there was a Photoshop announcement 1215 00:50:28,840 --> 00:50:31,750 saying they have an iOS version of Photoshop 1216 00:50:31,750 --> 00:50:32,352 from Adobe. 1217 00:50:32,352 --> 00:50:33,310 They just announced it. 1218 00:50:33,310 --> 00:50:34,690 I don't think it's even out yet. 1219 00:50:34,690 --> 00:50:37,960 And the entire set of Photoshop filters is 1220 00:50:37,960 --> 00:50:39,810 written in this new version using Halide. 1221 00:50:46,280 --> 00:50:49,640 Qualcomm released this processor called the Snapdragon image 1222 00:50:49,640 --> 00:50:50,390 processor. 1223 00:50:50,390 --> 00:50:53,950 So they built that processor to do image processing in there. 1224 00:50:53,950 --> 00:50:57,560 And the programming language to program that processor 1225 00:50:57,560 --> 00:50:58,940 is basically Halide. 1226 00:50:58,940 --> 00:51:00,590 So you write the code in Halide. 1227 00:51:00,590 --> 00:51:02,450 So that is kind of the assembly level 1228 00:51:02,450 --> 00:51:05,600 that makes it available for this in here. 1229 00:51:05,600 --> 00:51:07,410 And also, Intel is using that. 1230 00:51:07,410 --> 00:51:11,090 So there's a lot of use of this system at this point, which 1231 00:51:11,090 --> 00:51:15,560 is really fun to see-- an academic project getting to a point where it's 1232 00:51:15,560 --> 00:51:16,910 very heavily used. 1233 00:51:16,910 --> 00:51:19,955 And part of that is because it's very useful. 1234 00:51:19,955 --> 00:51:22,370 Because people realize they need to optimize 1235 00:51:22,370 --> 00:51:26,270 this code, because for cameras and stuff, performance matters. 1236 00:51:26,270 --> 00:51:29,900 And instead of having some poor engineer spend 1237 00:51:29,900 --> 00:51:32,990 months in the corner, just trying out those things, 1238 00:51:32,990 --> 00:51:37,410 you can try the same things, and a lot more, by doing it faster. 
1239 00:51:37,410 --> 00:51:39,360 OK, so let me ask you a question. 1240 00:51:39,360 --> 00:51:44,750 So now, between Halide and GraphIt, what did you find? 1241 00:51:47,918 --> 00:51:48,960 A bunch of similarities-- 1242 00:51:48,960 --> 00:51:52,590 I want to figure out, are there any interesting similarities 1243 00:51:52,590 --> 00:51:54,630 you guys found between these two projects? 1244 00:52:04,490 --> 00:52:08,313 AUDIENCE: They both allow you to try optimizations really fast. 1245 00:52:08,313 --> 00:52:09,730 SAMAN AMARASIGNHE: So part of that 1246 00:52:09,730 --> 00:52:14,380 is also that a lot of times, compilers are kind of a black box. 1247 00:52:14,380 --> 00:52:16,150 We know everything, just feed us, 1248 00:52:16,150 --> 00:52:18,110 we'll give you the really fast code. 1249 00:52:18,110 --> 00:52:21,100 And the problem is, they're never the fastest. 1250 00:52:21,100 --> 00:52:23,670 So if you really care about performance, you get 90%. 1251 00:52:23,670 --> 00:52:25,780 Then you get really frustrated-- now what do I do? 1252 00:52:25,780 --> 00:52:28,240 But this was, OK, I'm not going to-- 1253 00:52:28,240 --> 00:52:29,800 you know better what to do. 1254 00:52:29,800 --> 00:52:31,510 But I'll make your life simpler. 1255 00:52:31,510 --> 00:52:33,760 So we still want the performance engineer. 1256 00:52:33,760 --> 00:52:37,480 It's not that a person who just doesn't understand performance 1257 00:52:37,480 --> 00:52:38,572 feeds it in and gets fast code. 1258 00:52:38,572 --> 00:52:40,030 We need a performance-- but we want 1259 00:52:40,030 --> 00:52:42,270 to make the performance engineer's life easier. 1260 00:52:42,270 --> 00:52:44,675 So both of them said, OK, we need a performance engineer. 1261 00:52:44,675 --> 00:52:45,550 We can't automate it. 1262 00:52:45,550 --> 00:52:47,508 We don't know how to automate all these things. 1263 00:52:47,508 --> 00:52:49,000 There's too much complexity. 
1264 00:52:49,000 --> 00:52:53,050 But we will let you, the performance engineer, explain what to do. 1265 00:52:53,050 --> 00:52:55,330 But we'll make your life very simple. 1266 00:52:55,330 --> 00:52:57,112 What else? 1267 00:52:57,112 --> 00:52:58,534 AUDIENCE: Something that was cool 1268 00:52:58,534 --> 00:53:00,930 was both of these languages can do 1269 00:53:00,930 --> 00:53:03,420 algorithmic level optimizations [INAUDIBLE], 1270 00:53:03,420 --> 00:53:07,757 which is pretty different from what compilers like GCC 1271 00:53:07,757 --> 00:53:08,840 are explained [INAUDIBLE]. 1272 00:53:08,840 --> 00:53:10,257 SAMAN AMARASIGNHE: Yeah, because-- 1273 00:53:10,257 --> 00:53:11,810 I wouldn't say alg-- 1274 00:53:11,810 --> 00:53:14,910 you can do a lot of domain specific optimization. 1275 00:53:14,910 --> 00:53:17,320 So algorithmic optimization is one level higher. 1276 00:53:17,320 --> 00:53:19,562 You can say, ah ha, I have a better algorithm. 1277 00:53:19,562 --> 00:53:21,520 So, OK, I don't want to do a quicksort here. 1278 00:53:21,520 --> 00:53:22,540 I can do insertion sort. 1279 00:53:22,540 --> 00:53:24,510 Because between quicksort and insertion 1280 00:53:24,510 --> 00:53:27,140 sort, insertion sort might be faster for a certain class of inputs. 1281 00:53:27,140 --> 00:53:29,620 So that is a level of change we don't do. 1282 00:53:29,620 --> 00:53:31,662 Or, worse yet, I can say-- 1283 00:53:31,662 --> 00:53:34,120 this happens in a lot of things in machine learning-- yeah, 1284 00:53:34,120 --> 00:53:36,153 if I just drop a number here, I'm OK. 1285 00:53:36,153 --> 00:53:38,070 I don't have to get the compute exactly right. 1286 00:53:38,070 --> 00:53:38,728 Oh yeah, if I-- 1287 00:53:38,728 --> 00:53:40,270 I don't have to calculate everything. 1288 00:53:40,270 --> 00:53:42,690 If I calculate for 10 people, it's good enough. 1289 00:53:42,690 --> 00:53:44,910 So that kind of change, you can't do. 1290 00:53:44,910 --> 00:53:46,330 Because that's very contextual. 
1291 00:53:46,330 --> 00:53:48,110 Like, for example, a lot of time, 1292 00:53:48,110 --> 00:53:50,380 if you are doing things like machine learning, 1293 00:53:50,380 --> 00:53:52,270 there's no right answer. 1294 00:53:52,270 --> 00:53:53,710 You need to have a good answer. 1295 00:53:53,710 --> 00:53:56,500 So sometimes good means you can not do certain things. 1296 00:53:56,500 --> 00:54:00,310 And you need to find what things you shouldn't-- you cannot do, 1297 00:54:00,310 --> 00:54:03,160 that you get a huge benefit, but you don't lose that much. 1298 00:54:03,160 --> 00:54:04,540 That level you can't do that. 1299 00:54:04,540 --> 00:54:06,950 That's the next level of [INAUDIBLE] is saying, OK, 1300 00:54:06,950 --> 00:54:07,780 how do you do that? 1301 00:54:07,780 --> 00:54:09,940 How do you-- when somebody say, OK, look, 1302 00:54:09,940 --> 00:54:13,000 I can train it for 10 iterations versus 100-- 1303 00:54:13,000 --> 00:54:14,750 ah, 10 is good enough. 1304 00:54:14,750 --> 00:54:17,140 I can't-- if your code is written to train for 100 1305 00:54:17,140 --> 00:54:19,660 iterations, I can't tell you, oh yeah, 10 is good enough. 1306 00:54:19,660 --> 00:54:23,220 That is a decision that has to be a lot higher level than what 1307 00:54:23,220 --> 00:54:23,720 I can make. 1308 00:54:23,720 --> 00:54:26,500 So that's a-- that-- there's an interesting level that 1309 00:54:26,500 --> 00:54:30,200 can still exist on top of that, which we can't automate that 1310 00:54:30,200 --> 00:54:30,700 easily. 1311 00:54:30,700 --> 00:54:33,222 But we might be able to make, still, 1312 00:54:33,222 --> 00:54:35,680 a language, like a schedule language, give you that option. 1313 00:54:35,680 --> 00:54:37,400 That's a cool option to give, say 1314 00:54:37,400 --> 00:54:40,090 try some of these things that actually change the algorithm. 
1315 00:54:40,090 --> 00:54:42,970 But within the algorithm, that means I'll still 1316 00:54:42,970 --> 00:54:44,400 give you the same answer. 1317 00:54:44,400 --> 00:54:46,128 I will try different things. 1318 00:54:51,492 --> 00:54:52,200 Any other things? 1319 00:54:52,200 --> 00:54:56,816 Any other things you guys thought that was interesting? 1320 00:55:00,600 --> 00:55:03,540 How about from somewhere here? 1321 00:55:03,540 --> 00:55:05,470 What are the interesting things you found? 1322 00:55:08,200 --> 00:55:08,950 Back there. 1323 00:55:08,950 --> 00:55:11,280 AUDIENCE: They both involve a lot of trial and error. 1324 00:55:11,280 --> 00:55:12,780 SAMAN AMARASIGNHE: Yes, both involve 1325 00:55:12,780 --> 00:55:13,822 a lot of trial and error. 1326 00:55:13,822 --> 00:55:17,230 I mean, this is the modern computer systems. 1327 00:55:17,230 --> 00:55:19,160 Everything is extremely complicated. 1328 00:55:19,160 --> 00:55:20,870 There's no right way of doing things 1329 00:55:20,870 --> 00:55:23,272 when you look at this pretty large piece of code. 1330 00:55:23,272 --> 00:55:24,730 And there might be a lot of-- there 1331 00:55:24,730 --> 00:55:29,530 are caches, parallelism, locality, a lot of things 1332 00:55:29,530 --> 00:55:30,895 that can go right. 1333 00:55:30,895 --> 00:55:32,770 And so you might have to try out many things. 1334 00:55:32,770 --> 00:55:34,780 So if you know the answer, if you come up with the, 1335 00:55:34,780 --> 00:55:36,580 I know exactly, every time, I have the right answer, 1336 00:55:36,580 --> 00:55:37,570 that's amazing. 1337 00:55:37,570 --> 00:55:40,330 But even the best performance person 1338 00:55:40,330 --> 00:55:42,850 might not be able to look at a piece of code and say, ah ha, 1339 00:55:42,850 --> 00:55:43,725 I know your solution. 1340 00:55:43,725 --> 00:55:45,490 You do a lot of trial and error. 1341 00:55:45,490 --> 00:55:46,640 This kind of supports that. 
1342 00:55:46,640 --> 00:55:48,430 And you probably have figured that one out 1343 00:55:48,430 --> 00:55:49,513 for most of your projects. 1344 00:55:49,513 --> 00:55:52,120 It's not like you went and said, ah, I know what to do. 1345 00:55:52,120 --> 00:55:54,040 You probably did many different trials. 1346 00:55:54,040 --> 00:55:55,900 And I know that, for a lot of things, 1347 00:55:55,900 --> 00:55:58,510 they actually either have no impact or slow down the code. 1348 00:55:58,510 --> 00:56:00,010 And you say, oh, that didn't work. 1349 00:56:00,010 --> 00:56:02,440 And all the [INAUDIBLE] you leave in your code 1350 00:56:02,440 --> 00:56:05,120 shows all the crazy things you've tried, 1351 00:56:05,120 --> 00:56:06,878 and nothing happened. 1352 00:56:06,878 --> 00:56:08,170 So that's an interesting thing. 1353 00:56:08,170 --> 00:56:08,670 What else? 1354 00:56:15,160 --> 00:56:17,290 Anything else, a little bit differently 1355 00:56:17,290 --> 00:56:18,756 that you see on this one? 1356 00:56:25,100 --> 00:56:29,004 AUDIENCE: I was just wondering, are there other similar domains 1357 00:56:29,004 --> 00:56:35,267 that don't have something like this [INAUDIBLE]? 1358 00:56:35,267 --> 00:56:37,100 SAMAN AMARASIGNHE: So interesting question-- 1359 00:56:37,100 --> 00:56:41,390 are there any other domains that don't have something like this? 1360 00:56:41,390 --> 00:56:43,580 People are working on similar things 1361 00:56:43,580 --> 00:56:44,950 to machine learning these days. 1362 00:56:44,950 --> 00:56:48,380 That seems to be their domain, and TensorFlow, 1363 00:56:48,380 --> 00:56:51,350 and all those people are trying to do-- 1364 00:56:51,350 --> 00:56:54,533 to build systems like similar-- like frameworks 1365 00:56:54,533 --> 00:56:55,450 that you can get that. 1366 00:56:57,892 --> 00:56:58,850 I mean, that's a very-- 1367 00:56:58,850 --> 00:57:02,180 I think-- the way I have operated 1368 00:57:02,180 --> 00:57:04,210 is I go talk to people. 
1369 00:57:04,210 --> 00:57:07,760 And sometimes you find this poor graduate student, 1370 00:57:07,760 --> 00:57:11,220 or postdoc who want to do some research but spending all 1371 00:57:11,220 --> 00:57:13,882 of their time basically optimizing their piece of code, 1372 00:57:13,882 --> 00:57:15,590 because they can't get their performance. 1373 00:57:15,590 --> 00:57:17,150 And then that might be a good domain. 1374 00:57:17,150 --> 00:57:19,010 You find these people in physics. 1375 00:57:19,010 --> 00:57:21,040 You find these people in biology. 1376 00:57:21,040 --> 00:57:22,370 And I am actually talking to-- 1377 00:57:22,370 --> 00:57:24,500 because, for example, in biology, 1378 00:57:24,500 --> 00:57:26,930 a lot of this gene sequencing stuff is-- 1379 00:57:26,930 --> 00:57:29,300 there are very similar things you have to do. 1380 00:57:29,300 --> 00:57:31,250 But they seem to be spending all this time 1381 00:57:31,250 --> 00:57:33,180 writing the code, and then-- 1382 00:57:33,180 --> 00:57:35,390 mired in code complexity. 1383 00:57:35,390 --> 00:57:37,160 OK, can you do something in that? 1384 00:57:37,160 --> 00:57:39,950 I mean, the key thing is this is a good way to-- a nice thing 1385 00:57:39,950 --> 00:57:43,040 about MIT is there are good-- a lot of very smart people 1386 00:57:43,040 --> 00:57:46,115 in many different domains trying to push the state of the art. 1387 00:57:46,115 --> 00:57:48,740 And who's spending all this time cursing in front of a computer 1388 00:57:48,740 --> 00:57:51,590 program to get to a point they want to do, 1389 00:57:51,590 --> 00:57:54,470 because-- not because they don't know the algorithm, 1390 00:57:54,470 --> 00:57:57,350 because the amount of data they have to deal with-- 1391 00:57:57,350 --> 00:58:01,310 astronomy, I mean they get these multiple telescopes, 1392 00:58:01,310 --> 00:58:02,990 that deluge of data. 1393 00:58:02,990 --> 00:58:05,360 And most of the time, they know what they have to do. 
1394 00:58:05,360 --> 00:58:07,152 They just have to-- can't write the program 1395 00:58:07,152 --> 00:58:08,070 to do it fast enough. 1396 00:58:08,070 --> 00:58:10,470 So there might be domains like that, if you look at that. 1397 00:58:10,470 --> 00:58:12,877 And there might be domains from application and domains 1398 00:58:12,877 --> 00:58:13,460 from patterns. 1399 00:58:13,460 --> 00:58:15,350 Like sparse matrices or graphs are 1400 00:58:15,350 --> 00:58:18,410 patterns, which-- or not only on a single application. 1401 00:58:18,410 --> 00:58:20,340 I mean, it works in multiple places. 1402 00:58:20,340 --> 00:58:21,590 There might be other patterns. 1403 00:58:21,590 --> 00:58:24,167 Say, this is-- if you want to do research, 1404 00:58:24,167 --> 00:58:26,250 this might be interesting piece of doing research. 1405 00:58:26,250 --> 00:58:31,070 And I have spent my life finding different domains and a bunch 1406 00:58:31,070 --> 00:58:34,160 of people that spend their lifetime just hand hacking 1407 00:58:34,160 --> 00:58:37,400 things and telling them, OK, let me see if we 1408 00:58:37,400 --> 00:58:41,220 can do some nice abstraction. 1409 00:58:41,220 --> 00:58:44,870 Anything that you guys found that's interesting? 1410 00:58:44,870 --> 00:58:48,410 So to both of them, what are-- what's 1411 00:58:48,410 --> 00:58:52,265 the space that they operated on to optimize programs? 1412 00:58:59,195 --> 00:59:00,660 AUDIENCE: [INAUDIBLE]. 1413 00:59:00,660 --> 00:59:01,350 SAMAN AMARASIGNHE: Done for me, no. 1414 00:59:01,350 --> 00:59:03,160 What I'm saying is, what are the things 1415 00:59:03,160 --> 00:59:04,630 that you're trying to optimize? 1416 00:59:04,630 --> 00:59:07,630 There's a nice space of three different things-- 1417 00:59:07,630 --> 00:59:11,470 parallelism, locality, and redundant work. 1418 00:59:11,470 --> 00:59:14,620 My feeling is, as you go as a performance engineer, that's 1419 00:59:14,620 --> 00:59:16,785 going to be your life. 
1420 00:59:16,785 --> 00:59:18,160 If I add additional things, there 1421 00:59:18,160 --> 00:59:20,827 might be algorithmic things that completely get rid of work. 1422 00:59:20,827 --> 00:59:22,690 But most of the time, we are-- 1423 00:59:22,690 --> 00:59:24,490 all of you will be [INAUDIBLE] performance 1424 00:59:24,490 --> 00:59:28,100 will be working on some kind of multi-core, vector, GPU type 1425 00:59:28,100 --> 00:59:28,600 units. 1426 00:59:28,600 --> 00:59:30,220 You have to get parallelism. 1427 00:59:30,220 --> 00:59:32,380 So getting parallelism is important. 1428 00:59:32,380 --> 00:59:35,175 But then, if you don't have locality, it doesn't matter. 1429 00:59:35,175 --> 00:59:37,300 Because most of the time you're waiting to get data 1430 00:59:37,300 --> 00:59:38,508 from all the way from memory. 1431 00:59:38,508 --> 00:59:40,090 So you have to get good locality. 1432 00:59:40,090 --> 00:59:41,800 And then more-- a lot of times you 1433 00:59:41,800 --> 00:59:45,170 can do that really well if you do some extra computation. 1434 00:59:45,170 --> 00:59:47,410 But if you do too many extra things, that's going to, 1435 00:59:47,410 --> 00:59:48,800 oh well, that's not going to help you. 1436 00:59:48,800 --> 00:59:50,650 So it's all about playing the distribution. 1437 00:59:50,650 --> 00:59:51,790 You've got a final project. 1438 00:59:51,790 --> 00:59:53,330 That's exactly what you're going to do. 1439 00:59:53,330 --> 00:59:55,288 You might say, ah, if I can do some extra work, 1440 00:59:55,288 --> 00:59:56,590 OK, I can do this faster. 1441 00:59:56,590 --> 00:59:59,950 But oops, no, this extra pre-compute pass, or whatever, 1442 00:59:59,950 --> 01:00:00,490 it's not-- 1443 01:00:00,490 --> 01:00:01,870 I can't amortize the cost. 1444 01:00:01,870 --> 01:00:07,330 So there are these three things that you're trading off there. 1445 01:00:07,330 --> 01:00:10,040 So that's one interesting thing. 
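The extra-work tradeoff described above can be sketched with a toy example. This example and its names are illustrative only, not from the lecture: a prefix-sum table is an "extra pre-compute pass," and it only pays off if enough queries amortize its cost.

```python
# Toy illustration (not from the lecture) of trading extra computation
# for speed: precompute a prefix-sum table versus recomputing each query.

def range_sum_naive(data, queries):
    # No extra work: recompute each range sum from scratch, O(n) per query.
    return [sum(data[lo:hi]) for lo, hi in queries]

def range_sum_precomputed(data, queries):
    # Extra pre-compute pass: build prefix sums once (O(n)), then each
    # query is O(1) -- and scanning one small table is cache-friendly.
    prefix = [0]
    for x in data:
        prefix.append(prefix[-1] + x)
    return [prefix[hi] - prefix[lo] for lo, hi in queries]

data = list(range(1000))
queries = [(i, i + 100) for i in range(0, 900, 10)]
assert range_sum_naive(data, queries) == range_sum_precomputed(data, queries)
```

With a single query, the pre-compute pass can't amortize its cost; with many queries it wins. That is exactly the "can I amortize the cost?" question the lecture raises.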
1446 01:00:10,040 --> 01:00:15,150 Another thing is we made it available for the programmers 1447 01:00:15,150 --> 01:00:18,220 to do this scheduling language. 1448 01:00:18,220 --> 01:00:20,685 But can you make it-- 1449 01:00:20,685 --> 01:00:22,060 can you think of a way to make it 1450 01:00:22,060 --> 01:00:24,230 a little bit easier for programmers 1451 01:00:24,230 --> 01:00:25,630 than doing a scheduling language? 1452 01:00:25,630 --> 01:00:27,038 What can I do? 1453 01:00:27,038 --> 01:00:29,080 What's the nice thing about scheduling languages? 1454 01:00:31,950 --> 01:00:34,090 It's very simple. 1455 01:00:34,090 --> 01:00:36,218 It has a very simple pattern. 1456 01:00:36,218 --> 01:00:39,483 AUDIENCE: [INAUDIBLE] 1457 01:00:39,483 --> 01:00:41,650 SAMAN AMARASIGNHE: Yeah, I mean, that's-- the number 1458 01:00:41,650 --> 01:00:42,340 of options-- 1459 01:00:42,340 --> 01:00:44,140 it's not like you can write any program. 1460 01:00:44,140 --> 01:00:47,480 There are certain things you can do in the schedule. 1461 01:00:47,480 --> 01:00:51,040 So if you know that space, you can sort of 1462 01:00:51,040 --> 01:00:52,790 do that search smartly. 1463 01:00:52,790 --> 01:00:55,545 What else can we do with it? 1464 01:00:55,545 --> 01:00:56,735 AUDIENCE: Test them all. 1465 01:00:56,735 --> 01:00:58,110 SAMAN AMARASIGNHE: Test them all. 1466 01:00:58,110 --> 01:00:59,270 That's one approach there. 1467 01:00:59,270 --> 01:01:01,520 AUDIENCE: Use autotuning, trying to find [INAUDIBLE]. 1468 01:01:01,520 --> 01:01:03,187 SAMAN AMARASIGNHE: We can do autotuning. 1469 01:01:03,187 --> 01:01:07,160 So that switches into the autotuning part of this talk. 1470 01:01:07,160 --> 01:01:10,200 So performance engineering basically, most of the time, 1471 01:01:10,200 --> 01:01:13,350 is finding the right values for these crazy things. 
1472 01:01:13,350 --> 01:01:16,560 Like you start looking at, I think, probably 1473 01:01:16,560 --> 01:01:18,840 as Charles talked about, these voodoo parameters, 1474 01:01:18,840 --> 01:01:20,880 like, OK, what's the right block size? 1475 01:01:20,880 --> 01:01:23,220 And it has a big impact, finding that. 1476 01:01:23,220 --> 01:01:25,112 In your memory allocation project, 1477 01:01:25,112 --> 01:01:26,570 you had to find the right strategy, 1478 01:01:26,570 --> 01:01:27,850 the right memory allocation. 1479 01:01:27,850 --> 01:01:29,930 You searched through a bunch of these things. 1480 01:01:29,930 --> 01:01:33,600 In the GCC compiler, there are, I 1481 01:01:33,600 --> 01:01:38,400 think, about 400 different flags for GCC. 1482 01:01:38,400 --> 01:01:42,270 And you can actually get a factor of four in performance 1483 01:01:42,270 --> 01:01:47,330 by having a [INAUDIBLE] 200 flags into GCC. 1484 01:01:47,330 --> 01:01:48,960 It's crazy. 1485 01:01:48,960 --> 01:01:52,297 And that 200 flags is not the same for every program. 1486 01:01:52,297 --> 01:01:54,630 And, of course, some programs will crash in some places. 1487 01:01:54,630 --> 01:01:57,330 Most of the time, it'll slow down or speed up. 1488 01:01:57,330 --> 01:02:00,850 So you can just give all the flags of GCC and autotune that. 1489 01:02:00,850 --> 01:02:03,330 And you can get a factor of two to four in performance in there. 1490 01:02:03,330 --> 01:02:04,170 It's just crazy. 1491 01:02:04,170 --> 01:02:06,810 And then, because it can do weird things in there, -O1, 1492 01:02:06,810 --> 01:02:09,450 -O2, -O3 will only do a certain amount. 1493 01:02:09,450 --> 01:02:11,272 So -O3 doesn't-- it's not always right. 1494 01:02:11,272 --> 01:02:12,480 You can try all these things. 1495 01:02:12,480 --> 01:02:14,400 So you can tune that. 1496 01:02:14,400 --> 01:02:18,210 And scheduling Halide, scheduling GraphIt, 1497 01:02:18,210 --> 01:02:21,620 all these things can be autotuned. 
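A hypothetical sketch of the flag tuning described above. The flag names below are real GCC flags, but the cost function is a made-up stand-in: a real tuner would compile the program with each flag setting and time the resulting binary end to end.

```python
import random

# Illustrative sketch (not OpenTuner itself): sample random on/off
# settings for a handful of GCC-style flags and keep the best seen.
FLAGS = ["-funroll-loops", "-ftree-vectorize",
         "-fomit-frame-pointer", "-finline-functions"]

def toy_cost(enabled):
    # Stand-in for "compile and run": pretend vectorization helps a lot
    # and unrolling helps a little. Lower is better.
    cost = 10.0
    if "-ftree-vectorize" in enabled:
        cost -= 4.0
    if "-funroll-loops" in enabled:
        cost -= 1.0
    return cost

def random_search(cost_fn, trials=100, seed=0):
    # The autotuning loop: pick a random flag subset, evaluate, repeat.
    rng = random.Random(seed)
    best, best_cost = [], cost_fn(frozenset())
    for _ in range(trials):
        enabled = [f for f in FLAGS if rng.random() < 0.5]
        cost = cost_fn(frozenset(enabled))
        if cost < best_cost:
            best, best_cost = enabled, cost
    return best, best_cost

best, cost = random_search(toy_cost)
assert "-ftree-vectorize" in best  # the flag that matters gets found
```

With 400 real flags the space is 2^400, so exhaustive search is out; random sampling like this, or OpenTuner's smarter search, is the only practical option.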
1498 01:02:21,620 --> 01:02:25,500 So before autotuning, when you have a large search space, 1499 01:02:25,500 --> 01:02:27,560 what do we normally do? 1500 01:02:27,560 --> 01:02:31,050 The thing that when we think we are smart, what we do 1501 01:02:31,050 --> 01:02:32,310 is we build models. 1502 01:02:32,310 --> 01:02:33,558 We have model for a cache. 1503 01:02:33,558 --> 01:02:35,100 And say, can we understand the cache? 1504 01:02:35,100 --> 01:02:37,080 We have all these nice models in here. 1505 01:02:37,080 --> 01:02:39,160 And using the model, I can predict, ah ha, 1506 01:02:39,160 --> 01:02:41,868 this is the right block size. 1507 01:02:41,868 --> 01:02:43,910 So what's the problem when you try to do a model? 1508 01:02:49,333 --> 01:02:52,300 AUDIENCE: Sometime it doesn't work [INAUDIBLE].. 1509 01:02:52,300 --> 01:02:54,100 SAMAN AMARASIGNHE: Exactly, because most 1510 01:02:54,100 --> 01:02:57,818 of the time, when you try to do a model, you have to abstract. 1511 01:02:57,818 --> 01:02:59,860 And most of the time, you-- what you abstract out 1512 01:02:59,860 --> 01:03:01,840 might be the most important part of the darn thing 1513 01:03:01,840 --> 01:03:02,840 that we didn't consider. 1514 01:03:02,840 --> 01:03:04,450 So we built a model for cache. 1515 01:03:04,450 --> 01:03:06,587 But oops, I could-- didn't figure out pages. 1516 01:03:06,587 --> 01:03:08,170 Well, the pages made a big difference. 1517 01:03:08,170 --> 01:03:09,940 So there might be things in real life 1518 01:03:09,940 --> 01:03:12,363 that matters that didn't fit into your model, 1519 01:03:12,363 --> 01:03:14,530 or you didn't know it needed to be fit into a model. 1520 01:03:14,530 --> 01:03:16,030 If you try to put everything, that's 1521 01:03:16,030 --> 01:03:17,260 too complicated of a model. 1522 01:03:17,260 --> 01:03:18,850 So you abstract something out. 1523 01:03:18,850 --> 01:03:22,830 You can say, I have optimal result for this model. 
1524 01:03:22,830 --> 01:03:25,450 But that optimal result might be way off 1525 01:03:25,450 --> 01:03:27,070 from the simple result you can get. 1526 01:03:27,070 --> 01:03:29,785 Because the things that you didn't put into 1527 01:03:29,785 --> 01:03:31,160 the model are the important ones. 1528 01:03:31,160 --> 01:03:32,500 So the model doesn't work. 1529 01:03:32,500 --> 01:03:34,990 The next thing you do is a heuristic-based thing. 1530 01:03:34,990 --> 01:03:37,780 This is where these old people come and say, 1531 01:03:37,780 --> 01:03:38,697 I know how to do this. 1532 01:03:38,697 --> 01:03:41,155 In order to do this, you need to do this thing, this thing, 1533 01:03:41,155 --> 01:03:41,860 this thing. 1534 01:03:41,860 --> 01:03:48,460 You can come up with some kind of the old grandmother's 1535 01:03:48,460 --> 01:03:49,420 solution type thing. 1536 01:03:49,420 --> 01:03:51,550 There are certain things that will always work. 1537 01:03:51,550 --> 01:03:52,870 And you hardcode them. 1538 01:03:52,870 --> 01:03:57,790 So you can say if the matrix dimension is more than 1,000, 1539 01:03:57,790 --> 01:04:02,133 always go to blocking, or some kind of rules like that. 1540 01:04:02,133 --> 01:04:03,550 These rules work most of the time. 1541 01:04:03,550 --> 01:04:08,820 But obviously, there are certain cases where the rules don't work. 1542 01:04:08,820 --> 01:04:12,095 Worse, those rules might be set for a certain machine, 1543 01:04:12,095 --> 01:04:13,720 certain architecture, all those things. 1544 01:04:13,720 --> 01:04:14,720 I'll give you a story. 1545 01:04:14,720 --> 01:04:25,330 So GCC has this fast table sort routine. 1546 01:04:25,330 --> 01:04:29,380 So the fast table sort routine says sort 1547 01:04:29,380 --> 01:04:31,930 using a parallel quicksort. 1548 01:04:31,930 --> 01:04:37,030 And when the number goes below 16, switch to insertion sort. 1549 01:04:37,030 --> 01:04:38,610 It's hardcoded in GCC. 
1550 01:04:38,610 --> 01:04:41,590 It's like, wow, some amazing person figured 1551 01:04:41,590 --> 01:04:44,020 out 16, this amazing number, to switch 1552 01:04:44,020 --> 01:04:45,890 from parallel quicksort to insertion sort. 1553 01:04:45,890 --> 01:04:47,473 So we are trying to figure out, what's 1554 01:04:47,473 --> 01:04:49,930 the profoundness of this number? 1555 01:04:49,930 --> 01:04:54,220 The profoundness of this number is somewhere around 1995, 1556 01:04:54,220 --> 01:04:55,900 when this code was released. 1557 01:04:55,900 --> 01:04:58,960 In those machines, that was the right number. 1558 01:04:58,960 --> 01:05:02,170 That 16 was a really good number to switch from parallelism 1559 01:05:02,170 --> 01:05:04,840 to doing that, because of the cache size, stuff like that. 1560 01:05:04,840 --> 01:05:07,030 But that 16 survived from 1995 [INAUDIBLE] 1561 01:05:07,030 --> 01:05:08,200 even to date there. 1562 01:05:08,200 --> 01:05:12,100 Today that number should be like 500. 1563 01:05:12,100 --> 01:05:13,750 But it's in there, because somebody 1564 01:05:13,750 --> 01:05:15,792 thought 16 is right, and it's hardcoded in there. 1565 01:05:15,792 --> 01:05:17,080 It didn't change. 1566 01:05:17,080 --> 01:05:20,275 So there are a lot of things in compiler code like that, 1567 01:05:20,275 --> 01:05:22,720 that some programmer said, 1568 01:05:22,720 --> 01:05:23,830 I know what works here. 1569 01:05:23,830 --> 01:05:24,740 This fits in there. 1570 01:05:24,740 --> 01:05:25,660 You put it in there. 1571 01:05:25,660 --> 01:05:27,070 But there's no rhyme or reason. 1572 01:05:27,070 --> 01:05:29,140 Because at that time, they had a reason. 1573 01:05:29,140 --> 01:05:30,430 But it doesn't scale. 1574 01:05:30,430 --> 01:05:35,115 So a lot of these heuristics get out of focus very fast. 1575 01:05:35,115 --> 01:05:36,490 And there's no theory behind this 1576 01:05:36,490 --> 01:05:37,927 to say, now, how do you update? 
1577 01:05:37,927 --> 01:05:39,760 You had to ask people, why did you put that? 1578 01:05:39,760 --> 01:05:41,510 And then it's, oh yeah, because my machine 1579 01:05:41,510 --> 01:05:43,400 has 32 kilobytes of cache. 1580 01:05:43,400 --> 01:05:45,550 It's like, oh, OK, that's a different machine 1581 01:05:45,550 --> 01:05:47,810 than what we have today. 1582 01:05:47,810 --> 01:05:49,360 So that's the problem in here. 1583 01:05:49,360 --> 01:05:51,992 And then the other thing is you can do exhaustive search. 1584 01:05:51,992 --> 01:05:53,950 You can say, OK, I'll try every possible thing. 1585 01:05:53,950 --> 01:05:56,740 The problem here is sometimes my search space 1586 01:05:56,740 --> 01:05:59,650 is 10 to the power 10. 1587 01:05:59,650 --> 01:06:03,820 You don't have enough seconds in your lifetime 1588 01:06:03,820 --> 01:06:04,840 to do that search. 1589 01:06:04,840 --> 01:06:06,132 So it would be too complicated. 1590 01:06:06,132 --> 01:06:09,310 And that's where the autotuner comes in. 1591 01:06:09,310 --> 01:06:12,060 So-- oh, OK, actually I have a little bit more slides here. 1592 01:06:12,060 --> 01:06:13,887 So the model based solution is you come up 1593 01:06:13,887 --> 01:06:15,970 with this comprehensive model, like a cache model, 1594 01:06:15,970 --> 01:06:17,200 or something like that. 1595 01:06:17,200 --> 01:06:18,865 And you do that. 1596 01:06:18,865 --> 01:06:23,500 And you can exactly show what's right for the optimal solution 1597 01:06:23,500 --> 01:06:24,460 in here. 1598 01:06:24,460 --> 01:06:26,920 But the problem is it's hard to build models, 1599 01:06:26,920 --> 01:06:30,880 you cannot model everything, and most of the time what's missing 1600 01:06:30,880 --> 01:06:33,592 from the model is the most important thing. 1601 01:06:33,592 --> 01:06:35,050 Heuristic-based things are the rule 1602 01:06:35,050 --> 01:06:37,342 of thumb kind of solution that you come up with, and 1603 01:06:37,342 --> 01:06:39,220 it's hardcoded in there. 
1604 01:06:39,220 --> 01:06:41,110 And it's very simple and easy. 1605 01:06:41,110 --> 01:06:43,660 It works most of the time, if you get it right. 1606 01:06:43,660 --> 01:06:45,130 But the problem is it's too simplistic. 1607 01:06:45,130 --> 01:06:46,060 It doesn't scale. 1608 01:06:46,060 --> 01:06:50,400 It doesn't stand the test of time, most of the time in here. 1609 01:06:50,400 --> 01:06:52,640 An exhaustive search is great. 1610 01:06:52,640 --> 01:06:56,980 But the problem is there are just way too many 1611 01:06:56,980 --> 01:07:02,290 possibilities to search in here, too big of a search space. 1612 01:07:02,290 --> 01:07:03,440 You can't do that. 1613 01:07:03,440 --> 01:07:06,760 So this is where you want to prune the search space. 1614 01:07:06,760 --> 01:07:08,560 And for the pruning, the best way to do that 1615 01:07:08,560 --> 01:07:11,290 is basically to use autotuning. 1616 01:07:11,290 --> 01:07:13,870 So with autotuning, what you can do is you can define the space 1617 01:07:13,870 --> 01:07:17,980 of acceptable values nicely, choose a value at random-- 1618 01:07:17,980 --> 01:07:20,980 that's what the system will do, try it out there-- 1619 01:07:20,980 --> 01:07:23,230 and evaluate the performance of that value end to end. 1620 01:07:23,230 --> 01:07:24,340 Because end to end matters. 1621 01:07:24,340 --> 01:07:26,340 Because if you try to predict, most of the time, 1622 01:07:26,340 --> 01:07:27,650 it might not work. 1623 01:07:27,650 --> 01:07:33,820 And if it satisfies the performance that you need, you're done. 1624 01:07:33,820 --> 01:07:36,030 Otherwise, choose a new value and iterate over there, 1625 01:07:36,030 --> 01:07:36,870 go to step three in there. 1626 01:07:36,870 --> 01:07:38,090 So this is the kind of thing. 
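The loop just described can be sketched in a few lines, applied to the earlier story's voodoo parameter: the insertion-sort cutoff of a hybrid quicksort. This is a toy sketch of mine, not GCC's or OpenTuner's actual code; the function names and the search space `range(1, 512)` are illustrative assumptions.

```python
import random
import time

# Hybrid quicksort: below the tunable cutoff, fall back to insertion sort.
def hybrid_sort(a, cutoff):
    if len(a) <= cutoff:
        for i in range(1, len(a)):          # insertion sort for small inputs
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a
    pivot = a[len(a) // 2]
    lt = [x for x in a if x < pivot]
    eq = [x for x in a if x == pivot]
    gt = [x for x in a if x > pivot]
    return hybrid_sort(lt, cutoff) + eq + hybrid_sort(gt, cutoff)

def bench(cutoff, n=2000):
    # End-to-end evaluation: time one whole sort with this cutoff.
    data = random.Random(0).sample(range(n * 10), n)
    start = time.perf_counter()
    hybrid_sort(list(data), cutoff)
    return time.perf_counter() - start

def autotune_cutoff(space=range(1, 512), trials=30, seed=1):
    # 1. define the space of acceptable values; 2. choose one at random;
    # 3. evaluate it end to end; 4. keep the best seen and iterate.
    rng = random.Random(seed)
    best, best_time = None, float("inf")
    for _ in range(trials):
        cutoff = rng.choice(list(space))
        t = bench(cutoff)
        if t < best_time:
            best, best_time = cutoff, t
    return best

best = autotune_cutoff()
assert hybrid_sort([3, 1, 2], best) == [1, 2, 3]
```

Re-running this on a new machine re-discovers the right cutoff for that machine, which is exactly what the hardcoded 16 from 1995 could not do.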
1627 01:07:38,090 --> 01:07:39,970 And what you have to do is you need 1628 01:07:39,970 --> 01:07:45,430 to have a system to figure out how to do this fast, basically 1629 01:07:45,430 --> 01:07:49,120 what space to search, when to think you're 1630 01:07:49,120 --> 01:07:52,220 done, how to go through the iterating loops through that. 1631 01:07:52,220 --> 01:07:55,250 So this is, in a cartoonish way, what 1632 01:07:55,250 --> 01:07:56,930 happens: you give a value candidate, 1633 01:07:56,930 --> 01:07:59,720 you compile the program, you run it with a bunch of data-- 1634 01:07:59,720 --> 01:08:00,620 you run it through the bunch, 1635 01:08:00,620 --> 01:08:01,870 otherwise you are overfitting. 1636 01:08:01,870 --> 01:08:03,110 You can't run it with one. 1637 01:08:03,110 --> 01:08:04,110 And you get the results. 1638 01:08:04,110 --> 01:08:05,900 And you get something like an average. 1639 01:08:05,900 --> 01:08:09,800 And you go through this loop in here. 1640 01:08:09,800 --> 01:08:13,100 And what OpenTuner has done is come up 1641 01:08:13,100 --> 01:08:14,910 with an ensemble of techniques. 1642 01:08:14,910 --> 01:08:18,319 So the idea there is when you're searching through a space, 1643 01:08:18,319 --> 01:08:21,560 you might be at the bottom of a hill of the space. 1644 01:08:21,560 --> 01:08:24,050 So what that means is there are certain values where, if you keep 1645 01:08:24,050 --> 01:08:25,580 improving the value, you are getting 1646 01:08:25,580 --> 01:08:26,740 better, and better, and better. 1647 01:08:26,740 --> 01:08:28,840 And at that time, something like a hill climber-- 1648 01:08:32,510 --> 01:08:35,548 a hill climber, or something like an [INAUDIBLE] hill climber 1649 01:08:35,548 --> 01:08:37,340 can actually give you the best performance. 1650 01:08:37,340 --> 01:08:38,673 You're going very fast in there. 
1651 01:08:38,673 --> 01:08:41,782 But when you get to the top of the hill, oops, 1652 01:08:41,782 --> 01:08:42,740 there's no place to go. 1653 01:08:42,740 --> 01:08:44,450 So then if you try to hill climb, 1654 01:08:44,450 --> 01:08:45,710 it's not going to be helpful. 1655 01:08:45,710 --> 01:08:47,335 So at that time, what you want to do 1656 01:08:47,335 --> 01:08:50,760 is do something like a random search in here. 1657 01:08:50,760 --> 01:08:55,240 So what this system, OpenTuner, will do 1658 01:08:55,240 --> 01:08:57,830 is basically test these techniques in there. 1659 01:08:57,830 --> 01:09:00,260 And if something is doing very well, 1660 01:09:00,260 --> 01:09:02,840 it will give it more time. 1661 01:09:02,840 --> 01:09:06,020 If not, what it will do is it will say, OK, look, 1662 01:09:06,020 --> 01:09:08,128 this technique is not working. 1663 01:09:08,128 --> 01:09:09,170 Let's try something else. 1664 01:09:09,170 --> 01:09:10,920 It'll basically allocate the time in here. 1665 01:09:10,920 --> 01:09:15,950 So it does this search much faster than you otherwise could. 1666 01:09:15,950 --> 01:09:18,319 So I want to finish this by showing what you need 1667 01:09:18,319 --> 01:09:20,463 for autotuning for GraphIt. 1668 01:09:20,463 --> 01:09:22,380 So we have an algorithm, and you have a schedule. 1669 01:09:22,380 --> 01:09:25,202 It's a pain to write this schedule. 1670 01:09:25,202 --> 01:09:26,910 In fact, there's a good [? interesting ?] 1671 01:09:26,910 --> 01:09:30,500 thing in-- when you do Halide, we 1672 01:09:30,500 --> 01:09:34,640 decided, OK, it should be similar to write the algorithm 1673 01:09:34,640 --> 01:09:36,290 for Halide and the schedule. 1674 01:09:36,290 --> 01:09:39,580 Google very fast realized many people won't use Halide. 1675 01:09:39,580 --> 01:09:42,740 And they-- after about two years, they had about hundreds 1676 01:09:42,740 --> 01:09:45,109 of programmers who can write the algorithm. 
1677 01:09:45,109 --> 01:09:46,880 But they only had five people who could 1678 01:09:46,880 --> 01:09:48,380 write the really good schedule. 1679 01:09:48,380 --> 01:09:49,880 To write a really good schedule, you 1680 01:09:49,880 --> 01:09:51,420 need to understand a little bit of the algorithm. 1681 01:09:51,420 --> 01:09:52,950 You need to understand a little bit of the architecture, 1682 01:09:52,950 --> 01:09:54,229 a little bit of everything. 1683 01:09:54,229 --> 01:09:56,860 And that's much harder for people to learn. 1684 01:09:56,860 --> 01:10:00,362 So getting the schedule right is not that easy. 1685 01:10:00,362 --> 01:10:01,820 And same thing in here, because you 1686 01:10:01,820 --> 01:10:05,050 need to understand a lot unless you do kind of random-- 1687 01:10:05,050 --> 01:10:08,330 you've got certain arbitrary, but to do it right, 1688 01:10:08,330 --> 01:10:10,130 you need to know a little bit more. 1689 01:10:10,130 --> 01:10:13,550 So what we can do is we can basically 1690 01:10:13,550 --> 01:10:17,142 give some idea about the graphs and some idea 1691 01:10:17,142 --> 01:10:18,350 about the algorithm in there. 1692 01:10:18,350 --> 01:10:19,892 We can autotune these things in there 1693 01:10:19,892 --> 01:10:23,120 and then generate the schedule. 1694 01:10:23,120 --> 01:10:26,300 And so what we found was to generate this schedule, if you 1695 01:10:26,300 --> 01:10:28,890 do exhaustive search, it runs for days. 1696 01:10:28,890 --> 01:10:31,580 But if you're using autotuner, OpenTuner, 1697 01:10:31,580 --> 01:10:34,820 you can find a really good schedule for-- 1698 01:10:34,820 --> 01:10:37,040 in less than two hours. 1699 01:10:37,040 --> 01:10:40,010 And, in fact, a few cases we found schedules 1700 01:10:40,010 --> 01:10:42,798 that run better than what we thought 1701 01:10:42,798 --> 01:10:44,090 was the best possible schedule. 
1702 01:10:44,090 --> 01:10:46,250 Because it was able to-- 1703 01:10:46,250 --> 01:10:50,960 because it was able to search much better than our intuition 1704 01:10:50,960 --> 01:10:51,860 would say in here. 1705 01:10:51,860 --> 01:10:53,960 And when-- and even if our intuition know it, 1706 01:10:53,960 --> 01:10:56,420 it has more time to try many different combinations 1707 01:10:56,420 --> 01:10:58,950 and trying something in-- come something better in here. 1708 01:10:58,950 --> 01:11:03,850 So that's all I have today.