The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So it's my pleasure to introduce Professor Saman Amarasinghe as our guest lecturer today. Saman Amarasinghe is a professor in the EECS Department at MIT, and he's also the associate department head. He's an expert in compilers, domain-specific languages, and autotuning. In fact, he was the designer of the OpenTuner framework that you've been using for your homework assignments. So today Saman is going to tell us about some of his recent work on domain-specific languages and also on autotuning. So let's give Saman Amarasinghe a round of applause.

[APPLAUSE]

SAMAN AMARASINGHE: Thank you. OK, so I used to teach this class for many, many years. Unfortunately, now I am an administrator, so I don't-- Julian and Charles get to have the fun of teaching the class.
So hopefully you guys enjoyed your-- now you're starting-- you are done with projects one, two, and three and going into project four? Yeah, project four is really fun and--

[LAUGHTER]

It will look big and daunting, but at the end you'll enjoy it, especially with all the time people spend working on it. So I think I'm making you scared here, more than anything. OK, so let's get into the talk today. I will talk to you about domain-specific languages and a little bit about autotuning, and how this leads to autotuning. So why domain-specific languages? We are all used to the general-purpose languages that we use all day. Those languages are set up to capture a very large part of what people might want to do in programming. However, a lot of times there are specific areas, specific domains-- either some area you want to implement, or certain patterns of code you want to implement-- that have a lot of interesting properties that are very hard to describe in a general-purpose language.
And a lot of times it's very hard, especially from the compiler's point of view, to take advantage of those properties, because the compiler has to work for everybody. So domain-specific languages have a lot of built-in benefits. If you know that what you're building has a certain shape, a certain set of properties, and the language captures that, then building on top of it can be much easier. It should have a lot of clarity. It's very easy to maintain that kind of thing. It's very easy to test. And also, it's very easy to understand, because the domain is very clearly described. You can build a library, but somebody can go and do weird things in a library. If it is built into the language, it's set in stone. You can't go and say, oh yeah, I'm going to change something here, let me do some weird thing here. It's built into the language, so it stays there. It makes it much easier for programmers [INAUDIBLE].
But from my point of view, the domain-specific languages I really like are the ones where I know I can take advantage of the knowledge of domain experts to get really good performance. A lot of times, a domain expert says, aha, in this domain I can do-- OK, there's some linear algebra, but I know a kind of algebra I can use to simplify the expression. That algebra might only work in that domain. It's very hard to put that kind of complex algebra into C++ or C. But in that domain, I can say, ha, I can call on it, so you can write any expression and I can simplify it. And also, there are a lot of idioms in each domain. Some domain might say, OK, look, I am going to represent a graph that I'm going to talk about. In normal C++, you create a bunch of classes, you do these very complicated things, and the idiom is hidden in there. First of all, C++ doesn't know that it has to look for graphs. But even if it did look for graphs, you can write graphs in hundreds of millions of ways. But if graphs have first-class support in the language, I don't have to work heroically to extract that. It's there. I can easily see it.
So most of my compiler can be doing useful things in there. And most of the time, the other thing is, if you build a domain-specific language right, you can leave the complex, lower-level decisions to the compiler. In C++, you might be tempted to say, eh, I know some optimization, let me do something here, let me do some of the optimizations here. I have been working on optimization all my life. And a lot of times, when you write a compiler optimization pass, you spend half, or more than half, of your time undoing the crazy optimizations the programmer did, like you guys are learning. You think you know better, so you go do something. And that might work well then, but believe me, that code survives 20 years later. And 20 years later, it looks like a really stupid thing to do. And then you look at it and say, OK, now I have to undo everything in the compiler to do the right thing on the current architecture. Because of that, if you capture the right level, I will let the compiler do the work here.
And then as architectures keep maturing, as the problems keep changing, I don't have to worry. I don't have to undo these parts. So again, I'm coming to the performance engineering class and telling you guys, leave the performance to the compiler. But that's the nice thing: if the compiler can do most of your work, it's a much nicer job. So don't doubt the compiler.

I'm going to talk about three parts here. There are three different systems: two domain-specific languages, GraphIt and Halide, and then OpenTuner, which is not just a language but a framework. And between GraphIt and Halide, you will see some patterns. Then we'll see whether you found the pattern that we are working on here.

So GraphIt. This is a project that I worked on with Julian. So if you have any questions about GraphIt after today, you can definitely go ask Julian. He knows probably more about graphs and GraphIt-- more about graphs than probably anybody on this planet. So he's a good resource to talk to about graphs. So talking about graphs: graphs are everywhere.
So if you go to something like Google and do a search, Google has represented the entire knowledge on the internet as a big graph. They have done a huge amount of graph processing behind the scenes; that is what guides your search. Or if you go to maps, or something like Uber, it will find your directions. The entire road network is a graph, and it's trying to find things like the shortest path in this graph to give you the route. And if you go to a recommendation engine to get a recommendation for a movie-- if you get a really cool movie you like, that's because there's a huge graph connecting everybody to the movies they've watched and their likings of those movies. And they are looking at that, comparing you to them, and recommending. That all can be viewed as graphs. And even if you go to an ATM and try to do a transaction, there's a very fast graph analysis in the back to decide: is this a fraudulent transaction or not?
Most of the transactions people have done, all the connectivity, is back there. Before the money actually pops out of the ATM machine, it has done a bunch of graph processing to understand, OK, this seems like a good transaction, so I will actually give you the money. Sometimes you get that other message; that means the graph processing decided there might be some weird thing going on. So some of these things, like maps and these transactions, have very tight latency requirements. You have to get this done right. You have to get good directions. Especially if you take a wrong turn, you need to get the next set of directions very fast, before you go hit some bad, weird Boston traffic. So these things have to work fast. And other things, like recommendations and Google Search, use a huge graph. They build the entire web, and then all the recommendations have to do a huge amount of processing. So performance matters a lot in these applications.
So let me dive down a little bit deeper to show what graphs mean, what graph processing means. One of the very well-known graph algorithms is called PageRank. Anybody know [INAUDIBLE] Page? How many have heard of PageRank? OK, what does Page stand for in PageRank?

AUDIENCE: Larry Page.

SAMAN AMARASINGHE: Larry Page. So the first algorithm Google did-- I don't think this is anywhere near Google at this point-- was this algorithm, PageRank. It ranked these pages, but it was developed by Larry Page. So it depends-- either page means web pages, or it's Larry Page; we don't know. But people think PageRank is Larry Page. So you have a graph here. What this graph algorithm does is run some number of iterations, either up to max_iter or to some convergence. What it first does is go around, look at all its neighbors, and calculate, basically, a new rank out of all my neighbors. That means: how good are my neighbors? What's their rank? And what's their contribution to me?
So being known to a good person, having a connection to something very well known-- in this case, a super web page-- means I am highly ranked. I am more influential, because I'm closer to something important. So what it does is, basically, each node calculates some value and propagates it to all the neighbors, aggregating-- the entire graph participates in that. And then each node goes about calculating its new rank: looking at the old rank, it gets modified a little bit toward a new rank. And then they swap old ranks and new ranks. So these are the two computations, and you iterate over them. And you have to do it for the entire graph. So, of course, you can run this, but it will run very, very slowly. So if you want to get performance, you write this piece of code. This piece of code, basically, is huge. And it runs 23 times faster than the simple version on the previous slide, on a 12-core machine. It's basically multithreaded, so we get parallel performance.
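The two-step PageRank iteration just described-- each node gathers contributions from its neighbors to form a new rank, then old and new ranks are swapped-- can be sketched in plain Python. This is a minimal serial sketch, not the optimized code from the lecture; the damping factor of 0.85 and the convergence tolerance are conventional, illustrative choices.

```python
# A minimal, serial sketch of the PageRank iteration described above.
# Assumptions: the graph is a dict {node: [out-neighbors]}; damping=0.85
# and the tolerance are conventional choices, not values from the lecture.

def pagerank(graph, max_iter=100, damping=0.85, tol=1e-6):
    nodes = list(graph)
    n = len(nodes)
    old_rank = {v: 1.0 / n for v in nodes}          # start from a uniform rank
    out_degree = {v: len(graph[v]) for v in nodes}

    for _ in range(max_iter):
        # Step 1: each node's contribution (rank / out-degree) is
        # aggregated into the new rank of each of its neighbors.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            if out_degree[u]:
                share = old_rank[u] / out_degree[u]
                for v in graph[u]:
                    new_rank[v] += damping * share

        # Step 2: swap old and new ranks; stop early once converged.
        done = sum(abs(new_rank[v] - old_rank[v]) for v in nodes) < tol
        old_rank = new_rank
        if done:
            break
    return old_rank
```

On a symmetric three-node cycle, every node ends up with rank 1/3. The hand-optimized code the lecture refers to layers multithreading, load balancing, and cache optimizations on top of this same two-step loop.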
It is load balanced because, as you know, graphs are very unbalanced, so you get load balancing. If you have non-uniform memory access machines, things like multiple-socket machines, it will take advantage of that. It takes advantage of caches-- a lot of things are happening in this piece of code. But, of course, you know it is hard to write this piece of code. And worse, you might not know what to do, what the right optimization is-- you might only find out by iterating. You might try many things. And this is very hard: every time you change something, if you say, ah, I want to do something a little bit different, I have to write a very complicated piece of code, get it all right, get everything working, before I can test it. So this is why we can use a DSL for this one.

So let me talk a little bit about graph algorithms-- this seems like a new set of [INAUDIBLE]. What do people do with graphs? When people say graph algorithms, I'm going to go a little bit deeper to show you what type of things these graph computations represent. There's one class of graph algorithms called topology-driven algorithms.
That means the entire graph participates in the computation. For example, Google Search: before you do a Google search, it will do the entire basic collection of all the web links. It will build this huge graph and do a huge amount of processing to basically be able to do the search. Or a recommendation engine: every few weeks, or whatever it is, it will collect everybody's recommendations, have this huge data set, and process it to drive the recommendation engine. So this applies to the entire graph, and sometimes billions or trillions of nodes have to go into this computation.

Another set of algorithms is called data-driven algorithms. What that means is you start with certain nodes, and then you keep going to their neighbors, and their neighbors' neighbors, processing data as you go. The kinds of algorithms that fit in this category are things like maps: if I have to find the shortest path between, probably, two points-- if I want directions from here to Boston, I don't have to go through the nodes in New York.
I just have to go through my neighbors, the nodes connected to me. So I am basically operating on a certain area, with some connections, and processing that. These are data-driven algorithms. I might have a huge graph, but my computation might only work on a small region or a small part of the graph.

So when you are traversing through a graph, there are multiple ways of doing graph traversals. And this is why optimization is hard: there are many different ways of doing things, and each has a different set of outcomes. In a lot of graph algorithms, I need to get something from my neighbors. One way to get something from my neighbors is I can calculate what the neighbors-- all my neighbors-- might want, and give it to all the neighbors. Or I can go change all the neighbors to update my value. So why do you think-- OK, you have done some programming here. What do you think about this? Is this a good way? So if I want to update everybody, I will calculate what I will do, and I'll go change everybody, all my neighbors.
AUDIENCE: This is not as parallel as it could be.

SAMAN AMARASINGHE: Not as parallel as it could be. I think you are getting to a point. But why is it not that parallel?

AUDIENCE: Well, if you're doing the same thing with the neighbors, you might as well just tell your neighbors to do the work for you.

SAMAN AMARASINGHE: Yeah, but-- that's a very good point. So if I'm the only one doing it, in a data-driven way, that's not good. But if everybody is doing that to their neighbors, then I have parallelism. Everybody is updating their neighbors. So now there's another problem showing up. What's the problem if everybody tries to update their neighbors? Back there.

AUDIENCE: There's a determinacy race.

SAMAN AMARASINGHE: There's a race in there, because everybody's writing in there. So if you want to get this actually right, you have a bunch of issues. You basically want to do atomic updates, because you need to lock that thing.
So it has to get atomically updated. And this is nice because I don't have to traverse anything-- everybody I need to update, I actually go and update. That's a nice way to do it, especially if it is not a global thing: if I'm propagating, I will update my neighbors, and I can propagate that down. This is the push schedule.

Another way to do it is a pull schedule. That means everybody asks their neighbors: OK, what do you have? Give it to me. And I collect everything from my neighbors, and I update myself. So is there a race condition now? How many people say there is a race? How many people think there's no race? What happens is I'm reading from all the neighbors. Everybody is reading from their neighbors, but I am only updating myself. Because of that, I'm the only one writing me, so I don't have a race. It is really nice that you don't have a race. But if I'm doing a data-driven computation, I might not know that I need to get updated, because the update comes from the other node.
And that means I might be asking you, do you have anything to send? And you might say no. So in that sense, I might basically be doing a lot of extra computation beyond what's necessary, because I might not know that there's data I need to get-- I have to ask you whether I should do this. But I don't have any need to do any synchronization.

Another interesting thing is I can take this graph and basically partition it. And once I partition the graph, I can say, OK, this core gets this part of the graph, this core gets that part. Or this processor node gets this part. What's the advantage of partitioning a graph? Why do I want to partition a large graph into small pieces? Of course, you have to do a good partition; you can't do an arbitrary partition. So what happens if I do a good partitioning? I won't say the word, because then the answer comes right out. OK, let me see if anybody else-- you have answered-- anybody else want to answer? Come on. You have to-- [INAUDIBLE].
What happens if I take the graph, find two different groups, separate them, and give this one to one processor and that one to another? What do I get?

AUDIENCE: You get some parallelism.

SAMAN AMARASINGHE: I get parallelism, also. But the other thing: if I have a lot of connected things going to-- these connected things going to that processor-- what else can I get? Locality. Have you heard? Did you do locality in the class? So the partition means the thing I'm working on, I am only working on a small amount. And that might, if I'm lucky, fit in my cache. That would be very nice, rather than everybody having to go to every node. So if I partition this properly, I will get good locality. It's actually written there-- whoops, my answer was in there: improved locality. But, of course, now I might have a little bit of extra overhead, because I might have to replicate some nodes, stuff like that, because they're on both sides.
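The push and pull traversal schedules from the discussion above can be sketched side by side. This is a toy sketch under stated assumptions: the graph is stored as adjacency-list dicts (out-edges for push, in-edges for pull), the per-edge work is simply "add the source's value into the destination", and a per-destination lock stands in for an atomic add. The function names are illustrative, not GraphIt's API.

```python
# Toy sketch of push vs. pull schedules over one round of "add my value
# into my neighbors". Assumptions: adjacency-list dicts, and locks as a
# stand-in for atomic adds; names here are illustrative only.
import threading
from concurrent.futures import ThreadPoolExecutor

def push_step(out_neighbors, values):
    """Push: every node writes its contribution into each out-neighbor.
    Different nodes can race on the same destination, so each write is
    guarded by that destination's lock."""
    new_values = dict(values)
    locks = {v: threading.Lock() for v in values}

    def push_from(u):
        for v in out_neighbors[u]:
            with locks[v]:                     # serialize racing writers
                new_values[v] += values[u]

    with ThreadPoolExecutor() as pool:         # all nodes push in parallel
        list(pool.map(push_from, out_neighbors))
    return new_values

def pull_step(in_neighbors, values):
    """Pull: every node reads from its in-neighbors and writes only its
    own entry, so there are no racing writes and no locks needed -- at
    the cost of polling neighbors that may have nothing to send."""
    return {v: values[v] + sum(values[u] for u in in_neighbors[v])
            for v in values}
```

On the same graph, one push step and one pull step produce identical results. Push needs the synchronization because many nodes may write the same destination; pull is race-free but must ask every in-neighbor, even ones with nothing to send.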
428 00:18:27,980 --> 00:18:31,250 So another interesting property of graphs 429 00:18:31,250 --> 00:18:33,620 is, when you look at the data structures until now, 430 00:18:33,620 --> 00:18:35,795 things like arrays, the size matters. 431 00:18:35,795 --> 00:18:37,670 The size determines whether the array fits in the cache, 432 00:18:37,670 --> 00:18:38,700 and stuff like that. 433 00:18:38,700 --> 00:18:40,940 Graphs, there are some other properties 434 00:18:40,940 --> 00:18:44,270 of the graphs in here. 435 00:18:44,270 --> 00:18:48,030 So if you go to social networks-- 436 00:18:48,030 --> 00:18:50,030 a social network is a graph-- 437 00:18:50,030 --> 00:18:52,700 what's the interesting property in social networks 438 00:18:52,700 --> 00:18:53,510 you have observed? 439 00:18:58,180 --> 00:18:59,180 AUDIENCE: Connectedness. 440 00:18:59,180 --> 00:19:00,240 SAMAN AMARASIGNHE: Connectedness-- 441 00:19:00,240 --> 00:19:02,010 there are people like me that probably 442 00:19:02,010 --> 00:19:04,350 have 20 friends in there and have a very 443 00:19:04,350 --> 00:19:05,590 small number of connections. 444 00:19:05,590 --> 00:19:07,590 And then there are celebrities who have millions 445 00:19:07,590 --> 00:19:09,120 of connections in here. 446 00:19:09,120 --> 00:19:11,790 So the interesting thing is, if you look at a social network 447 00:19:11,790 --> 00:19:16,260 graph, you have this relationship called a power law 448 00:19:16,260 --> 00:19:17,520 relationship. 449 00:19:17,520 --> 00:19:19,880 That means there's an exponential curve. 450 00:19:19,880 --> 00:19:24,960 There are some people here, like very well-known celebrities, 451 00:19:24,960 --> 00:19:28,410 that might have millions and millions of users in here-- 452 00:19:28,410 --> 00:19:30,390 connections and neighbors, or likes, 453 00:19:30,390 --> 00:19:32,252 or whatever it is in that node. 
454 00:19:32,252 --> 00:19:33,960 And there are people like me sitting here 455 00:19:33,960 --> 00:19:36,420 that have very few people connected 456 00:19:36,420 --> 00:19:37,540 to the rest of the world. 457 00:19:37,540 --> 00:19:40,200 So this is normally-- people have observed that in these big social 458 00:19:40,200 --> 00:19:41,208 network type graphs, 459 00:19:41,208 --> 00:19:43,125 you have this kind of exponential relationship 460 00:19:43,125 --> 00:19:44,490 in here. 461 00:19:44,490 --> 00:19:47,430 So the web has an exponential relationship. 462 00:19:47,430 --> 00:19:50,220 A social network has this kind of relationship in there. 463 00:19:50,220 --> 00:19:52,950 So for those things, you have to do very interesting things 464 00:19:52,950 --> 00:19:54,407 when you process these graphs. 465 00:19:54,407 --> 00:19:56,490 Because there are certain connections that matter, 466 00:19:56,490 --> 00:19:58,560 certain nodes that matter a lot, or have 467 00:19:58,560 --> 00:19:59,850 a bigger impact than other nodes. 468 00:20:02,930 --> 00:20:05,040 Then there are other graphs that have 469 00:20:05,040 --> 00:20:06,290 a bounded-degree distribution. 470 00:20:06,290 --> 00:20:10,070 If you have a road network, the maximum connection-- 471 00:20:10,070 --> 00:20:12,080 probably you might have an intersection 472 00:20:12,080 --> 00:20:14,490 that has six roads coming together in there. 473 00:20:14,490 --> 00:20:16,920 You don't have a million roads connecting into one place 474 00:20:16,920 --> 00:20:17,670 anywhere in there. 475 00:20:17,670 --> 00:20:18,750 So that doesn't happen. 476 00:20:18,750 --> 00:20:21,320 So these are a lot flatter, a lot more 477 00:20:21,320 --> 00:20:24,107 bounded-degree distribution graphs in here. 478 00:20:24,107 --> 00:20:25,940 They have lots of excellent locality in here 479 00:20:25,940 --> 00:20:29,450 because, of course, all the roads in Cambridge 480 00:20:29,450 --> 00:20:30,290 might be connected. 
481 00:20:30,290 --> 00:20:32,420 But roads in Cambridge can be separated 482 00:20:32,420 --> 00:20:34,173 from roads in New York City. 483 00:20:34,173 --> 00:20:35,340 So there they are separated. 484 00:20:35,340 --> 00:20:36,760 There is nice locality 485 00:20:36,760 --> 00:20:37,890 in these kinds of graphs. 486 00:20:37,890 --> 00:20:41,120 So even if the graphs are 487 00:20:41,120 --> 00:20:43,040 the same size, the shape of the graph 488 00:20:43,040 --> 00:20:44,720 matters in the computation, a lot of times. 489 00:20:47,820 --> 00:20:50,360 So what happens is, now, when you want 490 00:20:50,360 --> 00:20:51,880 to operate on these graphs, you have 491 00:20:51,880 --> 00:20:55,500 to look at three interesting properties. 492 00:20:55,500 --> 00:20:58,010 One property is, OK, how much parallelism 493 00:20:58,010 --> 00:21:00,080 is my algorithm-- what I'm trying to do to this graph-- 494 00:21:00,080 --> 00:21:02,270 going to get? 495 00:21:02,270 --> 00:21:03,760 It's like a Goldilocks type thing. 496 00:21:03,760 --> 00:21:05,540 You don't want too much parallelism. 497 00:21:05,540 --> 00:21:07,530 If you say, I have an algorithm with a huge amount 498 00:21:07,530 --> 00:21:10,517 of parallelism, but I can't take advantage of it, it's not useful. 499 00:21:10,517 --> 00:21:12,350 So you need to get parallelism good enough 500 00:21:12,350 --> 00:21:14,570 that I can actually use it. 501 00:21:14,570 --> 00:21:17,240 Then I really like to have locality. 502 00:21:17,240 --> 00:21:21,260 Because if I have locality, my caches will work. 503 00:21:21,260 --> 00:21:22,460 Everything will be nearby. 504 00:21:22,460 --> 00:21:23,772 I can run things fast. 505 00:21:23,772 --> 00:21:26,230 If, every time, I have to get something from main memory, 506 00:21:26,230 --> 00:21:27,460 it can be very, very slow. 507 00:21:27,460 --> 00:21:29,630 So I want to get locality. 
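Going back to the two graph shapes above, the power-law versus bounded-degree distinction is easy to see by just computing degrees. A toy Python sketch with two made-up graphs (a star standing in for a celebrity-centered social network, a path standing in for a road network):

```python
from collections import Counter

def degrees(edges, num_nodes):
    """Degree of every node in an undirected edge list."""
    d = Counter()
    for u, v in edges:
        d[u] += 1
        d[v] += 1
    return [d[v] for v in range(num_nodes)]

# "Social network"-like star: one celebrity node connected to everyone.
star = [(0, v) for v in range(1, 8)]
# "Road network"-like path: degree is bounded (at most 2 here).
path = [(v, v + 1) for v in range(7)]

star_deg = degrees(star, 8)  # node 0 has degree 7, everyone else degree 1
path_deg = degrees(path, 8)  # every node has degree at most 2
```

In the star, one node dominates and must be handled specially; in the path, the flat degree distribution means every node costs about the same, which is why the two shapes want different schedules.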
508 00:21:29,630 --> 00:21:31,250 But the interesting thing about graphs 509 00:21:31,250 --> 00:21:34,550 is, to get locality and get some of these, 510 00:21:34,550 --> 00:21:37,050 you might have to do some extra work. 511 00:21:37,050 --> 00:21:39,350 So if you saw, when that graph got divided 512 00:21:39,350 --> 00:21:42,590 into two different graphs, I had to add extra nodes in here. 513 00:21:42,590 --> 00:21:44,340 I might create some extra data structures, 514 00:21:44,340 --> 00:21:45,800 do some extra computation. 515 00:21:45,800 --> 00:21:48,380 So I might have to do some extra work in here. 516 00:21:48,380 --> 00:21:52,190 So in certain things, I might not be that work efficient. 517 00:21:52,190 --> 00:21:54,870 So I might get really good parallelism and locality, 518 00:21:54,870 --> 00:21:56,280 but I am doing too much work. 519 00:21:56,280 --> 00:21:58,880 So, for example, 520 00:21:58,880 --> 00:22:04,375 assume I want to find one node's neighbors. 521 00:22:04,375 --> 00:22:05,750 A very easy way to get good parallelism: 522 00:22:05,750 --> 00:22:08,730 everybody finds their neighbors. 523 00:22:08,730 --> 00:22:09,995 OK, but that's not efficient. 524 00:22:09,995 --> 00:22:11,870 I mean, most of the computation's not useful. 525 00:22:11,870 --> 00:22:13,880 So there, you can do things where you are doing 526 00:22:13,880 --> 00:22:15,590 more work than necessary. 527 00:22:15,590 --> 00:22:17,930 That can make other things much faster. 528 00:22:17,930 --> 00:22:20,870 But you have to be careful in doing that. 529 00:22:20,870 --> 00:22:22,850 So you have this balance in there. 530 00:22:22,850 --> 00:22:26,720 So certain algorithms will fit in different places 531 00:22:26,720 --> 00:22:27,690 in this tradeoff space. 532 00:22:27,690 --> 00:22:29,565 So a push algorithm will fit in here. 
533 00:22:29,565 --> 00:22:31,190 So, for example, if you go to something 534 00:22:31,190 --> 00:22:33,470 like a pull algorithm, what you might find 535 00:22:33,470 --> 00:22:36,560 is that you are less work efficient. 536 00:22:36,560 --> 00:22:38,550 Because you might do a little bit more work. 537 00:22:38,550 --> 00:22:41,760 But it might be better in locality and parallelism, 538 00:22:41,760 --> 00:22:44,030 because you don't have to do locks in here. 539 00:22:44,030 --> 00:22:46,310 And then you do something like partitioning. 540 00:22:46,310 --> 00:22:48,620 You get really good locality with partitioning. 541 00:22:48,620 --> 00:22:50,000 But you are doing extra work. 542 00:22:50,000 --> 00:22:51,500 And also, because of your partition, 543 00:22:51,500 --> 00:22:53,820 you might limit your parallelism in here. 544 00:22:53,820 --> 00:22:55,460 So you might have less parallelism, but you 545 00:22:55,460 --> 00:22:56,780 get really good locality. 546 00:22:56,780 --> 00:23:00,080 So all this is basically a large tradeoff space in here. 547 00:23:00,080 --> 00:23:02,510 And then when you keep adding more and more things 548 00:23:02,510 --> 00:23:06,870 you can do, it fits into this big tradeoff space. 549 00:23:06,870 --> 00:23:10,250 So how you decide where to go in the tradeoff space is a very 550 00:23:10,250 --> 00:23:11,930 important decision. 551 00:23:11,930 --> 00:23:13,450 So it depends on the graphs. 552 00:23:13,450 --> 00:23:16,220 If you have power law graphs, you might want to do something. 553 00:23:16,220 --> 00:23:21,170 If you have a more bounded-degree type of graph, 554 00:23:21,170 --> 00:23:22,742 you want to do something else. 555 00:23:22,742 --> 00:23:24,200 And with power law graphs, sometimes 556 00:23:24,200 --> 00:23:27,560 you might do something different for the highly connected nodes 557 00:23:27,560 --> 00:23:28,190 versus others. 558 00:23:28,190 --> 00:23:30,750 Or you might not even differentiate between them. 
559 00:23:30,750 --> 00:23:32,520 It depends on the algorithm. 560 00:23:32,520 --> 00:23:34,222 So if you are doing-- 561 00:23:34,222 --> 00:23:36,680 visiting all the nodes, whereas as a data-driven algorithm, 562 00:23:36,680 --> 00:23:38,490 you might do something different. 563 00:23:38,490 --> 00:23:41,370 It also depends on the hardware you're running. 564 00:23:41,370 --> 00:23:45,380 So, for example, if you are doing a Google search, 565 00:23:45,380 --> 00:23:46,970 basically indexing, you're running 566 00:23:46,970 --> 00:23:50,450 an algorithm that has to operate on the entire graph in here. 567 00:23:50,450 --> 00:23:52,910 And the graph is a power law graph in that. 568 00:23:52,910 --> 00:23:54,840 And you're running on a cluster. 569 00:23:54,840 --> 00:23:56,510 So the right thing might be something 570 00:23:56,510 --> 00:23:59,330 like a pull schedule with some partitioning and something 571 00:23:59,330 --> 00:24:01,010 like a vertex parallel, or some kind 572 00:24:01,010 --> 00:24:02,930 of a parallelism scheme in here might give you 573 00:24:02,930 --> 00:24:05,330 the best performance. 574 00:24:05,330 --> 00:24:06,830 But in the other side of the Google, 575 00:24:06,830 --> 00:24:08,300 if you're trying to do a map, and if you're 576 00:24:08,300 --> 00:24:10,010 trying to give you directions, you 577 00:24:10,010 --> 00:24:12,620 have a very different type of a graph. 578 00:24:12,620 --> 00:24:14,840 You are doing a data-driven algorithm in that graph. 579 00:24:14,840 --> 00:24:16,640 And you might be running on a single machine. 580 00:24:16,640 --> 00:24:18,057 Because you need to give direction 581 00:24:18,057 --> 00:24:20,148 fast for each individual time. 
582 00:24:20,148 --> 00:24:22,190 You might have a very different type of algorithm 583 00:24:22,190 --> 00:24:23,690 you want to run on this graph, the push 584 00:24:23,690 --> 00:24:26,660 algorithm in vertex parallel, perhaps, 585 00:24:26,660 --> 00:24:28,550 some combination in there. 586 00:24:28,550 --> 00:24:31,910 And, of course, if you get a bad algorithm or a bad 587 00:24:31,910 --> 00:24:34,610 way of doing it, you can be very bad. 588 00:24:34,610 --> 00:24:36,770 You can get hundreds or thousands of times slower 589 00:24:36,770 --> 00:24:38,370 than the best you can achieve. 590 00:24:38,370 --> 00:24:40,700 So it matters to find the right thing, 591 00:24:40,700 --> 00:24:42,230 the right way of doing things. 592 00:24:42,230 --> 00:24:44,520 So this is where GraphIt came in. 593 00:24:44,520 --> 00:24:46,760 GraphIt is a domain specific language, basically, 594 00:24:46,760 --> 00:24:47,810 that we developed. 595 00:24:47,810 --> 00:24:51,950 And one thing GraphIt did was we said, OK, look, 596 00:24:51,950 --> 00:24:54,380 the algorithm is mostly constant. 597 00:24:54,380 --> 00:24:57,410 But how you process it-- how you go about it-- 598 00:24:57,410 --> 00:24:58,800 is very different. 599 00:24:58,800 --> 00:25:01,070 So we want to separate these things. 600 00:25:01,070 --> 00:25:03,780 So the first thing we did was come up 601 00:25:03,780 --> 00:25:06,490 with the algorithm, which is what you want to compute. 602 00:25:06,490 --> 00:25:07,870 It's very high level. 603 00:25:07,870 --> 00:25:10,740 It doesn't tell you how we are computing that-- it just says, this 604 00:25:10,740 --> 00:25:11,860 is my algorithm. 605 00:25:11,860 --> 00:25:13,370 I aim to process these nodes. 606 00:25:13,370 --> 00:25:17,210 And this is the computation I want to do in there. 607 00:25:17,210 --> 00:25:20,760 And you separate it from an optimization schedule-- 608 00:25:20,760 --> 00:25:22,030 how to compute it. 
609 00:25:22,030 --> 00:25:24,340 So we'd say, OK, to do this algorithm, 610 00:25:24,340 --> 00:25:27,680 you have to do a push schedule, do this type of parallelism-- 611 00:25:27,680 --> 00:25:28,500 each separately. 612 00:25:28,500 --> 00:25:31,940 And the nice thing is that now, if the graph changed 613 00:25:31,940 --> 00:25:34,140 or if the machine changed, I can give you 614 00:25:34,140 --> 00:25:36,480 a different schedule in here. 615 00:25:36,480 --> 00:25:38,670 So let me show you some examples. 616 00:25:38,670 --> 00:25:41,130 First, look at the algorithm in here. 617 00:25:41,130 --> 00:25:45,190 So we show three different types of things you want to do. 618 00:25:45,190 --> 00:25:47,180 So you want to do the entire graph in here, 619 00:25:47,180 --> 00:25:48,180 or have it data-driven. 620 00:25:48,180 --> 00:25:53,040 Or I might want to just operate on the vertices in here. 621 00:25:53,040 --> 00:25:56,070 So for this one-- 622 00:25:56,070 --> 00:25:58,610 the language provides a very simple way of doing that. 623 00:25:58,610 --> 00:26:00,090 The language has this function saying, 624 00:26:00,090 --> 00:26:03,180 for the edges-- all the edges of the graph-- apply: 625 00:26:03,180 --> 00:26:04,440 you can give a function. 626 00:26:04,440 --> 00:26:08,530 The function takes, basically, the nodes and the edges 627 00:26:08,530 --> 00:26:10,080 to basically carry out 628 00:26:10,080 --> 00:26:12,163 this computation, a very simple way of doing that. 629 00:26:12,163 --> 00:26:13,530 So this is the representation. 630 00:26:13,530 --> 00:26:16,290 So the nice thing is the simplicity of programming now. 631 00:26:16,290 --> 00:26:20,670 If I write it in C, it will look like a big blob of ugly code. 632 00:26:20,670 --> 00:26:23,550 In the domain specific language, all you have to write is this-- 633 00:26:23,550 --> 00:26:25,350 it makes life very simple. 
634 00:26:25,350 --> 00:26:27,060 Or for a data-driven algorithm, 635 00:26:27,060 --> 00:26:30,060 I have to say, OK, I start with this set of vertices 636 00:26:30,060 --> 00:26:32,240 to compute in here. 637 00:26:32,240 --> 00:26:34,600 And here are the vertices I am going to in here, 638 00:26:34,600 --> 00:26:36,472 the vertex set here. 639 00:26:36,472 --> 00:26:37,680 And then I do some filtering. 640 00:26:37,680 --> 00:26:39,360 Because I might not go visit everybody. 641 00:26:39,360 --> 00:26:43,072 There's some filtering of what you can do. 642 00:26:43,072 --> 00:26:45,030 And then once you figure out exactly the things 643 00:26:45,030 --> 00:26:46,130 you are computing, here's a function 644 00:26:46,130 --> 00:26:47,460 to go and apply to that. 645 00:26:47,460 --> 00:26:50,640 So I can give you some very nice way of basically subsetting 646 00:26:50,640 --> 00:26:52,510 my graph with certain properties, 647 00:26:52,510 --> 00:26:54,770 selecting those things, and now going to compute there. 648 00:26:54,770 --> 00:26:56,800 And if you're only doing vertices, 649 00:26:56,800 --> 00:26:58,860 say, OK, for each vertex, again, I 650 00:26:58,860 --> 00:27:01,050 can filter, saying this subset or something goes 651 00:27:01,050 --> 00:27:02,250 to that computation. 652 00:27:02,250 --> 00:27:04,378 So language-wise, it's very simple. 653 00:27:04,378 --> 00:27:05,670 This is all you have to do. 654 00:27:05,670 --> 00:27:10,770 Now if you look at PageRank, PageRank 655 00:27:10,770 --> 00:27:12,670 has two interesting update functions. 656 00:27:12,670 --> 00:27:15,570 One is an update 657 00:27:15,570 --> 00:27:16,390 looking at edges. 658 00:27:16,390 --> 00:27:19,770 So what it says is, the new rank-- I get the destination vertex. 659 00:27:19,770 --> 00:27:22,830 And it gets updated using all the source vertices in here. 660 00:27:22,830 --> 00:27:25,710 This is the update function, a very simple update function. 
661 00:27:25,710 --> 00:27:29,220 And then once you do that for each, basically, vertex, 662 00:27:29,220 --> 00:27:30,870 I go do internal update. 663 00:27:30,870 --> 00:27:34,620 I give these two functions and put them together into driver. 664 00:27:34,620 --> 00:27:37,620 And the driver says run this function, run this function, 665 00:27:37,620 --> 00:27:39,420 and I'm done. 666 00:27:39,420 --> 00:27:42,720 OK, so I can write this code at higher level, much 667 00:27:42,720 --> 00:27:45,490 simpler, much nicer, much more elegant way. 668 00:27:45,490 --> 00:27:46,990 It's much easier to understand. 669 00:27:46,990 --> 00:27:49,620 It's easier than even the simple C++ code to understand 670 00:27:49,620 --> 00:27:52,520 what's going on if you write it in this way. 671 00:27:52,520 --> 00:27:55,830 So this is the first advantage of a domain specific language. 672 00:27:55,830 --> 00:27:57,030 I can do this. 673 00:27:57,030 --> 00:27:58,740 Then the next thing you can do is now 674 00:27:58,740 --> 00:28:01,130 I can come up with the schedule. 675 00:28:01,130 --> 00:28:02,903 So schedules should be easy to use. 676 00:28:02,903 --> 00:28:04,320 And it should be powerful enough I 677 00:28:04,320 --> 00:28:06,720 should be able to get the best speed possible. 678 00:28:06,720 --> 00:28:08,640 Because I can tell you all the crazy things 679 00:28:08,640 --> 00:28:10,540 I can do to the code. 680 00:28:10,540 --> 00:28:15,660 So here's my program here for PageRank. 681 00:28:15,660 --> 00:28:18,600 And so what I can do is, for this algorithm, 682 00:28:18,600 --> 00:28:21,110 I can provide this schedule in here. 683 00:28:21,110 --> 00:28:25,020 And this schedule basically says, OK, look at this guy, s1. 684 00:28:25,020 --> 00:28:27,030 I marked it in there. 685 00:28:27,030 --> 00:28:31,150 For s1, I want to do SparsePush type computation. 686 00:28:31,150 --> 00:28:33,780 This is how I want to process this one. 
687 00:28:33,780 --> 00:28:36,960 And then, by looking at that, I can generate pseudo code 688 00:28:36,960 --> 00:28:39,420 that looks like this, that basically first goes 689 00:28:39,420 --> 00:28:41,850 through a source node, because I'm doing push 690 00:28:41,850 --> 00:28:43,173 from source to destination. 691 00:28:43,173 --> 00:28:45,090 And then I'm going through all the destination 692 00:28:45,090 --> 00:28:46,630 nodes of that source. 693 00:28:46,630 --> 00:28:48,790 And I'm going to actually go and update them. 694 00:28:48,790 --> 00:28:52,077 So I can do this very simple updating here. 695 00:28:52,077 --> 00:28:53,910 But this might not get you the performance. 696 00:28:53,910 --> 00:28:57,330 I say, ah ha, I want to do this with parallelism. 697 00:28:57,330 --> 00:28:59,130 I want to run this parallel. 698 00:28:59,130 --> 00:29:01,470 And then when I do that, it will automatically generate-- 699 00:29:01,470 --> 00:29:04,283 say, ah ha, now I will make this loop parallel. 700 00:29:04,283 --> 00:29:05,700 And now I can't do simple updates. 701 00:29:05,700 --> 00:29:06,950 I have to do an atomic add. 702 00:29:06,950 --> 00:29:08,790 So here's my atomic add operation 703 00:29:08,790 --> 00:29:11,010 for the graph in here. 704 00:29:11,010 --> 00:29:13,860 Then you might think, and say, mm, do I want to do the push? 705 00:29:13,860 --> 00:29:15,420 Can I do a pull? 706 00:29:15,420 --> 00:29:19,210 So if I do a pull schedule, it will basically switch these-- 707 00:29:19,210 --> 00:29:19,710 in here. 708 00:29:19,710 --> 00:29:21,690 Now I am going from destination to source. 709 00:29:21,690 --> 00:29:23,163 I changed the order in there. 710 00:29:23,163 --> 00:29:25,080 And now I don't have to do that atomic update. 711 00:29:25,080 --> 00:29:30,270 Because I am pulling everything to my node and updating here. 
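The push-versus-pull switch described here can be sketched in plain Python. This is a hypothetical single rank-propagation step on a made-up three-node graph (not GraphIt's generated code): both directions compute the same numbers, but push writes into other nodes' entries, which is why the parallel push version needs the atomic add, while pull only ever writes the entry it owns.

```python
# Toy directed graph: 0 -> 1, 0 -> 2, 1 -> 2, stored both ways.
graph_out = {0: [1, 2], 1: [2]}  # out-neighbors, used by push
graph_in = {1: [0], 2: [0, 1]}   # in-neighbors, used by pull
out_degree = {0: 2, 1: 1}
rank = [1 / 3, 1 / 3, 1 / 3]
n = 3

def push_step():
    contrib = [0.0] * n
    for src, dsts in graph_out.items():
        for dst in dsts:
            # Writes into dst's slot: with sources running in
            # parallel, this update would have to be an atomic add.
            contrib[dst] += rank[src] / out_degree[src]
    return contrib

def pull_step():
    contrib = [0.0] * n
    for dst, srcs in graph_in.items():
        for src in srcs:
            # Only dst's own slot is written: no atomics needed
            # even if the destinations run in parallel.
            contrib[dst] += rank[src] / out_degree[src]
    return contrib

pushed, pulled = push_step(), pull_step()  # same numbers either way
```

The pull version pays for this by needing the in-edge representation of the graph, which is part of the extra work and data-layout tradeoff discussed earlier.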
712 00:29:30,270 --> 00:29:32,280 And then, of course, if you want to do some kind 713 00:29:32,280 --> 00:29:34,480 of partitioning, I can also say partitioning, 714 00:29:34,480 --> 00:29:36,603 it's-- now we created a sub-graph in here. 715 00:29:36,603 --> 00:29:38,770 And for the sub-graph, I am doing this partitioning. 716 00:29:38,770 --> 00:29:40,620 So I can keep changing all these things. 717 00:29:40,620 --> 00:29:42,270 Look, I didn't touch this. 718 00:29:42,270 --> 00:29:44,430 My algorithm still stays same. 719 00:29:44,430 --> 00:29:46,350 I'm changing my scheduling. 720 00:29:46,350 --> 00:29:49,710 I can play with this schedule. 721 00:29:49,710 --> 00:29:51,270 Nice thing about that is now if you 722 00:29:51,270 --> 00:29:53,200 keep playing with the schedule, here's 723 00:29:53,200 --> 00:29:54,450 the kind of performance I get. 724 00:29:54,450 --> 00:29:57,420 The first guy was sequential, pretty bad performance. 725 00:29:57,420 --> 00:29:59,400 The next guy, I just parallelized in here. 726 00:29:59,400 --> 00:30:01,760 I got some performance in here. 727 00:30:01,760 --> 00:30:04,160 But it had all the synchronization. 728 00:30:04,160 --> 00:30:06,210 So I changed the order of execution. 729 00:30:06,210 --> 00:30:08,180 And I got an even better performance. 730 00:30:08,180 --> 00:30:10,430 And now I partitioned, got [INAUDIBLE] performance. 731 00:30:10,430 --> 00:30:12,020 So this is the order of doing that. 732 00:30:12,020 --> 00:30:14,270 But, of course, you can play with many, many different 733 00:30:14,270 --> 00:30:15,600 combinations. 734 00:30:15,600 --> 00:30:18,890 And what GraphIt has is huge number 735 00:30:18,890 --> 00:30:22,260 of different combinations you can play with. 736 00:30:22,260 --> 00:30:24,380 So there are a lot of different optimizations. 
737 00:30:24,380 --> 00:30:26,150 You can do direction optimizations, 738 00:30:26,150 --> 00:30:29,130 push, pull, doing a sparse, dense, different 739 00:30:29,130 --> 00:30:34,610 parallelization, cache, NUMA optimization, and also 740 00:30:34,610 --> 00:30:37,340 data layout, things like structures of arrays, 741 00:30:37,340 --> 00:30:41,450 array of structure layout, additional data structures that 742 00:30:41,450 --> 00:30:42,350 simplify computation. 743 00:30:42,350 --> 00:30:44,820 All these things I can specify in here. 744 00:30:44,820 --> 00:30:46,070 And then you can play with it. 745 00:30:46,070 --> 00:30:48,480 It's not clear which one wins. 746 00:30:48,480 --> 00:30:51,170 It depends on the algorithm, depending on the graph shape, 747 00:30:51,170 --> 00:30:53,520 graph size, depending on the machine you run. 748 00:30:53,520 --> 00:30:56,833 So most of the time, if you are a performance engineer, 749 00:30:56,833 --> 00:30:58,250 you'll be trying different things, 750 00:30:58,250 --> 00:31:01,087 and looking at the performance, and say, this 751 00:31:01,087 --> 00:31:02,420 doesn't get good cache behavior. 752 00:31:02,420 --> 00:31:03,800 OK, let me try different things. 753 00:31:03,800 --> 00:31:04,960 So you want to iterate. 754 00:31:04,960 --> 00:31:06,710 And these iterations, you want to do fast. 755 00:31:06,710 --> 00:31:08,600 And this will do that. 756 00:31:08,600 --> 00:31:10,970 So let me tell you a little bit of results. 757 00:31:10,970 --> 00:31:11,740 This is a-- 758 00:31:11,740 --> 00:31:12,590 I have to explain. 759 00:31:12,590 --> 00:31:15,330 This a little bit of a complicated graph. 760 00:31:15,330 --> 00:31:17,840 So what we looked at was shown against bunch 761 00:31:17,840 --> 00:31:21,500 of different benchmarks, a bunch of different frameworks 762 00:31:21,500 --> 00:31:22,760 that do graphs. 
763 00:31:22,760 --> 00:31:27,290 So what this says is, here is a program, PageRank, run 764 00:31:27,290 --> 00:31:29,730 on a graph-- a general graph in here. 765 00:31:29,730 --> 00:31:31,370 One means it ran the fastest. 766 00:31:31,370 --> 00:31:33,520 This ran about 8% slower. 767 00:31:33,520 --> 00:31:35,550 This ran 50% slower. 768 00:31:35,550 --> 00:31:39,860 This ran 3x slower, and 8x slower for that graph. 769 00:31:39,860 --> 00:31:42,410 The interesting thing is, as you add more different graphs, 770 00:31:42,410 --> 00:31:43,890 the performance changes. 771 00:31:43,890 --> 00:31:46,430 So, in fact, even though we ran fastest 772 00:31:46,430 --> 00:31:50,450 there, for this road graph, which is a very different type of graph, 773 00:31:50,450 --> 00:31:52,910 this framework 774 00:31:52,910 --> 00:31:54,470 got the fastest result. 775 00:31:54,470 --> 00:31:55,850 Because the graph is different. 776 00:31:55,850 --> 00:31:58,855 So it might be doing something that's better. 777 00:31:58,855 --> 00:32:00,230 The interesting thing is-- 778 00:32:00,230 --> 00:32:02,420 because most of the other frameworks 779 00:32:02,420 --> 00:32:05,030 will have a couple of built-in things they try. 780 00:32:05,030 --> 00:32:07,010 They don't give you all this ability 781 00:32:07,010 --> 00:32:08,250 to try all these optimizations. 782 00:32:08,250 --> 00:32:09,530 They say, ah ha, I know this. 783 00:32:09,530 --> 00:32:10,070 This is really good. 784 00:32:10,070 --> 00:32:10,880 I will do that. 785 00:32:10,880 --> 00:32:13,280 It works for certain things, not for everybody. 786 00:32:13,280 --> 00:32:18,530 And so if you look at the different benchmarks-- breadth-first search, 787 00:32:18,530 --> 00:32:22,990 connected components, shortest path algorithms-- what you find 788 00:32:22,990 --> 00:32:27,900 is some frameworks are good sometimes. 
789 00:32:27,900 --> 00:32:29,820 They might be really bad at other times-- 790 00:32:29,820 --> 00:32:32,090 for some algorithms, some types of data, 791 00:32:32,090 --> 00:32:33,350 they can be really bad. 792 00:32:33,350 --> 00:32:36,890 So this framework was really good on this data set, 793 00:32:36,890 --> 00:32:39,530 but really bad on this data, and really 794 00:32:39,530 --> 00:32:41,100 not good on this algorithm. 795 00:32:41,100 --> 00:32:43,020 We are good most of the time. 796 00:32:43,020 --> 00:32:46,850 The reason is we don't fix a few decisions in advance. 797 00:32:46,850 --> 00:32:49,520 In GraphIt, what it will do is it will give you this ability 798 00:32:49,520 --> 00:32:51,860 to try different things. 799 00:32:51,860 --> 00:32:55,430 And depending on the graph, depending on the algorithm, 800 00:32:55,430 --> 00:32:58,680 some optimizations might work better than others. 801 00:32:58,680 --> 00:33:01,250 This is exactly what you guys have been doing in the class. 802 00:33:01,250 --> 00:33:04,803 You are trying different optimizations by hand. 803 00:33:04,803 --> 00:33:07,220 The difference is, every time you thought about optimizing, 804 00:33:07,220 --> 00:33:10,370 you had to go change the entire program to make that work. 805 00:33:10,370 --> 00:33:12,780 Here you just change the scheduling language one way, 806 00:33:12,780 --> 00:33:16,960 recompile, run, measure, and you can do this fast. 807 00:33:16,960 --> 00:33:19,820 Any questions so far before I switch gears? 808 00:33:24,790 --> 00:33:26,290 AUDIENCE: [INAUDIBLE] 809 00:33:26,290 --> 00:33:28,130 SAMAN AMARASIGNHE: OK. 810 00:33:28,130 --> 00:33:31,690 So I'm going to switch to another domain 811 00:33:31,690 --> 00:33:34,560 specific language. 812 00:33:34,560 --> 00:33:37,040 You will find a lot of similarities, a lot of parallels 813 00:33:37,040 --> 00:33:37,540 in here. 814 00:33:37,540 --> 00:33:38,707 This was intentional. 
815 00:33:38,707 --> 00:33:40,540 I could have talked about many different domain 816 00:33:40,540 --> 00:33:41,420 specific languages. 817 00:33:41,420 --> 00:33:44,770 But I took another one that almost 818 00:33:44,770 --> 00:33:47,470 has kind of mirror similarities to what's 819 00:33:47,470 --> 00:33:48,280 going on. 820 00:33:48,280 --> 00:33:51,120 And you will see a pattern in here, hopefully. 821 00:33:51,120 --> 00:33:54,370 And after this, I will ask you what the patterns are. 822 00:33:54,370 --> 00:33:56,600 This language is Halide. 823 00:33:56,600 --> 00:33:59,890 It was originally developed for image processing. 824 00:33:59,890 --> 00:34:03,580 And its focus is-- GraphIt focused on sparse graph data 825 00:34:03,580 --> 00:34:04,400 structures. 826 00:34:04,400 --> 00:34:07,210 Halide's focus is-- because images 827 00:34:07,210 --> 00:34:09,219 are dense, regular structures. 828 00:34:09,219 --> 00:34:11,139 You do regular computation on the images. 829 00:34:11,139 --> 00:34:12,520 And you process this thing. 830 00:34:12,520 --> 00:34:14,080 And you have a very complex pipeline. 831 00:34:14,080 --> 00:34:15,497 Like, for example, a camera pipeline 832 00:34:15,497 --> 00:34:18,460 does many very complex algorithms to the image 833 00:34:18,460 --> 00:34:22,719 before you get from the bits coming out of your CCD 834 00:34:22,719 --> 00:34:25,840 to the beautiful picture you see in Facebook. 835 00:34:30,650 --> 00:34:33,000 And the primary goal of Halide was 836 00:34:33,000 --> 00:34:35,610 we wanted to match and exceed hand-optimized performance, 837 00:34:35,610 --> 00:34:36,110 basically. 838 00:34:36,110 --> 00:34:38,600 This was the property we wanted. 839 00:34:38,600 --> 00:34:41,400 And we want to reduce the rote amount of programming 840 00:34:41,400 --> 00:34:43,650 that normally a performance engineer has 841 00:34:43,650 --> 00:34:44,965 to do to achieve this thing. 
842 00:34:44,965 --> 00:34:47,340 And we want to also increase the portability, the ability 843 00:34:47,340 --> 00:34:48,840 to take that program from one machine 844 00:34:48,840 --> 00:34:49,500 to a different one. 845 00:34:49,500 --> 00:34:51,889 So let me give you an example. 846 00:34:51,889 --> 00:34:56,800 Here is a three by three blur example. 847 00:34:56,800 --> 00:34:59,760 So what this does is this [INAUDIBLE]-- two loops go 848 00:34:59,760 --> 00:35:05,190 in the x direction and do a blur in the x direction-- 849 00:35:05,190 --> 00:35:07,380 average the three values next to each other. 850 00:35:07,380 --> 00:35:08,760 And then it will take 851 00:35:08,760 --> 00:35:12,660 the result of that, do it in the y direction, and average that. 852 00:35:12,660 --> 00:35:18,840 OK, a very simple filter that you might want to do for an image, 853 00:35:18,840 --> 00:35:19,860 you can run this. 854 00:35:19,860 --> 00:35:21,940 This is valid C code. 855 00:35:21,940 --> 00:35:23,820 But if you want to get performance, 856 00:35:23,820 --> 00:35:26,730 you want to generate this guy. 857 00:35:26,730 --> 00:35:29,970 This thing, on the other hand, ran about 11 times 858 00:35:29,970 --> 00:35:31,890 faster than this one. 859 00:35:31,890 --> 00:35:33,720 This has been tiled. 860 00:35:33,720 --> 00:35:35,370 It has fused multiple loops. 861 00:35:35,370 --> 00:35:36,360 It has vectorized. 862 00:35:36,360 --> 00:35:37,650 It has multi-threaded. 863 00:35:37,650 --> 00:35:39,300 It has to do some redundant computation 864 00:35:39,300 --> 00:35:40,840 I'll get to a little bit later. 865 00:35:40,840 --> 00:35:44,020 And it basically gives near roof-line optimum performance. 866 00:35:44,020 --> 00:35:47,760 That means it's using the machine resources to the max. 867 00:35:47,760 --> 00:35:50,410 Because this has a bunch of floating point operations. 
868 00:35:50,410 --> 00:35:52,620 So basically, the floating point unit is 869 00:35:52,620 --> 00:35:53,970 running at max performance. 870 00:35:53,970 --> 00:35:57,450 So there's nothing much else you could do to this one. 871 00:35:57,450 --> 00:35:59,270 But you write this thing. 872 00:35:59,270 --> 00:36:01,890 And this is not that easy. 873 00:36:01,890 --> 00:36:09,270 So this project started some time ago with one of my-- 874 00:36:09,270 --> 00:36:12,400 the person who did it-- going to Adobe. 875 00:36:12,400 --> 00:36:14,010 He went to Adobe. 876 00:36:14,010 --> 00:36:17,670 And they had this thing called a local Laplacian 877 00:36:17,670 --> 00:36:20,190 filter in the Camera Raw, Lightroom, 878 00:36:20,190 --> 00:36:22,160 and Photoshop projects in here. 879 00:36:22,160 --> 00:36:27,180 The reference implementation was about 300 lines of code. 880 00:36:27,180 --> 00:36:29,460 But the implementation that they used 881 00:36:29,460 --> 00:36:31,440 was about 1,500 lines of code. 882 00:36:31,440 --> 00:36:34,020 It took one of their best engineers three months 883 00:36:34,020 --> 00:36:35,190 to get to that performance. 884 00:36:35,190 --> 00:36:36,570 But it made sense. 885 00:36:36,570 --> 00:36:39,330 Because that engineer was able to get 886 00:36:39,330 --> 00:36:44,560 10x faster by trial and error for this piece of code. 887 00:36:44,560 --> 00:36:49,180 It's a non-trivial piece of coding here to go do that. 888 00:36:49,180 --> 00:36:53,610 So the student, Jonathan, who's now a professor at Berkeley, 889 00:36:53,610 --> 00:36:58,290 basically, in one day, in 60 lines of Halide, 890 00:36:58,290 --> 00:37:04,640 was able to beat the Adobe code by 2x, in some sense. 891 00:37:04,640 --> 00:37:07,440 And then Adobe, in those days, didn't 892 00:37:07,440 --> 00:37:11,220 generate any code for GPUs. 893 00:37:11,220 --> 00:37:14,370 Because they decided GPUs change too fast. 
894 00:37:14,370 --> 00:37:17,570 And they can't keep up updating for GPUs in every generation. 895 00:37:17,570 --> 00:37:20,160 Because of that, they were not-- 896 00:37:20,160 --> 00:37:22,112 the Adobe applications were not using GPUs. 897 00:37:22,112 --> 00:37:24,320 So if you ran Photoshop, it's not going to use a GPU, 898 00:37:24,320 --> 00:37:26,000 even if your machine has a GPU. 899 00:37:26,000 --> 00:37:28,440 So Jonathan still had some time left in the day. 900 00:37:28,440 --> 00:37:30,750 So he said, OK, let me try to write it on GPUs. 901 00:37:30,750 --> 00:37:32,690 So with basically the same code, he 902 00:37:32,690 --> 00:37:38,940 changed the schedule for GPUs and got 9x faster than the fastest Adobe 903 00:37:38,940 --> 00:37:41,370 had ever had for this piece of code. 904 00:37:41,370 --> 00:37:43,500 So how did he do it? 905 00:37:43,500 --> 00:37:48,060 Again, the key principle here is decoupling the algorithm 906 00:37:48,060 --> 00:37:49,390 from the schedule. 907 00:37:49,390 --> 00:37:52,230 The algorithm, again, is what is computed. 908 00:37:52,230 --> 00:37:54,230 And the algorithm defines the pipeline 909 00:37:54,230 --> 00:37:57,380 of very simple pure functions operating in there. 910 00:37:57,380 --> 00:38:00,930 And execution order, parallelism, all those things 911 00:38:00,930 --> 00:38:02,670 are left to the schedule. 912 00:38:02,670 --> 00:38:04,590 The pipeline in Halide just looks 913 00:38:04,590 --> 00:38:06,680 like this for the blur filter. 914 00:38:06,680 --> 00:38:10,890 It says, OK, blur the image in the x dimension. 915 00:38:10,890 --> 00:38:12,680 And then do a blur in the y dimension. 916 00:38:12,680 --> 00:38:13,180 That's all. 917 00:38:13,180 --> 00:38:14,408 And the image size is-- 918 00:38:14,408 --> 00:38:16,200 because it's operating on the entire image, 919 00:38:16,200 --> 00:38:17,980 you don't have loops in here. 920 00:38:17,980 --> 00:38:20,460 That's all you have to say there. 
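For reference, Halide's published blur example expresses exactly this split. The snippet below is a sketch in Halide's C++-embedded syntax, with tile sizes and vector widths taken as illustrative; it assumes an input `Func in` and the Halide headers, so it is not compilable on its own.

```cpp
// Algorithm: what is computed -- a pipeline of pure functions, no loops.
Func blur_x, blur_y;
Var x, y, xi, yi;
blur_x(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3;
blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

// Schedule: when and where it is computed -- tiling, vectorization,
// multithreading, and where blur_x is materialized relative to blur_y.
blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
blur_x.compute_at(blur_y, x).vectorize(x, 8);
```

Changing only the last two lines retargets the same algorithm to a different machine, which is how the GPU version came from "basically the same code."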
921 00:38:20,460 --> 00:38:22,440 Then you have to come up with a schedule. 922 00:38:22,440 --> 00:38:27,720 Again, the schedule says when and where it's computed. 923 00:38:27,720 --> 00:38:30,420 It needs to be simple, so that you can actually tell it that. 924 00:38:30,420 --> 00:38:31,380 And it has to be powerful. 925 00:38:31,380 --> 00:38:33,755 You need to be able to get the hand-optimized performance 926 00:38:33,755 --> 00:38:34,930 or better by doing this. 927 00:38:38,507 --> 00:38:40,090 This should look a little bit familiar. 928 00:38:40,090 --> 00:38:43,350 Because all these things, a lot of the work 929 00:38:43,350 --> 00:38:46,410 you do in performance, kind of fit into this genre. 930 00:38:46,410 --> 00:38:50,010 You need to do a trade-off between locality, parallelism, 931 00:38:50,010 --> 00:38:51,045 and redundant work. 932 00:38:51,045 --> 00:38:52,420 That's what you look for in here. 933 00:38:55,990 --> 00:39:02,300 So let's look at the three things you need to do. 934 00:39:02,300 --> 00:39:04,010 First, you need to get parallelism. 935 00:39:04,010 --> 00:39:06,610 Parallelism is you need to keep the multi-cores and vector 936 00:39:06,610 --> 00:39:09,768 units happy and probably the GPU busy. 937 00:39:09,768 --> 00:39:11,310 But if you have too much parallelism, 938 00:39:11,310 --> 00:39:12,532 it's not going to help you. 939 00:39:12,532 --> 00:39:14,740 I mean, nobody is going to take advantage [INAUDIBLE] 940 00:39:14,740 --> 00:39:15,500 parallelism. 941 00:39:15,500 --> 00:39:17,208 So let's look at a piece of code in here. 942 00:39:20,340 --> 00:39:22,120 So assume I am going to say, I'm going 943 00:39:22,120 --> 00:39:24,000 to run all these things parallel and all these things 944 00:39:24,000 --> 00:39:24,840 parallel afterwards. 945 00:39:24,840 --> 00:39:27,150 If you have three cores-- 946 00:39:27,150 --> 00:39:29,260 great, I got a lot more parallelism. 947 00:39:29,260 --> 00:39:31,100 I got six times parallelism. 
948 00:39:31,100 --> 00:39:33,330 Hurrah, nobody's going to use that. 949 00:39:33,330 --> 00:39:36,900 It's not that useful to get six times parallelism in here. 950 00:39:36,900 --> 00:39:41,070 On the other hand, if you run like this, one at a time, 951 00:39:41,070 --> 00:39:42,505 you have parallelism of one. 952 00:39:42,505 --> 00:39:43,380 That's not that good. 953 00:39:43,380 --> 00:39:45,850 Because you're going to not use the machine. 954 00:39:45,850 --> 00:39:49,590 So what you really want is something basically-- 955 00:39:49,590 --> 00:39:51,360 OK, wait till it's done-- that actually 956 00:39:51,360 --> 00:39:53,430 do parallelisms of three might be 957 00:39:53,430 --> 00:39:55,110 the best way of running that machine 958 00:39:55,110 --> 00:39:56,730 to get best performance. 959 00:39:56,730 --> 00:39:57,730 You don't want too much. 960 00:39:57,730 --> 00:39:58,680 You don't want too little. 961 00:39:58,680 --> 00:40:00,263 You want to get the exact right thing. 962 00:40:03,520 --> 00:40:07,960 The next interesting thing you need to get is locality. 963 00:40:07,960 --> 00:40:10,630 Normally, when you do image processing, what you do 964 00:40:10,630 --> 00:40:13,978 is you change everything in the image in one filter. 965 00:40:13,978 --> 00:40:16,270 Then the next filter has to go in and change everything 966 00:40:16,270 --> 00:40:17,450 in the image. 967 00:40:17,450 --> 00:40:19,440 So what happens if one filter ran 968 00:40:19,440 --> 00:40:21,710 through the entire image and the next 969 00:40:21,710 --> 00:40:23,710 come and start running through the entire image? 970 00:40:23,710 --> 00:40:26,500 What happens, basically? 971 00:40:26,500 --> 00:40:27,970 Is that good? 972 00:40:27,970 --> 00:40:29,710 I give the entire image, say you, 973 00:40:29,710 --> 00:40:31,840 do my first color correction. 974 00:40:31,840 --> 00:40:35,150 And I will do some kind of aberration correction 975 00:40:35,150 --> 00:40:35,650 afterwards. 
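The point about the "exact right" amount of parallelism can be sketched as follows: give each core one strip of rows, rather than spawning maximal parallelism. The helper name and the row-strip scheme are illustrative assumptions, not anything from the lecture slides.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Split the rows of a W x H image across exactly as many threads as the
// machine has cores: enough parallelism to keep every core busy, but not
// more, since parallelism beyond the core count only adds overhead.
void parallel_rows(std::vector<float>& img, int W, int H,
                   const std::function<void(float*, int)>& body) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        int y0 = (int)((long long)H * t / n);
        int y1 = (int)((long long)H * (t + 1) / n);
        workers.emplace_back([&img, &body, W, y0, y1] {
            for (int y = y0; y < y1; ++y)   // each thread owns a disjoint strip
                body(&img[(size_t)y * W], W);
        });
    }
    for (auto& w : workers) w.join();
}
```

A per-pixel filter then runs with parallelism matched to the machine, the "parallelism of three on three cores" situation, rather than one task per pixel.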
976 00:40:35,650 --> 00:40:39,370 So what happens if you do something like that? 977 00:40:39,370 --> 00:40:41,020 Entire image, process one filter, 978 00:40:41,020 --> 00:40:43,480 then the next filter takes the image and process the entire 979 00:40:43,480 --> 00:40:46,510 [INAUDIBLE] or whatever multi-megapixel image-- 980 00:40:48,642 --> 00:40:50,100 oh, you-- you're on a [INAUDIBLE].. 981 00:40:50,100 --> 00:40:51,262 OK, back there. 982 00:40:51,262 --> 00:40:53,318 AUDIENCE: You end up kicking [INAUDIBLE].. 983 00:40:53,318 --> 00:40:54,860 SAMAN AMARASIGNHE: [INAUDIBLE] cache. 984 00:40:54,860 --> 00:40:57,350 Because if the image is large, it doesn't fit in the cache. 985 00:40:57,350 --> 00:40:59,000 It's not that great to do this. 986 00:40:59,000 --> 00:41:01,610 You won't get things in the cache in here. 987 00:41:01,610 --> 00:41:07,370 So assume I go like this, processing the entire first row 988 00:41:07,370 --> 00:41:09,240 before you go to the second row. 989 00:41:09,240 --> 00:41:13,880 So what happens now here is we need to start touch this one-- 990 00:41:13,880 --> 00:41:16,100 I need to read these two values. 991 00:41:16,100 --> 00:41:17,570 And those two are-- the last time 992 00:41:17,570 --> 00:41:21,200 I read them was way before I started. 993 00:41:21,200 --> 00:41:22,220 So I [INAUDIBLE] them. 994 00:41:22,220 --> 00:41:23,620 I went through all the image. 995 00:41:23,620 --> 00:41:24,620 And I come back to that. 996 00:41:24,620 --> 00:41:28,613 And this distance-- by the time I reach here, 997 00:41:28,613 --> 00:41:30,530 I might-- these two might be out of the cache. 998 00:41:30,530 --> 00:41:33,650 And when I go back there, oops, it's not in the cache. 999 00:41:33,650 --> 00:41:36,183 I have a problem in that. 1000 00:41:36,183 --> 00:41:37,850 So the other way, a right way to do that 1001 00:41:37,850 --> 00:41:40,540 might be trying it this way. 
1002 00:41:40,540 --> 00:41:42,800 If I run it like this, basically, what 1003 00:41:42,800 --> 00:41:46,180 happens is as you run-- 1004 00:41:46,180 --> 00:41:47,990 when I touch this, I need-- 1005 00:41:47,990 --> 00:41:50,060 I need to get these three to run this thing. 1006 00:41:50,060 --> 00:41:53,210 The last time I read this one was just before, 1007 00:41:53,210 --> 00:41:55,880 in the previous iteration. 1008 00:41:55,880 --> 00:41:57,020 So to get to that-- 1009 00:41:57,020 --> 00:41:58,087 I just touched it. 1010 00:41:58,087 --> 00:42:00,420 So the next guy uses it, the next guy, and after, after. 1011 00:42:00,420 --> 00:42:01,500 I go through my window. 1012 00:42:01,500 --> 00:42:02,750 I never touch that again. 1013 00:42:02,750 --> 00:42:04,790 I have really good locality in here. 1014 00:42:04,790 --> 00:42:06,610 So I want to operate it that way. 1015 00:42:06,610 --> 00:42:09,000 I will get good locality in here. 1016 00:42:09,000 --> 00:42:11,090 So redundant work is a very interesting thing. 1017 00:42:11,090 --> 00:42:15,050 Sometimes, if you want to get both locality and parallelism, 1018 00:42:15,050 --> 00:42:17,110 you might have to do some extra work, 1019 00:42:17,110 --> 00:42:19,740 a little bit of extra work. 1020 00:42:19,740 --> 00:42:25,520 So assume in this one I had to process these elements in parallel 1021 00:42:25,520 --> 00:42:27,710 if I want to run these three. 1022 00:42:27,710 --> 00:42:32,250 Because these three need all these four elements in there. 1023 00:42:32,250 --> 00:42:34,040 These three need these four. 1024 00:42:34,040 --> 00:42:35,840 If I want to run these two parallel in two 1025 00:42:35,840 --> 00:42:39,410 different cores, it might be better if both 1026 00:42:39,410 --> 00:42:41,063 calculate these two values. 1027 00:42:41,063 --> 00:42:42,980 Because then I don't have to synchronize and stuff. 1028 00:42:42,980 --> 00:42:45,410 I can say, the left guy, calculate four values. 
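The "I just touched it" traversal is the sliding-window point in the schedule space discussed later: produce each row of the first stage just before the second stage consumes it, so only three rows of intermediate storage stay live (and hot in cache). A sketch of that idea for the 3x3 blur, with illustrative names and a zero border:

```cpp
#include <cassert>
#include <vector>

// Sliding-window 3x3 blur: the blur-in-y pass at row y needs blur-in-x rows
// y-1, y, y+1, so a rolling three-row buffer replaces the whole-image
// intermediate. Each intermediate row is consumed right after it is produced.
std::vector<float> blur_sliding(const std::vector<float>& in, int W, int H) {
    std::vector<float> out(W * H, 0.0f);
    std::vector<float> ring(3 * W, 0.0f);      // three live rows of blur-in-x
    auto produce_row = [&](int y, float* dst) {
        dst[0] = dst[W - 1] = 0.0f;            // zero border, as in the sketch
        for (int x = 1; x < W - 1; ++x)
            dst[x] = (in[y * W + x - 1] + in[y * W + x] + in[y * W + x + 1]) / 3.0f;
    };
    produce_row(0, &ring[0]);                  // prime the window
    produce_row(1, &ring[W]);
    for (int y = 1; y < H - 1; ++y) {
        produce_row(y + 1, &ring[((y + 1) % 3) * W]);  // just-in-time producer
        const float* up   = &ring[((y - 1) % 3) * W];
        const float* mid  = &ring[(y % 3) * W];
        const float* down = &ring[((y + 1) % 3) * W];
        for (int x = 1; x < W - 1; ++x)        // consumer reuses rows just touched
            out[y * W + x] = (up[x] + mid[x] + down[x]) / 3.0f;
    }
    return out;
}
```

The values being averaged were written moments before, so they are still in cache, regardless of image size.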
1029 00:42:45,410 --> 00:42:46,620 And then I can do the three. 1030 00:42:46,620 --> 00:42:48,328 The right guy, calculate the four values. 1031 00:42:48,328 --> 00:42:49,495 And then I can do the three. 1032 00:42:49,495 --> 00:42:50,570 I can do that in parallel. 1033 00:42:50,570 --> 00:42:54,470 But now the middle two guys, these two get calculated twice. 1034 00:42:54,470 --> 00:42:56,930 Because both need it. 1035 00:42:56,930 --> 00:43:00,590 And so what that means is-- oops, you can keep that. 1036 00:43:00,590 --> 00:43:03,348 So sometimes, to do everything, I 1037 00:43:03,348 --> 00:43:04,890 might have to do some redundant work. 1038 00:43:04,890 --> 00:43:08,120 So the way to look at that is I can put this 1039 00:43:08,120 --> 00:43:11,870 into this scheduling framework. 1040 00:43:11,870 --> 00:43:14,150 One axis is my computation granularity. 1041 00:43:14,150 --> 00:43:16,700 Coarse interleaving means low locality. 1042 00:43:16,700 --> 00:43:18,870 That means I finish everything before I go back 1043 00:43:18,870 --> 00:43:21,020 in here between two things. 1044 00:43:21,020 --> 00:43:22,700 If I run two things, I finish this one 1045 00:43:22,700 --> 00:43:24,410 before I go to the next one. 1046 00:43:24,410 --> 00:43:27,120 Fine interleaving means I process one element, 1047 00:43:27,120 --> 00:43:29,900 duh duh duh duh, go back and forth in here. 1048 00:43:29,900 --> 00:43:32,340 Those are my two options here. 1049 00:43:32,340 --> 00:43:34,700 The other axis is storage granularity. 1050 00:43:34,700 --> 00:43:37,760 What that means is-- 1051 00:43:37,760 --> 00:43:41,360 storage granularity very low means I calculate something, 1052 00:43:41,360 --> 00:43:42,320 and I don't remember it. 1053 00:43:42,320 --> 00:43:46,760 Next time I want it, I recalculate it again. 1054 00:43:46,760 --> 00:43:49,910 Very high storage granularity means once I calculate it, 1055 00:43:49,910 --> 00:43:51,200 I will remember it forever. 
1056 00:43:51,200 --> 00:43:53,630 Anytime you need that value, I have it back for you. 1057 00:43:53,630 --> 00:43:56,210 So that means I have to get it to you from anywhere 1058 00:43:56,210 --> 00:43:57,230 I calculated. 1059 00:43:57,230 --> 00:43:59,960 Storage granularity low means my process, I calculate, I use, 1060 00:43:59,960 --> 00:44:00,710 I throw it out. 1061 00:44:00,710 --> 00:44:04,860 If anybody else want it, they'll recalculate again. 1062 00:44:04,860 --> 00:44:07,310 So now you can have many different computations 1063 00:44:07,310 --> 00:44:09,710 in different places of this space in here. 1064 00:44:09,710 --> 00:44:11,630 So if you want to compute something here, 1065 00:44:11,630 --> 00:44:13,850 this is the scheduling language. 1066 00:44:13,850 --> 00:44:16,370 That means I run this one, and I run this one. 1067 00:44:16,370 --> 00:44:18,680 I have no redundant computation, very 1068 00:44:18,680 --> 00:44:20,990 coarse grained interleaving. 1069 00:44:20,990 --> 00:44:22,970 That means I run the entire thing, and then 1070 00:44:22,970 --> 00:44:25,310 the next entire thing. 1071 00:44:25,310 --> 00:44:27,440 You can go very fine [INAUDIBLE] in here. 1072 00:44:27,440 --> 00:44:28,910 I'll calculate this one. 1073 00:44:28,910 --> 00:44:31,400 And I'll calculate these three again, these three again. 1074 00:44:31,400 --> 00:44:34,430 So everything is calculated multiple times. 1075 00:44:34,430 --> 00:44:37,370 When you need it, I recalculate every time I need something. 1076 00:44:37,370 --> 00:44:39,140 I don't store anything in here. 1077 00:44:39,140 --> 00:44:39,770 So it's good. 1078 00:44:39,770 --> 00:44:41,420 I have a lot of locality. 1079 00:44:41,420 --> 00:44:44,695 But I'm doing a lot of recomputation. 1080 00:44:44,695 --> 00:44:46,070 And then here, you have something 1081 00:44:46,070 --> 00:44:47,470 like a sliding window. 1082 00:44:47,470 --> 00:44:50,832 Basically, you are not recalculating anything. 
1083 00:44:50,832 --> 00:44:52,040 But you are sliding in there. 1084 00:44:52,040 --> 00:44:53,990 You have a little bit less parallelism. 1085 00:44:53,990 --> 00:44:56,810 And then you could capture this entire spectrum 1086 00:44:56,810 --> 00:44:59,780 in between in here. 1087 00:44:59,780 --> 00:45:03,800 And you can get different levels of fusion of these tiles. 1088 00:45:03,800 --> 00:45:06,230 And you can calculate-- 1089 00:45:06,230 --> 00:45:08,060 so I don't recalculate everything. 1090 00:45:08,060 --> 00:45:09,800 I recalculate a few things in here. 1091 00:45:09,800 --> 00:45:11,480 These two get recalculated. 1092 00:45:14,120 --> 00:45:14,870 And then you can-- 1093 00:45:14,870 --> 00:45:16,675 I'll go through this fast [INAUDIBLE].. 1094 00:45:16,675 --> 00:45:18,050 You can use all these operations. 1095 00:45:18,050 --> 00:45:19,700 So here is the interesting thing. 1096 00:45:19,700 --> 00:45:23,360 So here I am showing you different schedules 1097 00:45:23,360 --> 00:45:25,470 at different points in here. 1098 00:45:25,470 --> 00:45:26,720 So I'm going to run this. 1099 00:45:26,720 --> 00:45:28,400 This is showing time. 1100 00:45:28,400 --> 00:45:30,268 So what it says is this is doing-- 1101 00:45:30,268 --> 00:45:31,810 you're going through the first input, 1102 00:45:31,810 --> 00:45:34,640 [INAUDIBLE] the middle one, [INAUDIBLE] the output in here. 1103 00:45:34,640 --> 00:45:37,570 So this one has all the locality, but a lot of redundant work. 1104 00:45:37,570 --> 00:45:39,680 Some patterns have good locality. 1105 00:45:39,680 --> 00:45:41,180 Other patterns have not-so-good locality. 1106 00:45:41,180 --> 00:45:43,190 In here is some kind of intermediate thing. 1107 00:45:43,190 --> 00:45:45,410 So what it shows is these are no good. 1108 00:45:45,410 --> 00:45:47,710 A good balance between locality, parallelism, 1109 00:45:47,710 --> 00:45:49,920 and some redundant work seems to do really well. 
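That balance can be made concrete with a strip-tiled version of the 3x3 blur: each strip is independent (parallelism), its intermediate rows are small (locality), and the strip-boundary rows of the first pass are computed twice (the redundant work). Names and the strip scheme are illustrative sketches, not code from the lecture.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Strip-tiled 3x3 blur: the image is split into horizontal strips that can
// run independently, and each strip recomputes the halo rows of the first
// pass that it shares with its neighbours instead of synchronizing with them.
std::vector<float> blur_tiled(const std::vector<float>& in, int W, int H, int T) {
    std::vector<float> out(W * H, 0.0f);
    for (int y0 = 1; y0 < H - 1; y0 += T) {
        int y1 = std::min(y0 + T, H - 1);
        // Local blur-in-x for rows y0-1 .. y1. The boundary rows are also
        // computed by the adjacent strips: that is the redundant work.
        std::vector<float> bx((y1 - y0 + 2) * W, 0.0f);
        for (int y = y0 - 1; y <= y1; ++y)
            for (int x = 1; x < W - 1; ++x)
                bx[(y - y0 + 1) * W + x] =
                    (in[y * W + x - 1] + in[y * W + x] + in[y * W + x + 1]) / 3.0f;
        for (int y = y0; y < y1; ++y)          // blur bx in y within the strip
            for (int x = 1; x < W - 1; ++x)
                out[y * W + x] = (bx[(y - y0) * W + x] + bx[(y - y0 + 1) * W + x] +
                                  bx[(y - y0 + 2) * W + x]) / 3.0f;
    }
    return out;
}

// Plain two-pass version, used only to check the two schedules agree.
std::vector<float> blur_ref(const std::vector<float>& in, int W, int H) {
    std::vector<float> bx(W * H, 0.0f), out(W * H, 0.0f);
    for (int y = 0; y < H; ++y)
        for (int x = 1; x < W - 1; ++x)
            bx[y * W + x] = (in[y * W + x - 1] + in[y * W + x] + in[y * W + x + 1]) / 3.0f;
    for (int y = 1; y < H - 1; ++y)
        for (int x = 1; x < W - 1; ++x)
            out[y * W + x] = (bx[(y - 1) * W + x] + bx[y * W + x] + bx[(y + 1) * W + x]) / 3.0f;
    return out;
}
```

The strip height T is the scheduling knob: smaller strips mean better locality and more parallelism, but a larger fraction of recomputed halo rows, which is exactly the trade-off being plotted.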
1110 00:45:49,920 --> 00:45:52,052 This guy finished the fastest. 1111 00:45:52,052 --> 00:45:54,010 So what you do is you write different schedules 1112 00:45:54,010 --> 00:45:54,718 for these things. 1113 00:45:54,718 --> 00:45:56,040 And you keep running. 1114 00:45:56,040 --> 00:45:58,490 And we figured out what schedule works. 1115 00:45:58,490 --> 00:46:01,250 So this is kind of trial and error part 1116 00:46:01,250 --> 00:46:02,329 you have to do in here. 1117 00:46:09,670 --> 00:46:14,400 So if you look at what's going on 1118 00:46:14,400 --> 00:46:18,220 in here, what you see here is-- 1119 00:46:18,220 --> 00:46:21,300 there's some example-- is bilateral filter computation 1120 00:46:21,300 --> 00:46:21,960 here. 1121 00:46:21,960 --> 00:46:26,370 What it says is the original is about 122 lines of C++ code. 1122 00:46:26,370 --> 00:46:31,320 And you found something with a good parallelism in here. 1123 00:46:31,320 --> 00:46:35,580 But we could write it in 32 lines of Halide in here. 1124 00:46:35,580 --> 00:46:40,530 And we were able to get about 6x faster than CPU. 1125 00:46:40,530 --> 00:46:43,860 But the best algorithm was somebody hand 1126 00:46:43,860 --> 00:46:46,320 wrote for the paper on GPUs. 1127 00:46:46,320 --> 00:46:48,810 And what it did it was it gave up some parallelism 1128 00:46:48,810 --> 00:46:50,470 for much better locality. 1129 00:46:50,470 --> 00:46:52,470 And if you give up some parallelism, much better 1130 00:46:52,470 --> 00:46:54,510 locality, because we can optimize in that, 1131 00:46:54,510 --> 00:46:57,060 we got faster than their handwritten algorithm. 1132 00:46:57,060 --> 00:47:00,420 So we can change something. 1133 00:47:00,420 --> 00:47:03,030 Here's, again, another algorithm that 1134 00:47:03,030 --> 00:47:06,260 is doing segmenting in here. 1135 00:47:06,260 --> 00:47:08,070 And it was written in MATLAB. 1136 00:47:08,070 --> 00:47:11,220 And MATLAB is a lot less lines of code, of course. 
1137 00:47:11,220 --> 00:47:13,080 But in Halide, it's a few more lines, 1138 00:47:13,080 --> 00:47:16,080 because you're not just calling library functions. 1139 00:47:16,080 --> 00:47:19,330 And Halide was 70 times faster. 1140 00:47:19,330 --> 00:47:23,220 And if you run it on the GPU versus MATLAB, it's about 100-- 1141 00:47:23,220 --> 00:47:26,250 1,000 times faster. 1142 00:47:26,250 --> 00:47:30,540 It's not because you're running bad MATLAB loops. 1143 00:47:30,540 --> 00:47:34,260 In fact, what MATLAB did was call very well hand-optimized 1144 00:47:34,260 --> 00:47:35,500 libraries. 1145 00:47:35,500 --> 00:47:38,000 But the problem with calling libraries is there's no locality. 1146 00:47:38,000 --> 00:47:40,780 I call a really fast library for the first routine. 1147 00:47:40,780 --> 00:47:42,533 It runs really fast. 1148 00:47:42,533 --> 00:47:43,950 And then you call the next routine 1149 00:47:43,950 --> 00:47:45,908 that has to [INAUDIBLE] the entire image again. 1150 00:47:45,908 --> 00:47:47,940 And now my image is completely out of the cache. 1151 00:47:47,940 --> 00:47:50,370 So what happens is, between these very fast libraries, 1152 00:47:50,370 --> 00:47:52,210 you're bringing the image back into cache. 1153 00:47:52,210 --> 00:47:54,330 And when you have something like a library, 1154 00:47:54,330 --> 00:47:57,300 you can't fuse library functions together. 1155 00:47:57,300 --> 00:47:59,010 In Halide, we can fuse them together 1156 00:47:59,010 --> 00:48:01,322 and say, oh, I take this line of the image, 1157 00:48:01,322 --> 00:48:03,030 and I will do everything on that before I 1158 00:48:03,030 --> 00:48:04,150 move to the next thing. 1159 00:48:04,150 --> 00:48:06,030 So I can do it much faster. 1160 00:48:06,030 --> 00:48:08,010 My feeling is each function 1161 00:48:08,010 --> 00:48:11,490 in MATLAB probably was faster, because they have a handwritten, really 1162 00:48:11,490 --> 00:48:12,330 fast thing. 
1163 00:48:12,330 --> 00:48:14,200 But the copying of data from-- 1164 00:48:14,200 --> 00:48:17,040 the moving from cache, the cache effects, 1165 00:48:17,040 --> 00:48:20,580 were really slowing it down. 1166 00:48:20,580 --> 00:48:24,150 So here's the thing that we showed before. 1167 00:48:24,150 --> 00:48:26,280 This is a very complicated algorithm. 1168 00:48:26,280 --> 00:48:28,530 It's what we call a pyramidal algorithm. 1169 00:48:28,530 --> 00:48:30,810 So what it does is you take a [INAUDIBLE] in here. 1170 00:48:30,810 --> 00:48:33,300 And you divide it into a bunch of blocks in here 1171 00:48:33,300 --> 00:48:35,400 in each level of the pyramid. 1172 00:48:35,400 --> 00:48:40,110 And you do some computation, do some lookups, and do some 1173 00:48:40,110 --> 00:48:41,370 upsampling in here. 1174 00:48:41,370 --> 00:48:44,400 You do some additional computation and compute that. 1175 00:48:44,400 --> 00:48:47,680 And then you create more and more, smaller and smaller images 1176 00:48:47,680 --> 00:48:48,180 in here. 1177 00:48:48,180 --> 00:48:50,180 You do-- you basically [INAUDIBLE] image pyramid 1178 00:48:50,180 --> 00:48:50,860 in here. 1179 00:48:50,860 --> 00:48:55,380 And so to do this right, it's not that simple. 1180 00:48:55,380 --> 00:48:57,290 What that means is, in each of these levels, 1181 00:48:57,290 --> 00:48:59,850 there are different balances you want to be at. 1182 00:48:59,850 --> 00:49:02,722 If you have a lot of data, parallelism is not 1183 00:49:02,722 --> 00:49:03,930 that important at that point. 1184 00:49:03,930 --> 00:49:05,160 Because you have parallelism anyway. 1185 00:49:05,160 --> 00:49:07,410 You probably have to focus a lot more on locality. 1186 00:49:07,410 --> 00:49:09,150 But when you get to the smaller amounts, 1187 00:49:09,150 --> 00:49:10,380 I think parallelism matters. 1188 00:49:10,380 --> 00:49:12,870 So you have to come up with very interesting balances 1189 00:49:12,870 --> 00:49:13,680 between those. 
1190 00:49:13,680 --> 00:49:16,390 So many, many things to tune at every level. 1191 00:49:16,390 --> 00:49:17,550 There are not three things. 1192 00:49:17,550 --> 00:49:19,140 There are hundreds of different levels. 1193 00:49:19,140 --> 00:49:21,100 So the nice thing about Halide is 1194 00:49:21,100 --> 00:49:23,370 you can play with all these things. 1195 00:49:23,370 --> 00:49:25,462 You can play with all these different concepts 1196 00:49:25,462 --> 00:49:27,420 and figure out which actually gives the fastest 1197 00:49:27,420 --> 00:49:28,253 performance in that. 1198 00:49:31,910 --> 00:49:36,790 So a little bit of, I would say, bragging rights 1199 00:49:36,790 --> 00:49:41,800 in here for Halide. Halide left MIT about, I think, 1200 00:49:41,800 --> 00:49:43,510 six years ago. 1201 00:49:43,510 --> 00:49:46,660 And right now, it's everywhere in Google. 1202 00:49:46,660 --> 00:49:49,056 So it's on Android phones. 1203 00:49:49,056 --> 00:49:50,845 It started with Google Glass, 1204 00:49:50,845 --> 00:49:54,850 which doesn't exist anymore. 1205 00:49:54,850 --> 00:49:57,550 And in fact, 1206 00:49:57,550 --> 00:50:01,030 all the images, all the videos uploaded to YouTube right now, 1207 00:50:01,030 --> 00:50:02,620 they do front-end processing. 1208 00:50:02,620 --> 00:50:05,600 And that processing pipeline is written in Halide. 1209 00:50:05,600 --> 00:50:11,110 And they switched to Halide because the Halide code was, 1210 00:50:11,110 --> 00:50:14,650 I think, 4-5% faster than the previous version. 1211 00:50:14,650 --> 00:50:19,480 And 4-5% faster for Google was multi-million dollars 1212 00:50:19,480 --> 00:50:22,420 saved for them, because there are so many videos 1213 00:50:22,420 --> 00:50:23,680 getting uploaded to that. 
1214 00:50:26,290 --> 00:50:28,840 So recently, there was a Photoshop announcement 1215 00:50:28,840 --> 00:50:31,750 saying they have an iOS version of Photoshop 1216 00:50:31,750 --> 00:50:32,352 from Adobe. 1217 00:50:32,352 --> 00:50:33,310 They just announced it. 1218 00:50:33,310 --> 00:50:34,690 I don't think it's even out yet. 1219 00:50:34,690 --> 00:50:37,960 And the entire set of Photoshop filters is 1220 00:50:37,960 --> 00:50:39,810 written in this new version using Halide. 1221 00:50:46,280 --> 00:50:49,640 Qualcomm released this processor called the Snapdragon image 1222 00:50:49,640 --> 00:50:50,390 processor. 1223 00:50:50,390 --> 00:50:53,950 So they built that processor to do image processing in there. 1224 00:50:53,950 --> 00:50:57,560 And the programming language to program that processor 1225 00:50:57,560 --> 00:50:58,940 is basically Halide. 1226 00:50:58,940 --> 00:51:00,590 So you write the code in Halide. 1227 00:51:00,590 --> 00:51:02,450 So that is kind of the assembly level 1228 00:51:02,450 --> 00:51:05,600 that makes it available for this in here. 1229 00:51:05,600 --> 00:51:07,410 And also, Intel is using that. 1230 00:51:07,410 --> 00:51:11,090 So there's a lot of use of this system at this point, which 1231 00:51:11,090 --> 00:51:15,560 is really fun to see-- an academic project getting to a point where it's 1232 00:51:15,560 --> 00:51:16,910 very heavily used. 1233 00:51:16,910 --> 00:51:19,955 And part of that is because it's very useful. 1234 00:51:19,955 --> 00:51:22,370 Because people realize they need to optimize 1235 00:51:22,370 --> 00:51:26,270 this code, because for cameras and stuff, performance matters. 1236 00:51:26,270 --> 00:51:29,900 And instead of having some poor engineer spend 1237 00:51:29,900 --> 00:51:32,990 months in the corner, just trying out those things, 1238 00:51:32,990 --> 00:51:37,410 you can try the same things, and a lot more, by doing it faster. 
1239 00:51:37,410 --> 00:51:39,360 OK, so let me ask you a question. 1240 00:51:39,360 --> 00:51:44,750 So now, between Halide and GraphIt, what did you find? 1241 00:51:47,918 --> 00:51:48,960 A bunch of similarities-- 1242 00:51:48,960 --> 00:51:52,590 I want to figure out, are there any interesting similarities 1243 00:51:52,590 --> 00:51:54,630 you guys found between these two projects? 1244 00:52:04,490 --> 00:52:08,313 AUDIENCE: They both allow you to try optimizations really fast. 1245 00:52:08,313 --> 00:52:09,730 SAMAN AMARASIGNHE: So part of that 1246 00:52:09,730 --> 00:52:14,380 is also that a lot of times, compilers are kind of a black box. 1247 00:52:14,380 --> 00:52:16,150 We know everything, just feed us, 1248 00:52:16,150 --> 00:52:18,110 we'll give you the really fast code. 1249 00:52:18,110 --> 00:52:21,100 And the problem is, they're never the fastest. 1250 00:52:21,100 --> 00:52:23,670 So if you really care about performance, you get 90%. 1251 00:52:23,670 --> 00:52:25,780 Then you get really frustrated-- now what do I do? 1252 00:52:25,780 --> 00:52:28,240 But this was, OK, I'm not going to-- 1253 00:52:28,240 --> 00:52:29,800 you know better what to do. 1254 00:52:29,800 --> 00:52:31,510 But I'll make your life simpler. 1255 00:52:31,510 --> 00:52:33,760 So we still want the performance engineer. 1256 00:52:33,760 --> 00:52:37,480 It's not that a person who just doesn't understand performance 1257 00:52:37,480 --> 00:52:38,572 feeds it in and gets fast code. 1258 00:52:38,572 --> 00:52:40,030 We need a performance-- but we want 1259 00:52:40,030 --> 00:52:42,270 to make the performance engineer's life easier. 1260 00:52:42,270 --> 00:52:44,675 So both of them said, OK, we need a performance engineer. 1261 00:52:44,675 --> 00:52:45,550 We can't automate it. 1262 00:52:45,550 --> 00:52:47,508 We don't know how to automate all these things. 1263 00:52:47,508 --> 00:52:49,000 There's too much complexity. 
1264 00:52:49,000 --> 00:52:53,050 But we will let you, the performance engineer, explain what to do. 1265 00:52:53,050 --> 00:52:55,330 But we'll make your life very simple. 1266 00:52:55,330 --> 00:52:57,112 What else? 1267 00:52:57,112 --> 00:52:58,534 AUDIENCE: Something that was cool 1268 00:52:58,534 --> 00:53:00,930 was both of these languages can do 1269 00:53:00,930 --> 00:53:03,420 algorithmic level optimizations [INAUDIBLE], 1270 00:53:03,420 --> 00:53:07,757 which is pretty different from what compilers like GCC 1271 00:53:07,757 --> 00:53:08,840 are explained [INAUDIBLE]. 1272 00:53:08,840 --> 00:53:10,257 SAMAN AMARASIGNHE: Yeah, because-- 1273 00:53:10,257 --> 00:53:11,810 I wouldn't say alg-- 1274 00:53:11,810 --> 00:53:14,910 you can do a lot of domain specific optimization. 1275 00:53:14,910 --> 00:53:17,320 So algorithmic optimization is one level higher. 1276 00:53:17,320 --> 00:53:19,562 You can say, ah ha, I have a better algorithm. 1277 00:53:19,562 --> 00:53:21,520 So, OK, I don't want to do a quicksort here. 1278 00:53:21,520 --> 00:53:22,540 I can do insertion sort. 1279 00:53:22,540 --> 00:53:24,510 Because between quicksort and insertion 1280 00:53:24,510 --> 00:53:27,140 sort, insertion sort might be faster for a certain class of inputs. 1281 00:53:27,140 --> 00:53:29,620 So that is a level of change we don't do. 1282 00:53:29,620 --> 00:53:31,662 Or, worse yet, I can say-- 1283 00:53:31,662 --> 00:53:34,120 this happens in a lot of things in machine learning-- yeah, 1284 00:53:34,120 --> 00:53:36,153 if I just drop a number here, I'm OK. 1285 00:53:36,153 --> 00:53:38,070 I don't have to get the compute exactly right. 1286 00:53:38,070 --> 00:53:38,728 Oh yeah, if I-- 1287 00:53:38,728 --> 00:53:40,270 I don't have to calculate everything. 1288 00:53:40,270 --> 00:53:42,690 If I calculate for 10 people, it's good enough. 1289 00:53:42,690 --> 00:53:44,910 So that kind of change, you can't do. 1290 00:53:44,910 --> 00:53:46,330 Because that's very contextual. 
1291 00:53:46,330 --> 00:53:48,110 Like, for example, a lot of time, 1292 00:53:48,110 --> 00:53:50,380 if you are doing things like machine learning, 1293 00:53:50,380 --> 00:53:52,270 there's no right answer. 1294 00:53:52,270 --> 00:53:53,710 You need to have a good answer. 1295 00:53:53,710 --> 00:53:56,500 So sometimes good means you can not do certain things. 1296 00:53:56,500 --> 00:54:00,310 And you need to find what things you shouldn't-- you cannot do, 1297 00:54:00,310 --> 00:54:03,160 that you get a huge benefit, but you don't lose that much. 1298 00:54:03,160 --> 00:54:04,540 That level you can't do that. 1299 00:54:04,540 --> 00:54:06,950 That's the next level of [INAUDIBLE] is saying, OK, 1300 00:54:06,950 --> 00:54:07,780 how do you do that? 1301 00:54:07,780 --> 00:54:09,940 How do you-- when somebody say, OK, look, 1302 00:54:09,940 --> 00:54:13,000 I can train it for 10 iterations versus 100-- 1303 00:54:13,000 --> 00:54:14,750 ah, 10 is good enough. 1304 00:54:14,750 --> 00:54:17,140 I can't-- if your code is written to train for 100 1305 00:54:17,140 --> 00:54:19,660 iterations, I can't tell you, oh yeah, 10 is good enough. 1306 00:54:19,660 --> 00:54:23,220 That is a decision that has to be a lot higher level than what 1307 00:54:23,220 --> 00:54:23,720 I can make. 1308 00:54:23,720 --> 00:54:26,500 So that's a-- that-- there's an interesting level that 1309 00:54:26,500 --> 00:54:30,200 can still exist on top of that, which we can't automate that 1310 00:54:30,200 --> 00:54:30,700 easily. 1311 00:54:30,700 --> 00:54:33,222 But we might be able to make, still, 1312 00:54:33,222 --> 00:54:35,680 a language, like a schedule language, give you that option. 1313 00:54:35,680 --> 00:54:37,400 That's a cool option to give, say 1314 00:54:37,400 --> 00:54:40,090 try some of these things that actually change the algorithm. 
1315 00:54:40,090 --> 00:54:42,970 But within the algorithm, that means I'll still 1316 00:54:42,970 --> 00:54:44,400 give you the same answer. 1317 00:54:44,400 --> 00:54:46,128 I will try different things. 1318 00:54:51,492 --> 00:54:52,200 Any other things? 1319 00:54:52,200 --> 00:54:56,816 Any other things you guys thought that was interesting? 1320 00:55:00,600 --> 00:55:03,540 How about from somewhere here? 1321 00:55:03,540 --> 00:55:05,470 What are the interesting things you found? 1322 00:55:08,200 --> 00:55:08,950 Back there. 1323 00:55:08,950 --> 00:55:11,280 AUDIENCE: They both involve a lot of trial and error. 1324 00:55:11,280 --> 00:55:12,780 SAMAN AMARASIGNHE: Yes, both involve 1325 00:55:12,780 --> 00:55:13,822 a lot of trial and error. 1326 00:55:13,822 --> 00:55:17,230 I mean, this is the modern computer systems. 1327 00:55:17,230 --> 00:55:19,160 Everything is extremely complicated. 1328 00:55:19,160 --> 00:55:20,870 There's no right way of doing things 1329 00:55:20,870 --> 00:55:23,272 when you look at this pretty large piece of code. 1330 00:55:23,272 --> 00:55:24,730 And there might be a lot of-- there 1331 00:55:24,730 --> 00:55:29,530 are caches, parallelism, locality, a lot of things 1332 00:55:29,530 --> 00:55:30,895 that can go right. 1333 00:55:30,895 --> 00:55:32,770 And so you might have to try out many things. 1334 00:55:32,770 --> 00:55:34,780 So if you know the answer, if you come up with the, 1335 00:55:34,780 --> 00:55:36,580 I know exactly, every time, I have the right answer, 1336 00:55:36,580 --> 00:55:37,570 that's amazing. 1337 00:55:37,570 --> 00:55:40,330 But even the best performance person 1338 00:55:40,330 --> 00:55:42,850 might not be able to look at a piece of code and say, ah ha, 1339 00:55:42,850 --> 00:55:43,725 I know your solution. 1340 00:55:43,725 --> 00:55:45,490 You do a lot of trial and error. 1341 00:55:45,490 --> 00:55:46,640 This kind of supports that. 
1342 00:55:46,640 --> 00:55:48,430 And you probably have figured that one out 1343 00:55:48,430 --> 00:55:49,513 for most of your projects. 1344 00:55:49,513 --> 00:55:52,120 It's not like you went and said, ah, I know what to do. 1345 00:55:52,120 --> 00:55:54,040 You probably did many different trials. 1346 00:55:54,040 --> 00:55:55,900 And I know that, for a lot of things, 1347 00:55:55,900 --> 00:55:58,510 they actually either have no impact or slow down the code. 1348 00:55:58,510 --> 00:56:00,010 And you say, oh, that didn't work. 1349 00:56:00,010 --> 00:56:02,440 And all the [INAUDIBLE] you leave in your code 1350 00:56:02,440 --> 00:56:05,120 shows all the crazy things you've tried, 1351 00:56:05,120 --> 00:56:06,878 and nothing happened. 1352 00:56:06,878 --> 00:56:08,170 So that's an interesting thing. 1353 00:56:08,170 --> 00:56:08,670 What else? 1354 00:56:15,160 --> 00:56:17,290 Anything else, a little bit differently 1355 00:56:17,290 --> 00:56:18,756 that you see on this one? 1356 00:56:25,100 --> 00:56:29,004 AUDIENCE: I was just wondering, are there other similar domains 1357 00:56:29,004 --> 00:56:35,267 that don't have something like this [INAUDIBLE]? 1358 00:56:35,267 --> 00:56:37,100 SAMAN AMARASIGNHE: So interesting question-- 1359 00:56:37,100 --> 00:56:41,390 are there any other domains that don't have something like this? 1360 00:56:41,390 --> 00:56:43,580 People are working on similar things 1361 00:56:43,580 --> 00:56:44,950 to machine learning these days. 1362 00:56:44,950 --> 00:56:48,380 That seems to be their domain, and TensorFlow, 1363 00:56:48,380 --> 00:56:51,350 and all those people are trying to do-- 1364 00:56:51,350 --> 00:56:54,533 to build systems like similar-- like frameworks 1365 00:56:54,533 --> 00:56:55,450 that you can get that. 1366 00:56:57,892 --> 00:56:58,850 I mean, that's a very-- 1367 00:56:58,850 --> 00:57:02,180 I think-- the way I have operated 1368 00:57:02,180 --> 00:57:04,210 is I go talk to people. 
1369 00:57:04,210 --> 00:57:07,760 And sometimes you find this poor graduate student, 1370 00:57:07,760 --> 00:57:11,220 or postdoc who want to do some research but spending all 1371 00:57:11,220 --> 00:57:13,882 of their time basically optimizing their piece of code, 1372 00:57:13,882 --> 00:57:15,590 because they can't get their performance. 1373 00:57:15,590 --> 00:57:17,150 And then that might be a good domain. 1374 00:57:17,150 --> 00:57:19,010 You find these people in physics. 1375 00:57:19,010 --> 00:57:21,040 You find these people in biology. 1376 00:57:21,040 --> 00:57:22,370 And I am actually talking to-- 1377 00:57:22,370 --> 00:57:24,500 because, for example, in biology, 1378 00:57:24,500 --> 00:57:26,930 a lot of this gene sequencing stuff is-- 1379 00:57:26,930 --> 00:57:29,300 there are very similar things you have to do. 1380 00:57:29,300 --> 00:57:31,250 But they seem to be spending all this time 1381 00:57:31,250 --> 00:57:33,180 writing the code, and then-- 1382 00:57:33,180 --> 00:57:35,390 mired in code complexity. 1383 00:57:35,390 --> 00:57:37,160 OK, can you do something in that? 1384 00:57:37,160 --> 00:57:39,950 I mean, the key thing is this is a good way to-- a nice thing 1385 00:57:39,950 --> 00:57:43,040 about MIT is there are good-- a lot of very smart people 1386 00:57:43,040 --> 00:57:46,115 in many different domains trying to push the state of the art. 1387 00:57:46,115 --> 00:57:48,740 And who's spending all this time cursing in front of a computer 1388 00:57:48,740 --> 00:57:51,590 program to get to a point they want to do, 1389 00:57:51,590 --> 00:57:54,470 because-- not because they don't know the algorithm, 1390 00:57:54,470 --> 00:57:57,350 because the amount of data they have to deal with-- 1391 00:57:57,350 --> 00:58:01,310 astronomy, I mean they get these multiple telescopes, 1392 00:58:01,310 --> 00:58:02,990 that deluge of data. 1393 00:58:02,990 --> 00:58:05,360 And most of the time, they know what they have to do. 
1394 00:58:05,360 --> 00:58:07,152 They just have to-- can't write the program 1395 00:58:07,152 --> 00:58:08,070 to do it fast enough. 1396 00:58:08,070 --> 00:58:10,470 So there might be domains like that, if you look at that. 1397 00:58:10,470 --> 00:58:12,877 And there might be domains from application and domains 1398 00:58:12,877 --> 00:58:13,460 from patterns. 1399 00:58:13,460 --> 00:58:15,350 Like sparse matrices or graphs are 1400 00:58:15,350 --> 00:58:18,410 patterns, which-- or not only on a single application. 1401 00:58:18,410 --> 00:58:20,340 I mean, it works in multiple places. 1402 00:58:20,340 --> 00:58:21,590 There might be other patterns. 1403 00:58:21,590 --> 00:58:24,167 Say, this is-- if you want to do research, 1404 00:58:24,167 --> 00:58:26,250 this might be interesting piece of doing research. 1405 00:58:26,250 --> 00:58:31,070 And I have spent my life finding different domains and a bunch 1406 00:58:31,070 --> 00:58:34,160 of people that spend their lifetime just hand hacking 1407 00:58:34,160 --> 00:58:37,400 things and telling them, OK, let me see if we 1408 00:58:37,400 --> 00:58:41,220 can do some nice abstraction. 1409 00:58:41,220 --> 00:58:44,870 Anything that you guys found that's interesting? 1410 00:58:44,870 --> 00:58:48,410 So to both of them, what are-- what's 1411 00:58:48,410 --> 00:58:52,265 the space that they operated on to optimize programs? 1412 00:58:59,195 --> 00:59:00,660 AUDIENCE: [INAUDIBLE]. 1413 00:59:00,660 --> 00:59:01,350 SAMAN AMARASIGNHE: Done for me, no. 1414 00:59:01,350 --> 00:59:03,160 What I'm saying is, what are the things 1415 00:59:03,160 --> 00:59:04,630 that you're trying to optimize? 1416 00:59:04,630 --> 00:59:07,630 There's a nice space of three different things-- 1417 00:59:07,630 --> 00:59:11,470 parallelism, locality, and redundant work. 1418 00:59:11,470 --> 00:59:14,620 My feeling is, as you go as a performance engineer, that's 1419 00:59:14,620 --> 00:59:16,785 going to be your life. 
1420 00:59:16,785 --> 00:59:18,160 If I add additional things, there 1421 00:59:18,160 --> 00:59:20,827 might be algorithmic things that completely get rid of work. 1422 00:59:20,827 --> 00:59:22,690 But most of the time, we are-- 1423 00:59:22,690 --> 00:59:24,490 all of you will be [INAUDIBLE] performance 1424 00:59:24,490 --> 00:59:28,100 will be working on some kind of multi-core, vector, GPU type 1425 00:59:28,100 --> 00:59:28,600 units. 1426 00:59:28,600 --> 00:59:30,220 You have to get parallelism. 1427 00:59:30,220 --> 00:59:32,380 So getting parallelism is important. 1428 00:59:32,380 --> 00:59:35,175 But then, if you don't have locality, it doesn't matter. 1429 00:59:35,175 --> 00:59:37,300 Because most of the time you're waiting to get data 1430 00:59:37,300 --> 00:59:38,508 from all the way from memory. 1431 00:59:38,508 --> 00:59:40,090 So you have to get good locality. 1432 00:59:40,090 --> 00:59:41,800 And then more-- a lot of times you 1433 00:59:41,800 --> 00:59:45,170 can do that really well if you do some extra computation. 1434 00:59:45,170 --> 00:59:47,410 But if you do too many extra things, that's going to, 1435 00:59:47,410 --> 00:59:48,800 oh well, that's not going to help you. 1436 00:59:48,800 --> 00:59:50,650 So it's all about playing the distribution. 1437 00:59:50,650 --> 00:59:51,790 You've got a final project. 1438 00:59:51,790 --> 00:59:53,330 That's exactly what you're going to do. 1439 00:59:53,330 --> 00:59:55,288 You might say, ah, if I can do some extra work, 1440 00:59:55,288 --> 00:59:56,590 OK, I can do this faster. 1441 00:59:56,590 --> 00:59:59,950 But oops, no, this extra pre-compute pass, or whatever, 1442 00:59:59,950 --> 01:00:00,490 it's not-- 1443 01:00:00,490 --> 01:00:01,870 I can't amortize the cost. 1444 01:00:01,870 --> 01:00:07,330 So there are these three things that you're trading off there. 1445 01:00:07,330 --> 01:00:10,040 So that's one interesting thing. 
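The extra-work tradeoff described above can be sketched with a toy example. This example and its names are illustrative only, not from the lecture: a prefix-sum table is an "extra pre-compute pass," and it only pays off if enough queries amortize its cost.

```python
# Toy illustration (not from the lecture) of trading extra computation
# for speed: precompute a prefix-sum table versus recomputing each query.

def range_sum_naive(data, queries):
    # No extra work: recompute each range sum from scratch, O(n) per query.
    return [sum(data[lo:hi]) for lo, hi in queries]

def range_sum_precomputed(data, queries):
    # Extra pre-compute pass: build prefix sums once (O(n)), then each
    # query is O(1) -- and scanning one small table is cache-friendly.
    prefix = [0]
    for x in data:
        prefix.append(prefix[-1] + x)
    return [prefix[hi] - prefix[lo] for lo, hi in queries]

data = list(range(1000))
queries = [(i, i + 100) for i in range(0, 900, 10)]
assert range_sum_naive(data, queries) == range_sum_precomputed(data, queries)
```

With a single query, the pre-compute pass can't amortize its cost; with many queries it wins. That is exactly the "can I amortize the cost?" question the lecture raises.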
1446 01:00:10,040 --> 01:00:15,150 Another thing is we made it available for the programmers 1447 01:00:15,150 --> 01:00:18,220 to do this scheduling language. 1448 01:00:18,220 --> 01:00:20,685 But can you make it-- 1449 01:00:20,685 --> 01:00:22,060 can you think of a way to make it 1450 01:00:22,060 --> 01:00:24,230 a little bit easier for programmers 1451 01:00:24,230 --> 01:00:25,630 than doing a scheduling language? 1452 01:00:25,630 --> 01:00:27,038 What can I do? 1453 01:00:27,038 --> 01:00:29,080 What's the nice thing about scheduling languages? 1454 01:00:31,950 --> 01:00:34,090 It's very simple. 1455 01:00:34,090 --> 01:00:36,218 It has a very simple pattern. 1456 01:00:36,218 --> 01:00:39,483 AUDIENCE: [INAUDIBLE] 1457 01:00:39,483 --> 01:00:41,650 SAMAN AMARASIGNHE: Yeah, I mean, that's-- the number 1458 01:00:41,650 --> 01:00:42,340 of options-- 1459 01:00:42,340 --> 01:00:44,140 it's not like you can write any program. 1460 01:00:44,140 --> 01:00:47,480 There are certain things you can do in the schedule. 1461 01:00:47,480 --> 01:00:51,040 So if you know that space, you can sort of 1462 01:00:51,040 --> 01:00:52,790 do that search smartly. 1463 01:00:52,790 --> 01:00:55,545 What else can we do with it? 1464 01:00:55,545 --> 01:00:56,735 AUDIENCE: Test them all. 1465 01:00:56,735 --> 01:00:58,110 SAMAN AMARASIGNHE: Test them all. 1466 01:00:58,110 --> 01:00:59,270 That's one approach there. 1467 01:00:59,270 --> 01:01:01,520 AUDIENCE: Use autotuning, trying to find [INAUDIBLE]. 1468 01:01:01,520 --> 01:01:03,187 SAMAN AMARASIGNHE: We can do autotuning. 1469 01:01:03,187 --> 01:01:07,160 So that switches into the autotuning part of this talk. 1470 01:01:07,160 --> 01:01:10,200 So performance engineering basically, most of the time, 1471 01:01:10,200 --> 01:01:13,350 is finding the right values for these crazy things. 
1472 01:01:13,350 --> 01:01:16,560 Like you start looking at, I think, probably 1473 01:01:16,560 --> 01:01:18,840 as Charles talked about, these voodoo parameters, 1474 01:01:18,840 --> 01:01:20,880 like, OK, what's the right block size? 1475 01:01:20,880 --> 01:01:23,220 And it has a big impact, finding that. 1476 01:01:23,220 --> 01:01:25,112 In your memory allocation project, 1477 01:01:25,112 --> 01:01:26,570 you had to find the right strategy, 1478 01:01:26,570 --> 01:01:27,850 the right memory allocation. 1479 01:01:27,850 --> 01:01:29,930 You searched through a bunch of these things. 1480 01:01:29,930 --> 01:01:33,600 In the GCC compiler, there are, I 1481 01:01:33,600 --> 01:01:38,400 think, about 400 different flags for GCC. 1482 01:01:38,400 --> 01:01:42,270 And you can actually get a factor of four in performance 1483 01:01:42,270 --> 01:01:47,330 by having a [INAUDIBLE] 200 flags into GCC. 1484 01:01:47,330 --> 01:01:48,960 It's crazy. 1485 01:01:48,960 --> 01:01:52,297 And that 200 flags is not the same for every program. 1486 01:01:52,297 --> 01:01:54,630 And, of course, some programs will crash in some places. 1487 01:01:54,630 --> 01:01:57,330 Most of the time, it'll slow down or speed up. 1488 01:01:57,330 --> 01:02:00,850 So you can just give all the flags of GCC and autotune that. 1489 01:02:00,850 --> 01:02:03,330 And you can get a factor of two to four in performance in there. 1490 01:02:03,330 --> 01:02:04,170 It's just crazy. 1491 01:02:04,170 --> 01:02:06,810 And then, because it can do weird things in there, -O1, 1492 01:02:06,810 --> 01:02:09,450 -O2, -O3 will only do a certain amount. 1493 01:02:09,450 --> 01:02:11,272 So -O3 doesn't-- it's not always right. 1494 01:02:11,272 --> 01:02:12,480 You can try all these things. 1495 01:02:12,480 --> 01:02:14,400 So you can tune that. 1496 01:02:14,400 --> 01:02:18,210 And scheduling Halide, scheduling GraphIt, 1497 01:02:18,210 --> 01:02:21,620 all these things can be autotuned. 
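A hypothetical sketch of the flag tuning described above. The flag names below are real GCC flags, but the cost function is a made-up stand-in: a real tuner would compile the program with each flag setting and time the resulting binary end to end.

```python
import random

# Illustrative sketch (not OpenTuner itself): sample random on/off
# settings for a handful of GCC-style flags and keep the best seen.
FLAGS = ["-funroll-loops", "-ftree-vectorize",
         "-fomit-frame-pointer", "-finline-functions"]

def toy_cost(enabled):
    # Stand-in for "compile and run": pretend vectorization helps a lot
    # and unrolling helps a little. Lower is better.
    cost = 10.0
    if "-ftree-vectorize" in enabled:
        cost -= 4.0
    if "-funroll-loops" in enabled:
        cost -= 1.0
    return cost

def random_search(cost_fn, trials=100, seed=0):
    # The autotuning loop: pick a random flag subset, evaluate, repeat.
    rng = random.Random(seed)
    best, best_cost = [], cost_fn(frozenset())
    for _ in range(trials):
        enabled = [f for f in FLAGS if rng.random() < 0.5]
        cost = cost_fn(frozenset(enabled))
        if cost < best_cost:
            best, best_cost = enabled, cost
    return best, best_cost

best, cost = random_search(toy_cost)
assert "-ftree-vectorize" in best  # the flag that matters gets found
```

With 400 real flags the space is 2^400, so exhaustive search is out; random sampling like this, or OpenTuner's smarter search, is the only practical option.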
1498 01:02:21,620 --> 01:02:25,500 So before autotuning, when you have a large search space, 1499 01:02:25,500 --> 01:02:27,560 what do we normally do? 1500 01:02:27,560 --> 01:02:31,050 The thing that when we think we are smart, what we do 1501 01:02:31,050 --> 01:02:32,310 is we build models. 1502 01:02:32,310 --> 01:02:33,558 We have model for a cache. 1503 01:02:33,558 --> 01:02:35,100 And say, can we understand the cache? 1504 01:02:35,100 --> 01:02:37,080 We have all these nice models in here. 1505 01:02:37,080 --> 01:02:39,160 And using the model, I can predict, ah ha, 1506 01:02:39,160 --> 01:02:41,868 this is the right block size. 1507 01:02:41,868 --> 01:02:43,910 So what's the problem when you try to do a model? 1508 01:02:49,333 --> 01:02:52,300 AUDIENCE: Sometime it doesn't work [INAUDIBLE].. 1509 01:02:52,300 --> 01:02:54,100 SAMAN AMARASIGNHE: Exactly, because most 1510 01:02:54,100 --> 01:02:57,818 of the time, when you try to do a model, you have to abstract. 1511 01:02:57,818 --> 01:02:59,860 And most of the time, you-- what you abstract out 1512 01:02:59,860 --> 01:03:01,840 might be the most important part of the darn thing 1513 01:03:01,840 --> 01:03:02,840 that we didn't consider. 1514 01:03:02,840 --> 01:03:04,450 So we built a model for cache. 1515 01:03:04,450 --> 01:03:06,587 But oops, I could-- didn't figure out pages. 1516 01:03:06,587 --> 01:03:08,170 Well, the pages made a big difference. 1517 01:03:08,170 --> 01:03:09,940 So there might be things in real life 1518 01:03:09,940 --> 01:03:12,363 that matters that didn't fit into your model, 1519 01:03:12,363 --> 01:03:14,530 or you didn't know it needed to be fit into a model. 1520 01:03:14,530 --> 01:03:16,030 If you try to put everything, that's 1521 01:03:16,030 --> 01:03:17,260 too complicated of a model. 1522 01:03:17,260 --> 01:03:18,850 So you abstract something out. 1523 01:03:18,850 --> 01:03:22,830 You can say, I have optimal result for this model. 
1524 01:03:22,830 --> 01:03:25,450 But that optimal result might be way off 1525 01:03:25,450 --> 01:03:27,070 from the simple result you can get. 1526 01:03:27,070 --> 01:03:29,785 Because the things that you didn't put into 1527 01:03:29,785 --> 01:03:31,160 the model are the important ones. 1528 01:03:31,160 --> 01:03:32,500 So the model doesn't work. 1529 01:03:32,500 --> 01:03:34,990 The next thing you do is a heuristic-based thing. 1530 01:03:34,990 --> 01:03:37,780 This is where these old people come and say, 1531 01:03:37,780 --> 01:03:38,697 I know how to do this. 1532 01:03:38,697 --> 01:03:41,155 In order to do this, you need to do this thing, this thing, 1533 01:03:41,155 --> 01:03:41,860 this thing. 1534 01:03:41,860 --> 01:03:48,460 You can come up with some kind of the old grandmother's 1535 01:03:48,460 --> 01:03:49,420 solution type thing. 1536 01:03:49,420 --> 01:03:51,550 There are certain things that will always work. 1537 01:03:51,550 --> 01:03:52,870 And you hardcode them. 1538 01:03:52,870 --> 01:03:57,790 So you can say if the matrix dimension is more than 1,000, 1539 01:03:57,790 --> 01:04:02,133 always go to blocking, or some kind of rules like that. 1540 01:04:02,133 --> 01:04:03,550 These rules work most of the time. 1541 01:04:03,550 --> 01:04:08,820 But obviously, there are certain cases where the rules don't work. 1542 01:04:08,820 --> 01:04:12,095 Worse, those rules might be set for a certain machine, 1543 01:04:12,095 --> 01:04:13,720 certain architecture, all those things. 1544 01:04:13,720 --> 01:04:14,720 I'll give you a story. 1545 01:04:14,720 --> 01:04:25,330 So GCC has this fast table sort routine. 1546 01:04:25,330 --> 01:04:29,380 So the fast table sort routine says sort 1547 01:04:29,380 --> 01:04:31,930 using a parallel quicksort. 1548 01:04:31,930 --> 01:04:37,030 And when the number goes below 16, switch to insertion sort. 1549 01:04:37,030 --> 01:04:38,610 It's hardcoded in GCC. 
1550 01:04:38,610 --> 01:04:41,590 It's like, wow, some amazing person figured 1551 01:04:41,590 --> 01:04:44,020 out 16, this amazing number, to switch 1552 01:04:44,020 --> 01:04:45,890 from parallel quicksort to insertion sort. 1553 01:04:45,890 --> 01:04:47,473 So we are trying to figure out, what's 1554 01:04:47,473 --> 01:04:49,930 the profoundness of this number? 1555 01:04:49,930 --> 01:04:54,220 The profoundness of this number is somewhere around 1995, 1556 01:04:54,220 --> 01:04:55,900 when this code was released. 1557 01:04:55,900 --> 01:04:58,960 In those machines, that was the right number. 1558 01:04:58,960 --> 01:05:02,170 That 16 was a really good number to switch from parallelism 1559 01:05:02,170 --> 01:05:04,840 to doing that, because of the cache size, stuff like that. 1560 01:05:04,840 --> 01:05:07,030 But that 16 survived from 1995 [INAUDIBLE] 1561 01:05:07,030 --> 01:05:08,200 even to date there. 1562 01:05:08,200 --> 01:05:12,100 Today that number should be like 500. 1563 01:05:12,100 --> 01:05:13,750 But it's in there, because somebody 1564 01:05:13,750 --> 01:05:15,792 thought 16 is right, and it's hardcoded in there. 1565 01:05:15,792 --> 01:05:17,080 It didn't change. 1566 01:05:17,080 --> 01:05:20,275 So there are a lot of things in compiler code like that, 1567 01:05:20,275 --> 01:05:22,720 that some programmer said, 1568 01:05:22,720 --> 01:05:23,830 I know what works here. 1569 01:05:23,830 --> 01:05:24,740 This fits in there. 1570 01:05:24,740 --> 01:05:25,660 You put it in there. 1571 01:05:25,660 --> 01:05:27,070 But there's no rhyme or reason. 1572 01:05:27,070 --> 01:05:29,140 Because at that time, they had a reason. 1573 01:05:29,140 --> 01:05:30,430 But it doesn't scale. 1574 01:05:30,430 --> 01:05:35,115 So a lot of these heuristics get out of focus very fast. 1575 01:05:35,115 --> 01:05:36,490 And there's no theory behind this 1576 01:05:36,490 --> 01:05:37,927 to say, now, how do you update? 
1577 01:05:37,927 --> 01:05:39,760 You had to ask people, why did you put that? 1578 01:05:39,760 --> 01:05:41,510 And then it's, oh yeah, because my machine 1579 01:05:41,510 --> 01:05:43,400 has 32 kilobytes of cache. 1580 01:05:43,400 --> 01:05:45,550 It's like, oh, OK, that's a different machine 1581 01:05:45,550 --> 01:05:47,810 than what we have today. 1582 01:05:47,810 --> 01:05:49,360 So that's the problem in here. 1583 01:05:49,360 --> 01:05:51,992 And then the other thing is you can do exhaustive search. 1584 01:05:51,992 --> 01:05:53,950 You can say, OK, I'll try every possible thing. 1585 01:05:53,950 --> 01:05:56,740 The problem here is sometimes my search space 1586 01:05:56,740 --> 01:05:59,650 is 10 to the power 10. 1587 01:05:59,650 --> 01:06:03,820 You don't have enough seconds in your lifetime 1588 01:06:03,820 --> 01:06:04,840 to do that search. 1589 01:06:04,840 --> 01:06:06,132 So it would be too complicated. 1590 01:06:06,132 --> 01:06:09,310 And that's where the autotuner comes in. 1591 01:06:09,310 --> 01:06:12,060 So-- oh, OK, actually I have a little bit more slides here. 1592 01:06:12,060 --> 01:06:13,887 So the model based solution is you come up 1593 01:06:13,887 --> 01:06:15,970 with this comprehensive model, like a cache model, 1594 01:06:15,970 --> 01:06:17,200 or something like that. 1595 01:06:17,200 --> 01:06:18,865 And you do that. 1596 01:06:18,865 --> 01:06:23,500 And you can exactly show what's right for the optimal solution 1597 01:06:23,500 --> 01:06:24,460 in here. 1598 01:06:24,460 --> 01:06:26,920 But the problem is it's hard to build models, 1599 01:06:26,920 --> 01:06:30,880 you cannot model everything, and most of the time what's missing 1600 01:06:30,880 --> 01:06:33,592 from the model is the most important thing. 1601 01:06:33,592 --> 01:06:35,050 Heuristic-based things are the rule 1602 01:06:35,050 --> 01:06:37,342 of thumb kind of solution that you come up with, and 1603 01:06:37,342 --> 01:06:39,220 it's hardcoded in there. 
1604 01:06:39,220 --> 01:06:41,110 And it's very simple and easy. 1605 01:06:41,110 --> 01:06:43,660 It works most of the time, if you get it right. 1606 01:06:43,660 --> 01:06:45,130 But the problem is it's too simplistic. 1607 01:06:45,130 --> 01:06:46,060 It doesn't scale. 1608 01:06:46,060 --> 01:06:50,400 It doesn't stand the test of time, most of the time in here. 1609 01:06:50,400 --> 01:06:52,640 An exhaustive search is great. 1610 01:06:52,640 --> 01:06:56,980 But the problem is there are just way too many 1611 01:06:56,980 --> 01:07:02,290 possibilities to search in here, too big of a search space. 1612 01:07:02,290 --> 01:07:03,440 You can't do that. 1613 01:07:03,440 --> 01:07:06,760 So this is where you want to prune the search space. 1614 01:07:06,760 --> 01:07:08,560 And for the pruning, the best way to do that 1615 01:07:08,560 --> 01:07:11,290 is basically to use autotuning. 1616 01:07:11,290 --> 01:07:13,870 So with autotuning, what you can do is you can define the space 1617 01:07:13,870 --> 01:07:17,980 of acceptable values nicely, choose a value at random-- 1618 01:07:17,980 --> 01:07:20,980 that's what the system will do, try it out there-- 1619 01:07:20,980 --> 01:07:23,230 and evaluate the performance of that value end to end. 1620 01:07:23,230 --> 01:07:24,340 Because end to end matters. 1621 01:07:24,340 --> 01:07:26,340 Because if you try to predict, most of the time, 1622 01:07:26,340 --> 01:07:27,650 it might not work. 1623 01:07:27,650 --> 01:07:33,820 And if it satisfies the performance that you need, you're done. 1624 01:07:33,820 --> 01:07:36,030 Otherwise, choose a new value and iterate over there, 1625 01:07:36,030 --> 01:07:36,870 go to step three in there. 1626 01:07:36,870 --> 01:07:38,090 So this is the kind of thing. 
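The loop just described can be sketched in a few lines, applied to the earlier story's voodoo parameter: the insertion-sort cutoff of a hybrid quicksort. This is a toy sketch of mine, not GCC's or OpenTuner's actual code; the function names and the search space `range(1, 512)` are illustrative assumptions.

```python
import random
import time

# Hybrid quicksort: below the tunable cutoff, fall back to insertion sort.
def hybrid_sort(a, cutoff):
    if len(a) <= cutoff:
        for i in range(1, len(a)):          # insertion sort for small inputs
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a
    pivot = a[len(a) // 2]
    lt = [x for x in a if x < pivot]
    eq = [x for x in a if x == pivot]
    gt = [x for x in a if x > pivot]
    return hybrid_sort(lt, cutoff) + eq + hybrid_sort(gt, cutoff)

def bench(cutoff, n=2000):
    # End-to-end evaluation: time one whole sort with this cutoff.
    data = random.Random(0).sample(range(n * 10), n)
    start = time.perf_counter()
    hybrid_sort(list(data), cutoff)
    return time.perf_counter() - start

def autotune_cutoff(space=range(1, 512), trials=30, seed=1):
    # 1. define the space of acceptable values; 2. choose one at random;
    # 3. evaluate it end to end; 4. keep the best seen and iterate.
    rng = random.Random(seed)
    best, best_time = None, float("inf")
    for _ in range(trials):
        cutoff = rng.choice(list(space))
        t = bench(cutoff)
        if t < best_time:
            best, best_time = cutoff, t
    return best

best = autotune_cutoff()
assert hybrid_sort([3, 1, 2], best) == [1, 2, 3]
```

Re-running this on a new machine re-discovers the right cutoff for that machine, which is exactly what the hardcoded 16 from 1995 could not do.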
1627 01:07:38,090 --> 01:07:39,970 And what you have to do is you need 1628 01:07:39,970 --> 01:07:45,430 to have a system to figure out how to do this fast, basically 1629 01:07:45,430 --> 01:07:49,120 what space to search, when to think you're 1630 01:07:49,120 --> 01:07:52,220 done, how to go through the iterating loops through that. 1631 01:07:52,220 --> 01:07:55,250 So this is, in a cartoonish way, what 1632 01:07:55,250 --> 01:07:56,930 happens: you give a value candidate, 1633 01:07:56,930 --> 01:07:59,720 you compile the program, you run it with a bunch of data-- 1634 01:07:59,720 --> 01:08:00,620 you run it through the bunch, 1635 01:08:00,620 --> 01:08:01,870 otherwise you are overfitting. 1636 01:08:01,870 --> 01:08:03,110 You can't run it with one. 1637 01:08:03,110 --> 01:08:04,110 And you get the results. 1638 01:08:04,110 --> 01:08:05,900 And you get something like an average. 1639 01:08:05,900 --> 01:08:09,800 And you go through this loop in here. 1640 01:08:09,800 --> 01:08:13,100 And what OpenTuner has done is come up 1641 01:08:13,100 --> 01:08:14,910 with an ensemble of techniques. 1642 01:08:14,910 --> 01:08:18,319 So the idea there is when you're searching through a space, 1643 01:08:18,319 --> 01:08:21,560 you might be at the bottom of a hill of the space. 1644 01:08:21,560 --> 01:08:24,050 So what that means is there are certain values where, if you keep 1645 01:08:24,050 --> 01:08:25,580 improving the value, you are getting 1646 01:08:25,580 --> 01:08:26,740 better, and better, and better. 1647 01:08:26,740 --> 01:08:28,840 And at that time, something like a hill climber-- 1648 01:08:32,510 --> 01:08:35,548 a hill climber, or something like an [INAUDIBLE] hill climber 1649 01:08:35,548 --> 01:08:37,340 can actually give you the best performance. 1650 01:08:37,340 --> 01:08:38,673 You're going very fast in there. 
1651 01:08:38,673 --> 01:08:41,782 But when you get to the top of the hill, oops, 1652 01:08:41,782 --> 01:08:42,740 there's no place to go. 1653 01:08:42,740 --> 01:08:44,450 So then if you try to hill climb, 1654 01:08:44,450 --> 01:08:45,710 it's not going to be helpful. 1655 01:08:45,710 --> 01:08:47,335 So at that time, what you want to do 1656 01:08:47,335 --> 01:08:50,760 is do something like a random search in here. 1657 01:08:50,760 --> 01:08:55,240 So what this system, OpenTuner, will do 1658 01:08:55,240 --> 01:08:57,830 is basically test these techniques in there. 1659 01:08:57,830 --> 01:09:00,260 And if something is doing very well, 1660 01:09:00,260 --> 01:09:02,840 it will give it more time. 1661 01:09:02,840 --> 01:09:06,020 If not, what it will do is it will say, OK, look, 1662 01:09:06,020 --> 01:09:08,128 this technique is not working. 1663 01:09:08,128 --> 01:09:09,170 Let's try something else. 1664 01:09:09,170 --> 01:09:10,920 It'll basically allocate the time in here. 1665 01:09:10,920 --> 01:09:15,950 So it does this search much faster than you otherwise could. 1666 01:09:15,950 --> 01:09:18,319 So I want to finish this by showing what you need 1667 01:09:18,319 --> 01:09:20,463 for autotuning for GraphIt. 1668 01:09:20,463 --> 01:09:22,380 So we have an algorithm, and you have a schedule. 1669 01:09:22,380 --> 01:09:25,202 It's a pain to write this schedule. 1670 01:09:25,202 --> 01:09:26,910 In fact, there's a good [? interesting ?] 1671 01:09:26,910 --> 01:09:30,500 thing in-- when you do Halide, we 1672 01:09:30,500 --> 01:09:34,640 decided, OK, it should be similar to write the algorithm 1673 01:09:34,640 --> 01:09:36,290 for Halide and the schedule. 1674 01:09:36,290 --> 01:09:39,580 Google very fast realized many people won't use Halide. 1675 01:09:39,580 --> 01:09:42,740 And they-- after about two years, they had about hundreds 1676 01:09:42,740 --> 01:09:45,109 of programmers who can write the algorithm. 
1677 01:09:45,109 --> 01:09:46,880 But they only had five people who could 1678 01:09:46,880 --> 01:09:48,380 write the really good schedule. 1679 01:09:48,380 --> 01:09:49,880 To write a really good schedule, you 1680 01:09:49,880 --> 01:09:51,420 need to understand a little bit of the algorithm. 1681 01:09:51,420 --> 01:09:52,950 You need to understand a little bit of the architecture, 1682 01:09:52,950 --> 01:09:54,229 a little bit of everything. 1683 01:09:54,229 --> 01:09:56,860 And that's much harder for people to learn. 1684 01:09:56,860 --> 01:10:00,362 So getting the schedule right is not that easy. 1685 01:10:00,362 --> 01:10:01,820 And same thing in here, because you 1686 01:10:01,820 --> 01:10:05,050 need to understand a lot unless you do kind of random-- 1687 01:10:05,050 --> 01:10:08,330 you've got certain arbitrary, but to do it right, 1688 01:10:08,330 --> 01:10:10,130 you need to know a little bit more. 1689 01:10:10,130 --> 01:10:13,550 So what we can do is we can basically 1690 01:10:13,550 --> 01:10:17,142 give some idea about the graphs and some idea 1691 01:10:17,142 --> 01:10:18,350 about the algorithm in there. 1692 01:10:18,350 --> 01:10:19,892 We can autotune these things in there 1693 01:10:19,892 --> 01:10:23,120 and then generate the schedule. 1694 01:10:23,120 --> 01:10:26,300 And so what we found was to generate this schedule, if you 1695 01:10:26,300 --> 01:10:28,890 do exhaustive search, it runs for days. 1696 01:10:28,890 --> 01:10:31,580 But if you're using autotuner, OpenTuner, 1697 01:10:31,580 --> 01:10:34,820 you can find a really good schedule for-- 1698 01:10:34,820 --> 01:10:37,040 in less than two hours. 1699 01:10:37,040 --> 01:10:40,010 And, in fact, a few cases we found schedules 1700 01:10:40,010 --> 01:10:42,798 that run better than what we thought 1701 01:10:42,798 --> 01:10:44,090 was the best possible schedule. 
1702 01:10:44,090 --> 01:10:46,250 Because it was able to-- 1703 01:10:46,250 --> 01:10:50,960 because it was able to search much better than our intuition 1704 01:10:50,960 --> 01:10:51,860 would say in here. 1705 01:10:51,860 --> 01:10:53,960 And when-- and even if our intuition know it, 1706 01:10:53,960 --> 01:10:56,420 it has more time to try many different combinations 1707 01:10:56,420 --> 01:10:58,950 and trying something in-- come something better in here. 1708 01:10:58,950 --> 01:11:03,850 So that's all I have today.