All right, thank you all for coming. I really want to say thank you to AGU and of course to NASA for having me and for sponsoring this whole competition. I'm really excited to share this presentation, which is about GPU acceleration for visualization of NASA Earth science data and for machine learning.

Earth scientists today have a problem they never really thought they'd have, which is simply having too much data. Right now there are about 50 NASA Earth science satellites in orbit, each producing terabytes of data every single day, and by 2022 the amount of new data being produced is expected to increase to about a hundred and fifty terabytes a day. Obviously, even though having more data is great, it doesn't mean very much unless people can understand it, find patterns in it, and really grasp what their data says.

The process of actually visualizing data, which I think is essential to understanding it, is already quite complicated. Because this data is so big, a lot has to go into making it small and manageable enough to visualize. Originally the data is unstructured: it needs to be gridded, it needs to be projected onto a map, and you need to generate tile pyramids, which are basically lower-resolution versions of the original data that can be viewed and manipulated in a web browser. This process works, but it's quite slow, it takes a lot of storage, and it's not interactive, meaning that if you want to change details of the visualization, view it with a different color map or a different scale, apply algorithms, or resample it, that becomes very, very difficult. At the same time, there are a lot of questions you'd like to ask about your data that are more quantitative in nature than visualization: time series, exploratory data analysis with clustering, or even applying machine learning algorithms. And because the data can be so large right now, gigabytes and gigabytes, these algorithms tend to be very slow; if you want to see a time series that spans years and years, it can take hours to compute. I think one of the most essential things for really understanding data is having an intuitive data exploration process where, as a scientist, you form questions about your data.
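To make the tile-pyramid pre-processing described above concrete, here is a minimal Python/NumPy sketch of the idea: repeatedly downsample a gridded field and cut each level into fixed-size tiles. The array shape, tile size, and function name are hypothetical, not the actual pipeline behind GIBS or this prototype.

```python
import numpy as np

def build_tile_pyramid(grid, tile_size=256, levels=5):
    """Cut a gridded field into fixed-size tiles at successively halved
    resolutions (hypothetical helper, for illustration only)."""
    pyramid = []
    level = grid
    for _ in range(levels):
        # Cut the current level into tile_size x tile_size tiles.
        tiles = [
            level[i:i + tile_size, j:j + tile_size]
            for i in range(0, level.shape[0], tile_size)
            for j in range(0, level.shape[1], tile_size)
        ]
        pyramid.append(tiles)
        # 2x2 block-average to produce the next, lower-resolution level.
        h, w = (level.shape[0] // 2) * 2, (level.shape[1] // 2) * 2
        level = level[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return pyramid

# A 4096 x 8192 global grid yields every tile at every zoom level up front.
pyramid = build_tile_pyramid(np.random.rand(4096, 8192).astype(np.float32))
```

Every level and every tile here has to be computed and stored ahead of time, which is why changing the color map, scale, or resampling later means regenerating everything.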
You ask: is there a trend here, in this small part of the world? Do I see patterns, do I see trends? You should be able to ask those questions and have them answered intuitively, to explore and manipulate the data in a really intuitive, flexible, and powerful way.

Right now, current methods are all CPU-based, which means they use a very traditional way of doing computation that iterates over every single data point in an image or a dataset and applies some kind of algorithm. GPUs are a kind of hardware that lets you apply algorithms in parallel across all of the data points in your dataset. That means that if you want to apply a color map, reproject your data, or ask questions like what is the mean, or how does this change over time, all of that can be done almost instantaneously, because the computations are performed in parallel on all of your data. These technologies are widely used today in machine learning and in computer graphics, but they have not been adopted as widely in the Earth science community.

Fundamentally, that's what this presentation is about: it presents a new GPU visualization prototype, a kind of software library that allows you to work directly with raw Earth science data.
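As a rough sketch of what "applying an operation in parallel across all data points" looks like in practice, here is one way to do it with CuPy, a GPU array library for Python. This is an illustrative example under my own assumptions, not the prototype's code; the array values and the simple two-color ramp are stand-ins.

```python
import cupy as cp

# A gridded sea surface temperature field resident on the GPU (illustrative values).
sst = cp.random.uniform(-2.0, 35.0, size=(4096, 8192)).astype(cp.float32)

# Elementwise work runs in parallel over every pixel: normalize to [0, 1] and
# push through a simple blue-to-red ramp standing in for a real color map.
lo, hi = sst.min(), sst.max()
norm = (sst - lo) / (hi - lo)
rgb = cp.stack([norm, cp.zeros_like(norm), 1.0 - norm], axis=-1)

# Reductions such as mean, max, and standard deviation are single parallel passes,
# so "what is the mean of the current view?" comes back almost immediately.
stats = {"mean": float(sst.mean()), "max": float(sst.max()), "std": float(sst.std())}
```

Swapping the color map, the scale, or the statistic is just a different elementwise expression or reduction over the same array already sitting in GPU memory.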
You have new data that has just come down from a satellite; it's unstructured, it hasn't been gridded, it hasn't been processed. You can simply load it and grid the data, project it, rescale it, choose for instance a logarithmic scale instead of a linear scale, apply recent machine learning algorithms for visualization or for data analysis and correlation, and all of this can be done extremely quickly and extremely flexibly using this new technology. Things that are currently either impossible or take a very long time with technologies like the Science Data Analytics Platform or NASA GIBS can be done almost instantly, and that speed is what I think makes data exploration really powerful and really easy to do.

Here's an example using NASA MUR sea surface temperature data. All of this imagery is generated dynamically, on the fly, at multiple resolutions, and you can view it in a web browser. These multi-resolution tiles, which before would have to be pre-generated and pre-processed, taking hours and additional storage, can be generated in real time on the GPU. Tiles can also be generated directly from L1 or L2 data, meaning the data doesn't even need to be pre-processed, and that means you can actually reduce the amount of storage you need. Here, for instance, you can view the data on a logarithmic scale, which is completely impossible today with data that has already been gridded and had a color map applied, because there you're working with binned data instead of the raw scientific data. So by working directly with raw data, you're able to keep full scientific fidelity while you're building these visualizations. Here's another example where you zoom in on a region of interest, and because you're working directly with the raw data, you can view, for instance, the mean, the max, the standard deviation, and the variance of the region in the current view. Simply by zooming in on a region of the map, you're able to ask how the temperature in this region differs from other parts of the world and how it changes. Here, also, you can apply computer vision filters; this is a Sobel filter, which shows gradients, basically regions in this ocean where there are large temperature gradients, and this can be generated from the raw data just as easily as any other kind of NASA data.
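The Sobel filter mentioned here is a standard computer-vision operation, and running it on data that is already on the GPU is a small amount of code with CuPy's ndimage routines. A minimal sketch, assuming a gridded temperature tile; the variable names and sizes are illustrative, not the prototype's implementation.

```python
import cupy as cp
from cupyx.scipy import ndimage

def gradient_magnitude(field):
    """Sobel gradient magnitude of a 2-D field, highlighting regions
    (for example, ocean fronts) with strong temperature gradients."""
    gx = ndimage.sobel(field, axis=0, mode="nearest")
    gy = ndimage.sobel(field, axis=1, mode="nearest")
    return cp.hypot(gx, gy)

# Illustrative 512 x 512 sea surface temperature tile already resident on the GPU.
sst_tile = cp.random.uniform(10.0, 30.0, size=(512, 512)).astype(cp.float32)
edges = gradient_magnitude(sst_tile)
```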
So by working directly with raw data, you really have this advantage of being able to manipulate and visualize your data in real time, flexibly and intuitively.

As an example of the way this can be used in Earth science, we're looking at some NASA ocean data: we have the MUR sea surface temperature, the AVHRR temperature anomaly, and a variety of ECCO products, and we can ask questions about these. Are there correlations between these products? How do they change over time? Is there variance, and are there important differences between different parts of the world? Simply by visualizing the data conveniently, we can answer these questions almost instantly. Here is an example: we can ask questions about the El Niño and La Niña effects. This is part of a climate cycle in the southern Pacific Ocean, off the coast of Ecuador; it's an approximately four-year cycle where this band off the coast of South America goes through higher-than-average and lower-than-average temperatures. In 2015 there was a very intense El Niño period where the temperature increased by as much as 3 degrees Celsius compared to the average. The advantage of this kind of technology is that you can actually generate this anomaly data, that is, how much the temperature differs from historic patterns, without precomputing it. You can simply look at past data, look at the current data, and deseasonalize, removing those historic patterns, so you can see this dynamic data without computing or storing the anomaly data separately. Here, for instance, we have the AVHRR sea surface temperature data; we're off the coast of South America right now, and all of this data is dynamically generated, so you aren't using gigabytes and gigabytes of additional storage. You can simply zoom in and out dynamically, like you could with a NASA product like Worldview; you can see how it looks, you can experiment, you can move around the map and work really flexibly. Here, for instance, we're just jumping day after day, and all of this imagery is generated immediately, within a hundred milliseconds, working from the raw unstructured data. So rather than taking hours to pre-process and hours to figure out how to answer a question, with GPUs you're able to just ask a question and then have an answer. I think that's really essential for visualization.
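The deseasonalizing step described here can be expressed as little more than subtracting a same-day-of-year average of past fields. Here is a minimal CuPy sketch of that idea; the climatology definition, array shapes, and function name are my own simplifications, not the talk's actual method.

```python
import cupy as cp

def sst_anomaly(daily_sst, day_of_year, target_index):
    """Anomaly for one day, computed on the fly: subtract the mean of every
    field sharing the same day of year (a crude stand-in climatology)."""
    same_day = day_of_year == day_of_year[target_index]
    climatology = daily_sst[same_day].mean(axis=0)
    return daily_sst[target_index] - climatology

# Illustrative stack: about six years of daily global fields, shaped (time, lat, lon).
daily_sst = cp.random.uniform(-2.0, 35.0, size=(2190, 180, 360)).astype(cp.float32)
day_of_year = cp.arange(2190) % 365
anomaly = sst_anomaly(daily_sst, day_of_year, target_index=2000)
```

Nothing is precomputed or stored: the anomaly for any day, any region, and any reference period can be produced on demand from the raw fields.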
Sometimes you also have more quantitative questions: you want to ask how something has changed over time. Simply by selecting a region of interest, you immediately have a time series that spans, in this case, about four years. You can zoom in and immediately see that in the later part of 2015, in this specific region of interest, there is a dramatic increase in the sea surface temperature anomaly. Again, all of this is generated just from the raw sea surface temperature data: we've generated the anomaly data, we're able to zoom in and out, you're able to select regions and ask questions, like how dramatic is this El Niño effect, and you get those answers very quickly. So I think these technologies are really a remarkable way of improving the scientific process.

Finally, GPUs are most widely used for machine learning, which has become a huge field these days, with new breakthroughs every day in computer vision, anomaly detection, language, and video. All of these algorithms are implemented on the GPU and can be computed extremely quickly there, and because the GPU is already part of the visualization pipeline, new algorithms can just be dropped in and applied as part of the process. Here, for instance, you have in the leftmost column MODIS true-color data, which is optical imagery taken of the whole world almost every day, but there are parts of the world where there actually is no data because a satellite never passed directly over that region. The remarkable thing is that, looking only at the leftmost column, you're able to predict the kinds of values that would occur had you had a satellite over that region, and you're able to fill them in as part of the visualization process. So if you want to see a beautiful view of the entire world, tile by tile, in less than 100 milliseconds you can apply these state-of-the-art machine learning algorithms, fill in the details, and view the whole world with the inferred data. You can also have more scientifically relevant applications where you infer missing values in data or find anomalies; for instance, if you have access to data from depths in the ocean, you can fill in those gaps using machine learning. I think having this unified pipeline that combines visualization, analysis, and machine learning is a really essential and powerful combination of these technologies, and it makes visualization very useful.
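Because the tiles are already produced on the GPU, a model that fills in missing pixels can be slotted into the same path. Here is a minimal sketch of that plumbing; the `predict` argument is a placeholder for whatever inpainting model is available, and the mean-fill stand-in below exists only to make the example runnable. It is not the algorithm the talk refers to.

```python
import cupy as cp

def fill_missing(tile, predict):
    """Replace missing pixels (NaNs) in a tile with model-predicted values,
    then hand the completed tile to the normal GPU rendering path."""
    gaps = cp.isnan(tile)
    if not bool(gaps.any()):
        return tile
    estimate = predict(tile)  # the model's estimate of the complete field
    return cp.where(gaps, estimate, tile)

def mean_fill(tile):
    # Trivial stand-in "model": predict every pixel as the mean of observed pixels.
    return cp.zeros_like(tile) + cp.nanmean(tile)

# Simulate a 256 x 256 tile with a swath gap the satellite never covered.
tile = cp.random.uniform(0.0, 1.0, size=(256, 256)).astype(cp.float32)
tile[64:128, 64:128] = cp.nan
completed = fill_missing(tile, mean_fill)
```

Swapping `mean_fill` for a trained inpainting network is the "drop new algorithms into the pipeline" step described above.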
Finally, and very quickly: this can all be done in the cloud, so you don't actually have to have a GPU yourself; you can simply go to a website and view all of this data dynamically. In terms of speed and performance, this can be dramatically faster than anything that exists currently: generating individual tiles can be as much as a hundred times faster, and analytics can be as much as a thousand times faster using GPU acceleration. So this whole pipeline is flexible, it's powerful, and I think it lets you answer questions that are otherwise very difficult, or make observations that you might never have thought to make, because you're able to interact seamlessly and fluidly and answer questions almost as soon as you have them. I want to thank JPL for having me for the summer, where I did a lot of this work, and of course NASA and AGU for sponsoring this competition, this award, and this wonderful conference. Thank you very much.