All right, thank you all for coming. I really want to say thank you to AGU and of course to NASA for having me and for sponsoring this whole competition. I'm really excited to share this presentation, which is about GPU acceleration for visualization of NASA Earth science data and for machine learning.

Earth scientists today have a problem they never really thought they'd have, which is simply having too much data. Right now there are about 50 NASA Earth science satellites in orbit, each producing terabytes of data every single day, and by 2022 the amount of new data being produced is expected to increase to about a hundred and fifty terabytes a day. Obviously, even though having more data is great, it doesn't mean very much unless people can understand it, find patterns in it, and really grasp what their data says.

The process of actually visualizing data, which I think is essential to understanding it, is already quite complicated. Because this data is so big, a lot has to go into making it small and manageable enough to visualize. Originally the data is unstructured: it needs to be gridded, it needs to be projected onto a map, and you need to generate tile pyramids, which are basically lower-resolution versions of the original data that can be viewed and manipulated in a web browser. This process works, but it's quite slow, it takes a lot of storage, and it's not interactive, meaning that if you want to change details of the visualization, view it with a different color map or a different scale, apply algorithms, or resample it, that becomes very, very difficult. At the same time, there are a lot of questions you'd like to ask about your data that are more quantitative in nature than visualization: time series, exploratory data analysis with clustering, or even applying machine learning algorithms. And because the data can be so large right now, gigabytes and gigabytes, these algorithms tend to be very slow; if you want to see a time series that spans years and years, it can take hours to compute. I think one of the most essential things for really understanding data is having an intuitive data exploration process where, as a scientist, you form questions about your data.
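To make the tile-pyramid pre-processing described above concrete, here is a minimal Python/NumPy sketch of the idea: repeatedly downsample a gridded field and cut each level into fixed-size tiles. The array shape, tile size, and function name are hypothetical, not the actual pipeline behind GIBS or this prototype.

```python
import numpy as np

def build_tile_pyramid(grid, tile_size=256, levels=5):
    """Cut a gridded field into fixed-size tiles at successively halved
    resolutions (hypothetical helper, for illustration only)."""
    pyramid = []
    level = grid
    for _ in range(levels):
        # Cut the current level into tile_size x tile_size tiles.
        tiles = [
            level[i:i + tile_size, j:j + tile_size]
            for i in range(0, level.shape[0], tile_size)
            for j in range(0, level.shape[1], tile_size)
        ]
        pyramid.append(tiles)
        # 2x2 block-average to produce the next, lower-resolution level.
        h, w = (level.shape[0] // 2) * 2, (level.shape[1] // 2) * 2
        level = level[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return pyramid

# A 4096 x 8192 global grid yields every tile at every zoom level up front.
pyramid = build_tile_pyramid(np.random.rand(4096, 8192).astype(np.float32))
```

Every level and every tile here has to be computed and stored ahead of time, which is why changing the color map, scale, or resampling later means regenerating everything.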
You ask: is there a trend here, in this small part of the world? Do I see patterns, do I see trends? You should be able to ask those questions and have them answered intuitively, to explore and manipulate the data in a really intuitive, flexible, and powerful way.

Right now, current methods are all CPU-based, which means they use a very traditional way of doing computation that iterates over every single data point in an image or a dataset and applies some kind of algorithm. GPUs are a kind of hardware that lets you apply algorithms in parallel across all of the data points in your dataset. That means that if you want to apply a color map, reproject your data, or ask questions like what is the mean, or how does this change over time, all of that can be done almost instantaneously, because the computations are performed in parallel on all of your data. These technologies are widely used today in machine learning and in computer graphics, but they have not been adopted as widely in the Earth science community.

Fundamentally, that's what this presentation is about: it presents a new GPU visualization prototype, a kind of software library that allows you to work directly with raw Earth science data.
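As a rough sketch of what "applying an operation in parallel across all data points" looks like in practice, here is one way to do it with CuPy, a GPU array library for Python. This is an illustrative example under my own assumptions, not the prototype's code; the array values and the simple two-color ramp are stand-ins.

```python
import cupy as cp

# A gridded sea surface temperature field resident on the GPU (illustrative values).
sst = cp.random.uniform(-2.0, 35.0, size=(4096, 8192)).astype(cp.float32)

# Elementwise work runs in parallel over every pixel: normalize to [0, 1] and
# push through a simple blue-to-red ramp standing in for a real color map.
lo, hi = sst.min(), sst.max()
norm = (sst - lo) / (hi - lo)
rgb = cp.stack([norm, cp.zeros_like(norm), 1.0 - norm], axis=-1)

# Reductions such as mean, max, and standard deviation are single parallel passes,
# so "what is the mean of the current view?" comes back almost immediately.
stats = {"mean": float(sst.mean()), "max": float(sst.max()), "std": float(sst.std())}
```

Swapping the color map, the scale, or the statistic is just a different elementwise expression or reduction over the same array already sitting in GPU memory.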
You have new data that has just come down from a satellite; it's unstructured, it hasn't been gridded, it hasn't been processed. You can simply load it and grid the data, project it, rescale it, choose for instance a logarithmic scale instead of a linear scale, apply recent machine learning algorithms for visualization or for data analysis and correlation, and all of this can be done extremely quickly and extremely flexibly using this new technology. Things that are currently either impossible or take a very long time with technologies like the Science Data Analytics Platform or NASA GIBS can be done almost instantly, and that speed is what I think makes data exploration really powerful and really easy to do.

Here's an example using NASA MUR sea surface temperature data. All of this imagery is generated dynamically, on the fly, at multiple resolutions, and you can view it in a web browser. These multi-resolution tiles, which before would have to be pre-generated and pre-processed, taking hours and additional storage, can be generated in real time on the GPU. Tiles can also be generated directly from L1 or L2 data, meaning the data doesn't even need to be pre-processed, and that means you can actually reduce the amount of storage you need. Here, for instance, you can view the data on a logarithmic scale, which is completely impossible today with data that has already been gridded and had a color map applied, because there you're working with binned data instead of the raw scientific data. So by working directly with raw data, you're able to keep full scientific fidelity while you're building these visualizations. Here's another example where you zoom in on a region of interest, and because you're working directly with the raw data, you can view, for instance, the mean, the max, the standard deviation, and the variance of the region in the current view. Simply by zooming in on a region of the map, you're able to ask how the temperature in this region differs from other parts of the world and how it changes. Here, also, you can apply computer vision filters; this is a Sobel filter, which shows gradients, basically regions in this ocean where there are large temperature gradients, and this can be generated from the raw data just as easily as any other kind of NASA data.
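The Sobel filter mentioned here is a standard computer-vision operation, and running it on data that is already on the GPU is a small amount of code with CuPy's ndimage routines. A minimal sketch, assuming a gridded temperature tile; the variable names and sizes are illustrative, not the prototype's implementation.

```python
import cupy as cp
from cupyx.scipy import ndimage

def gradient_magnitude(field):
    """Sobel gradient magnitude of a 2-D field, highlighting regions
    (for example, ocean fronts) with strong temperature gradients."""
    gx = ndimage.sobel(field, axis=0, mode="nearest")
    gy = ndimage.sobel(field, axis=1, mode="nearest")
    return cp.hypot(gx, gy)

# Illustrative 512 x 512 sea surface temperature tile already resident on the GPU.
sst_tile = cp.random.uniform(10.0, 30.0, size=(512, 512)).astype(cp.float32)
edges = gradient_magnitude(sst_tile)
```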
So by working directly with raw data, you really have this advantage of being able to manipulate and visualize your data in real time, flexibly and intuitively.

As an example of the way this can be used in Earth science, we're looking at some NASA ocean data: we have the MUR sea surface temperature, the AVHRR temperature anomaly, and a variety of ECCO products, and we can ask questions about these. Are there correlations between these products? How do they change over time? Is there variance, and are there important differences between different parts of the world? Simply by visualizing the data conveniently, we can answer these questions almost instantly. Here is an example: we can ask questions about the El Niño and La Niña effects. This is part of a climate cycle in the southern Pacific Ocean, off the coast of Ecuador; it's an approximately four-year cycle where this band off the coast of South America goes through higher-than-average and lower-than-average temperatures. In 2015 there was a very intense El Niño period where the temperature increased by as much as 3 degrees Celsius compared to the average. The advantage of this kind of technology is that you can actually generate this anomaly data, that is, how much the temperature differs from historic patterns, without precomputing it. You can simply look at past data, look at the current data, and deseasonalize, removing those historic patterns, so you can see this dynamic data without computing or storing the anomaly data separately. Here, for instance, we have the AVHRR sea surface temperature data; we're off the coast of South America right now, and all of this data is dynamically generated, so you aren't using gigabytes and gigabytes of additional storage. You can simply zoom in and out dynamically, like you could with a NASA product like Worldview; you can see how it looks, you can experiment, you can move around the map and work really flexibly. Here, for instance, we're just jumping day after day, and all of this imagery is generated immediately, within a hundred milliseconds, working from the raw unstructured data. So rather than taking hours to pre-process and hours to figure out how to answer a question, with GPUs you're able to just ask a question and then have an answer. I think that's really essential for visualization.
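The deseasonalizing step described here can be expressed as little more than subtracting a same-day-of-year average of past fields. Here is a minimal CuPy sketch of that idea; the climatology definition, array shapes, and function name are my own simplifications, not the talk's actual method.

```python
import cupy as cp

def sst_anomaly(daily_sst, day_of_year, target_index):
    """Anomaly for one day, computed on the fly: subtract the mean of every
    field sharing the same day of year (a crude stand-in climatology)."""
    same_day = day_of_year == day_of_year[target_index]
    climatology = daily_sst[same_day].mean(axis=0)
    return daily_sst[target_index] - climatology

# Illustrative stack: about six years of daily global fields, shaped (time, lat, lon).
daily_sst = cp.random.uniform(-2.0, 35.0, size=(2190, 180, 360)).astype(cp.float32)
day_of_year = cp.arange(2190) % 365
anomaly = sst_anomaly(daily_sst, day_of_year, target_index=2000)
```

Nothing is precomputed or stored: the anomaly for any day, any region, and any reference period can be produced on demand from the raw fields.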
Sometimes you also have more quantitative questions: you want to ask how something has changed over time. Simply by selecting a region of interest, you immediately have a time series that spans, in this case, about four years. You can zoom in and immediately see that in the later part of 2015, in this specific region of interest, there is a dramatic increase in the sea surface temperature anomaly. Again, all of this is generated just from the raw sea surface temperature data: we've generated the anomaly data, we're able to zoom in and out, you're able to select regions and ask questions, like how dramatic is this El Niño effect, and you get those answers very quickly. So I think these technologies are really a remarkable way of improving the scientific process.

Finally, GPUs are most widely used for machine learning, which has become a huge field these days, with new breakthroughs every day in computer vision, anomaly detection, language, and video. All of these algorithms are implemented on the GPU and can be computed extremely quickly there, and because the GPU is already part of the visualization pipeline, new algorithms can just be dropped in and applied as part of the process. Here, for instance, you have in the leftmost column MODIS true-color data, which is optical imagery taken of the whole world almost every day, but there are parts of the world where there actually is no data because a satellite never passed directly over that region. The remarkable thing is that, looking only at the leftmost column, you're able to predict the kinds of values that would occur had you had a satellite over that region, and you're able to fill them in as part of the visualization process. So if you want to see a beautiful view of the entire world, tile by tile, in less than 100 milliseconds you can apply these state-of-the-art machine learning algorithms, fill in the details, and view the whole world with the inferred data. You can also have more scientifically relevant applications where you infer missing values in data or find anomalies; for instance, if you have access to data from depths in the ocean, you can fill in those gaps using machine learning. I think having this unified pipeline that combines visualization, analysis, and machine learning is a really essential and powerful combination of these technologies, and it makes visualization very useful.
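Because the tiles are already produced on the GPU, a model that fills in missing pixels can be slotted into the same path. Here is a minimal sketch of that plumbing; the `predict` argument is a placeholder for whatever inpainting model is available, and the mean-fill stand-in below exists only to make the example runnable. It is not the algorithm the talk refers to.

```python
import cupy as cp

def fill_missing(tile, predict):
    """Replace missing pixels (NaNs) in a tile with model-predicted values,
    then hand the completed tile to the normal GPU rendering path."""
    gaps = cp.isnan(tile)
    if not bool(gaps.any()):
        return tile
    estimate = predict(tile)  # the model's estimate of the complete field
    return cp.where(gaps, estimate, tile)

def mean_fill(tile):
    # Trivial stand-in "model": predict every pixel as the mean of observed pixels.
    return cp.zeros_like(tile) + cp.nanmean(tile)

# Simulate a 256 x 256 tile with a swath gap the satellite never covered.
tile = cp.random.uniform(0.0, 1.0, size=(256, 256)).astype(cp.float32)
tile[64:128, 64:128] = cp.nan
completed = fill_missing(tile, mean_fill)
```

Swapping `mean_fill` for a trained inpainting network is the "drop new algorithms into the pipeline" step described above.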
Finally, and very quickly: this can all be done in the cloud, so you don't actually have to have a GPU yourself; you can simply go to a website and view all of this data dynamically. In terms of speed and performance, this can be dramatically faster than anything that exists currently: generating individual tiles can be as much as a hundred times faster, and analytics can be as much as a thousand times faster using GPU acceleration. So this whole pipeline is flexible, it's powerful, and I think it lets you answer questions that are otherwise very difficult, or make observations that you might never have thought to make, because you're able to interact seamlessly and fluidly and answer questions almost as soon as you have them. I want to thank JPL for having me for the summer, where I did a lot of this work, and of course NASA and AGU for sponsoring this competition, this award, and this wonderful conference. Thank you very much.