Why You Shouldn’t Fear Fast Innovation Through the Cloud

Speaker:

Please welcome to the stage Director of HPC & AI for Research at Microsoft, Tim Carroll.

Tim:

One should always start these off with good news. We’re going to end about half an hour sooner today, which means we will get to the social hour half an hour sooner, and I would like to take … Well, I will go ahead and take full credit for that, so give me some good guy points? Thank you. Thank you very much. All right.

Tim:

The other thing too, is that knowing that I was going on last, I figured I’d give this talk a title that people would be like, “Huh?” to at least get people back in the room, and so the title of this will soon be revealed. What I want to talk about today sounds like an 11 on the scale of 10 hype cycle, but if you heard a little bit of my background, which I don’t like the long introductions, but I’ve been doing this since 2001, and so I’ve lived through a couple of these technology shifts. I’ve been through the hype cycles, on the enterprise side, client server, and virtualization, all the things that were going to change the planet, and then went through Linux clusters, but I really do fundamentally believe that we’re on the verge of something, with all the talks that you heard today, something that is going to fundamentally change how we as a society and how we as a commercial entity perform research, and it’s going to be for the better, right? Let’s back up 100 years first. One of the early, early innovators in Numerical Weather Prediction was a researcher, Lewis Fry Richardson.

Tim:

I’ll spare you reading it because I don’t have a very good English accent, but essentially, what he was saying is that the whole trick here … Bear in mind, he’s talking about computers when computers were still people, right? The whole trick here is to figure out, “How do we advance the computations in a way that is meaningful to society, but also at a price point where society can afford the value that they deliver?” and so it’s really interesting that even though he was a meteorologist at the time, he was still dialed in on this notion of, “Hey, all this science is great, but if we can’t afford it, it doesn’t do anybody any good.” For all of the emotion that seems to have gotten wrapped around cloud, which I don’t quite understand why, but it has, this is just one more point on the journey. If Sam Altman’s conversation at lunchtime did nothing, it should.

Tim:

What I felt great about what he was talking about was that we’re just a blip, right? This is not a huge inflection point, we’re just a blip on the journey, and I really think that we should approach it as such, and so what I’d like to do for just the next 20 minutes or so, and I realized that we have a diverse audience here, and anytime you’re talking to a group of HPC folks or AI folks or whatever, there’s always the, there’s the research community, there’s the people who have to support the research community by building the systems that support them, and then there are the people who have to pay for it, right? You get those three people in the room, and it’s very difficult to talk to a room like that because they all have very different stack ranks of priorities, but we’ll give it a shot. What I’m hoping is that by the end of this, that I give each category a tool to go back and get something that they’re not getting today, or at least a different way of looking at it. What I’m going to do, credit to Earl for this slide.

Tim:

You recognize this one. This is from Hyperion’s research, and basically, it was a study that they did or a compilation of data that showed the growth of the processor units that were installed for what we would call HPC, starting roughly at the onset of the Linux cluster evolution, revolution, whatever you want to call it, so starting in about ’96 up through ’17. I don’t want to dive deep into exactly what the numbers are, but I just want you to take a snapshot of that for a moment, because if you see that kind of growth, and since we’re also in Silicon Valley and there are startups here and founders and everybody else, this is the hockey stick that we, as entrepreneurs, love, right? This is a unicorn. Everybody talks about the hockey stick.

Tim:

People don’t often see it, but in our space, we saw it and we experienced it, but what’s really, really important to this community is that that’s not because of the same, and I’m making up a number, the same 5,000 researchers worldwide used 1,000 times more compute. It’s because the number of people who had access to compute went from 5,000 to 50,000. That’s how you get that kind of growth and that kind of scale. The other piece of it is, is that it’s easy to turn that into a processor unit scale and all the other pieces, but there was real science behind this. For people who do what we do for a living, which is try to figure out how to make a business out of HPC, this is a hard place to make a living, but you do it because you love it, because there’s something good that happens in the other side of it.

Tim:

There’s another trend chart that gets flashed around all the time that I want people to consider here, and so after this morning’s talk about the cost of genomics, of sequencing the genome, bringing it down, and then being able to use cloud to increase the accessibility, the post-processing, this is exactly what we’re talking about, and so everybody likes to point out that we went from $100 million for the first sequencing, and by 2025, we hope to have it down to $1. Typically, what they’re talking about is the cost of the sequencer and the cost of the consumables that people are doing the sequencing against, but there’s a really interesting dynamic when you put the two of these together that should provide us some indication as to how profound the impact of cloud is going to be on what we all do. Take these two together and look at the dates and how they correspond, right? In ’96 was when Don Becker and Thomas Sterling at NASA started plinking around to figure out, “Hey, is there a way to take some of the stuff that we’re going to throw out anyway, tie it together and maybe give access to compute that we couldn’t get access to before?” We saw how that took off.

Tim:

In that same time period is when the cost of the sequencers and the consumables came down. Because I was running Dell’s business at the time, I know what percentage of the HPC clusters were actually sold into academia, federal government, and pharma companies, and the vast preponderance of those were bought with dollars that were flowing into genomics, and so if you look at that trend line, I wonder where it would be if not for genomics: the sequencers and the consumables were now creating data that one could process, and at the same time, the accessibility of compute now gave us the ability to convert that wet lab science into meaning, which then created additional research, additional grants, additional product lines that people invested in. It really did create its own flywheel, and so what’s so exciting to me is that what happened there is really a microcosm of what I think is going to happen going forward. It’s also interesting that out of this time, there’s a person who’s at TGen today, his name is James Lowey, and he was running the infrastructure at TGen in Phoenix. He came to my team and said, “Hey, we’ve got this idea, and we think that we have the ability to put enough compute into a rack, enough compute and storage that we could actually deploy a cluster in a clinical environment so that we could then take the sequencing and some database work to compress the amount of time it takes to put a treatment plan together once we have diagnosed pediatric neuroblastoma.”

Tim:

Right, and so when I talk about doing things that matter, that was this project that a bunch of people came together to work on, and we were actually able to accomplish that. Whenever I tell that story, I always feel that it’s important to point out that the Michael Dell Foundation just underwrote it. They thought this was not Dell, the company. I was working at Dell, the company, but the Michael Dell Foundation said, “Yep. We think that’s something of value because we demonstrated how we could bring technology and research together with something that was meaningful and do it in a way that we could demonstrate the success,” but in the course of that, the one piece that James had talked about …

Tim:

This is 2009. The one piece that James talked about was, so then imagine if we had one of these in each one of the cancer centers around the country or around the globe, and then we could trade that data, we could send that data back and forth with each other so that we then made our data sets even more valuable so that we could make our diagnoses more accurate, we could make our treatment plans more accurate, but at the time, we just couldn’t get there. Even though cloud had just started to reach its infancy, the pipes weren’t in place yet to be able to do it, but for me anyway, it’s what really set the hook for me that I think we’re on the precipice of something that’s going to be very important. When we take a look at … If you haven’t figured out, I built this slide.

Tim:

There is nothing artistic about it, but what it is, is that we are in a space that even though there are the folks who control the money, the folks who run the centers, and they, for the most part, get to decide what gets built and who gets to use it, and then there are the researchers, every great jump that we’ve seen in technology has been because the research community needed something that they couldn’t get. The reason we even had the Linux cluster revolution was because supercomputers were the exclusive domain of the cool kids, the 500 people who got to use them in the big government centers, and so Don Becker and Thomas Sterling were trying to solve an access problem. Then, as we went along, we figured out, “This could work. Now, we need to figure out how to solve bigger problems,” but then, once we started solving bigger problems, we were putting bigger clusters. The problem is, they were sitting in closets, under desks, they were starting fires, they were OSHA hazards, and so then we said, “Okay.”

Tim:

“Well, we need to take these and we need to roll them up, and we need to put them into data centers,” and that was the advent of the big NSF-funded academic data centers, and so if you look at each time we’ve had a jump of some sort in technology, yes, there was a technology vendor who was putting a product forward, but the needle didn’t move until the community said, “I’ve got a problem that I can’t solve today with what I’ve got,” so I think. Just my opinion. In 2007, when AWS came on the scene, and then Google, and Azure, I was still at Dell. I didn’t leave until 2013, but people started talking about cloud, but quite honestly, those center directors had done a really good job of building infrastructure that gave people access to the amount of compute that they needed, right?

Tim:

Even though the cloud providers were getting better and better at running these workloads, and you would see cool benchmarks that came out every once in a while, or you might see a corner case of Steve Lister at Novartis who did something remarkable, we really didn’t get that traction that we needed to get, and a big part of that was because the technology was there to solve a problem that did not yet exist, and so we are in a place today where not having enough compute is a problem that exists, and it’s because AI is coming. HPC is already there. All of these people are coming to the same centers and looking for resource, and at the same time, the center directors are saying, “The researchers are all coming to me, and they’re asking me for things. I’m doing the math. These numbers are unattainable even if I had a blank checkbook.”

Tim:

The people with the money are saying, “You don’t have a blank checkbook, and not only do you not have a blank checkbook, but I need you to take a good, hard look at how you’re doing things because I don’t want to build another data center.” I was just with a customer last week that I’ve worked with for 12 years. He said, “Hey, I’m going to get an estimate from our facilities folks, probably in about a month or two, and it’s going to tell me that it’s going to cost me $10 million to upgrade my facilities in order to be able to house the compute that I need based on the calculations that I’ve made of what we’re going to need in order to serve my agency’s mission going forward.” It’s going to cost $10 million to build the facility to house the compute that they haven’t bought yet. His challenge is that the data center money is the same as the compute money, so it’s, “Do I build a data center and not have computers, or do I put computers in a data center that can’t hold them?”

Tim:

Right? For him, this whole notion … This is a person that when I called him to tell him that I was leaving Dell to do an HPC cloud thing, he said … My wife’s name is Danielle. He was like, “Did you talk to Danielle about this first?” because he thought there was no way that that was going to happen successfully.

Tim:

Right? The point of that is, is that most of the folks that I talk to, who are the people who have been around long enough and they’re sort of transparent enough to say, “Look, I’m just trying to keep the research community fed,” it has flipped from trying to build the business case to use cloud to now, trying to figure out if there’s even a business case that merits continuing down an infrastructure path. Now, nowhere in that conversation is anybody talking about closing data centers, right? What we’re talking about is, what is it that cloud can do to enable the research community to do what they cannot do today? That’s the reason for this slide.

Tim:

This is my only wonky benchmarky slide. This is a WRF run, and so for the people who aren’t familiar, WRF is a weather code. The nice thing about WRF is that there is a user base of about 40,000 people around the world who use this weather code, and it’s really a community-based code. It’s gotten better and better over time as the weather community, the Numerical Weather Prediction community, has contributed back to it. The other nice thing about it is that as an indicator of HPC performance, it’s one of those codes that people run in order to really test the machine because it tests I/O and your InfiniBand, the access times, not only the compute but the latency in the interconnect, all of those other things. It’s one of those benchmark codes.

Tim:

Well, we just did a run, and we took a standard, off-the-shelf data set, called the Hurricane Maria Dataset. We used our standard libraries that we have posted on Azure.com, and we ran at 80,000 cores of MPI at near bare-metal-like performance. We can’t find anyone who’s run on-prem close to that level, let alone somebody that has run that in the cloud, but the point that’s important here is not that it’s 80, because the only reason we did 80 was because we ran out of data. We could have gone to 120, but we needed a bigger model. The other thing too, is that I fully expect that other cloud providers will come out with numbers just as astounding on other codes.

Tim:

The important part is the part that I just glossed over, which is that benchmark was done by taking a public data set and tools that are available, and there’s a user base of 40,000 people who know how to use WRF. In the old days, somebody would see a benchmark like that and think, “Man, that’s awesome. Wouldn’t that be cool to be able to get access to a machine like that, to be able to run something at that scale?” You can. Anybody with, in this case, internet access and an Azure subscription could do this, and so the point in all of this is what is going to make the impact that cloud has so profound. If we go back to James Lowey, when James Lowey talked about cloud, he didn’t talk about it as compute. He didn’t talk about it as lift and shift, or that it was going to somehow be better and cheaper than what he could do internally.

Tim:

He was looking at it as something that could give him the ability to do something he couldn’t do today or at the time, and that it was because it was the collaborative nature that cloud brought to bear, and so what’s going to happen as we go forward? The reason that at some point, Earl, hopefully you and I are retired by then, and when they put the slide that looks back on 2019 through 2029, whoever we’ve turned this over to, will put that slide together, but we’re going to look at scaling that’s beyond anything that anybody could fathom, and it’s because this is the first time … Yeah, my humble opinion, of all of the technologies that we’ve seen, sort of on the hype curve … I’ll call them the enterprise technologies. We’re not talking some of the social stuff.

Tim:

On the enterprise technologies, this is the first time that we’ve delivered a tool to anybody who’d like to access it, that gives virtually unlimited capacity because of roadmap’s forward-looking, unlimited capability, but most importantly, the third leg of the stool is the collaborative nature, and so that what we’re going to be able to do is, is that before, where we had new, cool technologies that we could enable that person and that person and that person and that person, what we’re now going to be able to do is enable all four of them, but give them the ability to amplify what they have done by working together. That has nothing to do with the technology that we deploy. We’ll just deploy the pieces, but this really is the ask. It’s to the researchers and it’s to the people that run the centers, and it’s to the people who are funding it, is understand that cloud is not a thing. It’s not a widget.

Tim:

It’s a business process. The reason I think that people are having such a hard time getting themselves there now is because this really turns upside down how people have done things for the last 15 or 20 or 25 years. Instead of starting with, “Here’s my stuff. Here’s my supercomputer, and I’m going to figure out how to get the thing that’s going to last the longest for the next five years that I can carve up most efficiently, but at the end of the day, I’m trying to solve a how-many-people-can-I-make-happy-with-a-finite-resource problem,” and instead of looking at the thing, which is the infrastructure, we’re looking at each individual researcher’s workflow, and we’re essentially giving them their own environment, but still giving the folks that manage that infrastructure the ability to manage all of them in a way that ensures that it’s secure, and compliant, and paid for, and all those important things, right? The notion here is to stop thinking of cloud as a thing, and understand that it is a business process, but at the same time, once one gets their head around that, it’s so much easier to just say, “Let’s just go try it,” because there’s no trick. When somebody, whether it’s somebody from Azure, or AWS, or Google, says, “Look, let’s just try and go run a POC,” it’s not a trap or a trick.

Tim:

It’s just a way to say the only way to really understand this, rather than what we used to do, which was six months of analysis, this is you are so much better served running it, understanding it, iterate, run it, iterate it, and at the end of six weeks, you’re going to have the data that you need that’s hard data to be able to make the decisions that you want to in terms of charting your path forward. It might be cloud, might be on-prem, mix of some. Who knows? Who cares? Right?

Tim:

With that, I told you I did the stop dreaming thing to be catchy, but what I really mean by this is that for the last 100 years, the focus of what people like me have done for a living is trying to figure out how to bend that cost curve so that we could try and get as close as possible to making the compute cost-effective enough to be socially valuable and commercially viable, right? We have done that, right? I know what the roadmaps and the financials look like going forward from here. We have done it, and this is being recorded, so it’ll come back to me, but we’ve done it. The important thing now is let’s get past that then, and let’s think about what we can do with it. Thank you very much.

Tim Carroll

Tim Carroll is based out of Lutherville-Timonium, Maryland, United States and works at Microsoft as Director, HPC & AI for Research. Enabling the public research community to perform critical science with the increased collaboration and capability of Azure.
