Amazon’s AI chip executive tells B-17 why Nvidia is not a competitor, how Anthropic helps, and what AMD needs

Gadi Hutt, AWS Annapurna Labs’ senior director of customer and product engineering.

Amazon Web Services launched an upgraded line of AI chips this week, putting the company squarely in competition with Nvidia.

Except AWS doesn’t see it that way.

AWS’s new AI chips aren’t meant to eat Nvidia’s lunch, said Gadi Hutt, a senior director of customer and product engineering at the company’s chip-designing subsidiary, Annapurna Labs. The goal is to give customers a lower-cost option, since the market is big enough for multiple vendors, Hutt told B-17 in an interview at AWS’s re:Invent conference.

“It’s not about unseating Nvidia,” Hutt said, adding, “It’s really about giving customers choices.”

AWS has spent tens of billions of dollars on generative AI. This week the company unveiled its most advanced AI chip, called Trainium 2, which can cost roughly 40% less than Nvidia’s GPUs, and a new supercomputer cluster using the chips, called Project Rainier. Earlier versions of AWS’s AI chips had mixed results.

Hutt insists this isn’t a competition but a joint effort to grow the overall size of the market. Trainium and Nvidia’s GPUs also target different customer profiles and AI workloads, he said, adding that Nvidia’s GPUs would remain dominant for the foreseeable future.

In the interview, Hutt discussed AWS’s partnership with Anthropic, which is set to be Project Rainier’s first customer. The two companies have worked closely over the past year, and Amazon recently invested an additional $4 billion in the AI startup.

He also shared his thoughts on AWS’s partnership with Intel, whose CEO, Pat Gelsinger, just retired. He said AWS would continue to work with the struggling chip giant because customer demand for Intel’s server chips remained high.

Last year AWS said it was considering selling AMD’s new AI chips. But Hutt said those chips still weren’t available on AWS because customers hadn’t shown strong demand.

This Q&A has been edited for clarity and length.

There have been a lot of headlines saying Amazon is out to get Nvidia with its new AI chips. Can you talk about that?

I usually look at these headlines, and I giggle a bit because, really, it’s not about unseating Nvidia. Nvidia is a very important partner for us. It’s really about giving customers choices.

We have a lot of work ahead of us to ensure that we continuously give more customers the ability to use these chips. And Nvidia is not going anywhere. They have a good solution and a solid road map. We just announced the P6 instances [AWS servers with Nvidia’s latest Blackwell GPUs], so there’s a continuous investment in the Nvidia product line as well. It’s really to give customers options. Nothing more.

Nvidia is a great supplier to AWS, and our customers love Nvidia. I would not discount Nvidia in any way, shape, or form.

So you want to see Nvidia’s use case increase on AWS?

If customers believe that’s the way they need to go, then they’ll do it. Of course, if it’s good for customers, it’s good for us.

The market is very big, so there’s room for multiple vendors here. We’re not forcing anybody to use those chips, but we’re working very hard to ensure that our major tenets, which are high performance and lower cost, will materialize to benefit our customers.

Does it mean AWS is OK being in second place?

It’s not a competition. There’s no machine-learning award ceremony every year.

In the case of a customer like Anthropic, there’s very clear scientific evidence that larger compute infrastructure allows you to build larger models with more data. And if you do that, you get higher accuracy and more performance.

Our ability to scale capacity to hundreds of thousands of Trainium 2 chips gives them the opportunity to innovate on something they couldn’t have done before. They get a 5x boost in productivity.

Is being No. 1 important?

The market is big enough. No. 2 is a very good position to be in.

I’m not saying I’m No. 2 or No. 1, by the way. But it’s really not something I’m even thinking about. We’re so early in our journey here in machine learning in general, the industry in general, and also on the chips specifically, we’re just heads down serving customers like Anthropic, Apple, and all the others.

We’re not even doing competitive analysis with Nvidia. I’m not running benchmarks against Nvidia. I don’t need to.

For example, there’s MLPerf, an industry performance benchmark. Companies that participate in MLPerf have performance engineers working just to improve MLPerf numbers.

That’s completely a distraction for us. We’re not participating in that because we don’t want to waste time on a benchmark that isn’t customer-focused.

On the surface, it seems like helping companies grow on AWS isn’t always beneficial for AWS’s own products because you’re competing with them.

We’re the same company that is the best place for Netflix to run, and we also have Prime Video. It’s part of our culture.

I will say that there are a lot of customers that are still on GPUs. A lot of customers love GPUs, and they have no intention to move to Trainium anytime soon. And that’s fine, because, again, we’re giving them the options and they decide what they want to do.

Do you see these AI tools becoming more commoditized in the future?

I really hope so.

When we started this in 2016, the problem was that there was no operating system for machine learning. So we really had to invent all the tools that go around these chips to make them work for our customers as seamlessly as possible.

If machine learning becomes commoditized on the software and hardware sides, it’s a good thing for everybody. It means that it’s easier to use those solutions. But running machine learning meaningfully is still an art.

What are some of the different types of workloads customers might want to run on GPUs versus Trainium?

GPUs are more of a general-purpose processor for machine learning. All the researchers and data scientists in the world know how to use Nvidia pretty well. If you invent something new and you do that on a GPU, then things will work.

If you invent something new on specialized chips, you’ll have to either ensure compiler technology understands what you just built or create your own compute kernel for that workload. We’re focused mainly on use cases where our customers tell us, “Hey, this is what we need.” Usually the customers we get are the ones that are seeing increased costs as an issue and are trying to look for alternatives.

So the most advanced workloads are usually reserved for Nvidia chips?

Usually. If data-science folks need to continuously run experiments, they’ll probably do that on a GPU cluster. When they know what they want to do, that’s where they have more options. That’s where Trainium really shines, because it gives high performance at a lower cost.

AWS CEO Matt Garman previously said the vast majority of workloads will continue to be on Nvidia.

It makes sense. We give value to customers who have a large spend and are trying to see how they can control the costs a bit better. When Matt says the majority of the workloads, it means medical imaging, speech recognition, weather forecasting, and all sorts of workloads that we’re not really focused on right now because we have large customers who ask us to do bigger things. So that statement is 100% correct.

In a nutshell, we want to continue to be the best place for GPUs and, of course, Trainium when customers need it.

What has Anthropic done to help AWS in the AI space?

They have very strong opinions about what they need, and they come back to us and say, “Hey, can we add feature A to your future chip?” It’s a dialogue. Some ideas they came up with weren’t feasible to even implement in a piece of silicon. We actually implemented some ideas, and for others we came back with a better solution.

Because they’re such experts in building foundation models, this really helps us home in on building chips that are really good at what they do.

We just announced Project Rainier together. This is someone who wants to use a lot of those chips as fast as possible. It’s not an idea — we’re actually building it.

Can you talk about Intel? AWS’s Graviton chips are replacing a lot of Intel chips at AWS data centers.

I’ll correct you here. Graviton is not replacing x86. It’s not like we’re yanking out x86 and putting Graviton in place. But again, following customer demand, more than 50% of our recent landings on CPUs were Graviton.

It means that the customer demand for Graviton is growing. But we’re still selling a lot of x86 cores too for our customers, and we think we’re the best place to do that. We’re not competing with these companies, but we’re treating them as good suppliers, and we have a lot of business to do together.

How important is Intel going forward?

They will for sure continue to be a great partner for AWS. There are a lot of use cases that run really well on Intel cores. We’re still deploying them. There’s no intention to stop. It’s really following customer demand.

Is AWS still considering selling AMD’s AI chips?

AMD is a great partner for AWS. We sell a lot of AMD CPUs to customers as instances.

The machine-learning product line is always under consideration. If customers strongly indicate that they need it, then there’s no reason not to deploy it.

And you’re not seeing that yet for AMD’s AI chips?

Not yet.

How supportive are Amazon CEO Andy Jassy and Garman of the AI chip business?

They’re very supportive. We meet them on a regular basis. There’s a lot of focus across leadership in the company to make sure that the customers who need ML solutions get them.

There’s also a lot of collaboration within the company with science and service teams that are building solutions on those chips. Products from other teams within Amazon, like Rufus, the AI assistant available to all Amazon customers, run entirely on Inferentia and Trainium chips.
