Elon Musk is putting his AI chips to work — and he’s catching up with Mark Zuckerberg

Elon Musk said on Monday that xAI had brought “the most powerful AI training system in the world” online.

Elon Musk might be distracted by the Brazilian Supreme Court’s decision to ban X, but he isn’t letting that stop him from pushing forward with his AI ambitions.

On Monday, the billionaire said xAI — the company he launched in 2023 — had brought a massive new training cluster of chips online over the weekend, claiming it represented “the most powerful AI training system in the world.”

The system, dubbed Colossus, was built at a site in Memphis, Tennessee, using 100,000 chips from Nvidia, specifically its H100 GPUs. Musk said the cluster was built in 122 days and would “double in size” in a few months as more GPUs are added.

This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days.

Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months.

Excellent…


— Elon Musk (@elonmusk) September 2, 2024

Though Musk confirmed the size of the cluster in July, bringing it online marks a key step for his AI ambitions and, critically, allows him to play catch-up with his Silicon Valley nemesis Mark Zuckerberg.

Zuckerberg’s and Musk’s ambitions — in Musk’s case, to turn xAI into a company that advances “our collective understanding of the universe” with its Grok chatbot — depend on high-performance GPUs, which provide the computing power required for powerful AI models.

These haven’t exactly been easy to come by, nor have they been cheap.

The hype around AI since the release of ChatGPT in late 2022 has left companies scrambling for Nvidia GPUs, with shortages stemming from frenzied demand and supply constraints. In some instances, the chips have sold for upward of $40,000 apiece.

Despite the barriers to access, companies have sought to secure a supply of GPUs in any way they can and put them to work to edge ahead of rivals.


Llama versus Grok

Nathan Benaich, the founder and general partner of Air Street Capital, has been tracking the number of H100 GPUs acquired by tech companies. He put Meta’s total at 350,000 and xAI’s at 100,000. Tesla, one of Musk’s other companies, was at 35,000.

In January, Zuckerberg said Meta would have a stockpile of 600,000 GPUs by the end of the year, with some 350,000 of those being Nvidia’s H100s.

Microsoft, OpenAI, and Amazon haven’t disclosed the sizes of their H100 piles.

Meta hasn’t disclosed how many of the 600,000 GPUs in Zuckerberg’s target have been secured or how many have been put to use. But in a research paper published in July, Meta said the largest version of its Llama 3 large language model had been trained on 16,000 H100 GPUs. In March, the company announced “a major investment in Meta’s AI future” with two 24,000-GPU clusters to support the development of Llama 3.

That suggests xAI’s new training cluster, with its 100,000 H100 GPUs, is far larger than the cluster used to train Meta’s largest AI model.

The scale of the feat hasn’t been lost on the industry.

On X, a post from Nvidia’s data-center account in response to Musk said, “Exciting to see Colossus, the world’s largest GPU #supercomputer, come online in record time.”

Greg Yang, an xAI cofounder, had a more colorful response to the news that riffed on a song by the American rapper Tyga:

Hunnids, hunnids
Throwin’ hunnids, hunnids
Hunnids, hunnids
Rack city bitch
Rack rack city bitch https://t.co/1KAQUZVItH


— Greg Yang (@TheGregYang) September 2, 2024

Shaun Maguire, a partner at the venture-capital firm Sequoia, wrote on X that the xAI team now “has access to the world’s most powerful training cluster” to build the next version of its Grok chatbot. He added, “In the last few weeks Grok-2 catapulted to being roughly at parity with the state of the art models.”

But, as with most AI companies, there are big question marks over commercializing the technology. “It’s impressive xAI has been able to raise so much with Elon and make progress, but their product strategy remains unclear,” Benaich told B-17.

In July, Musk said the next version of Grok — after training on 100,000 H100s — “should be really something special.”

We’ll find out soon enough how competitive it makes him with Zuckerberg on AI.
