Nvidia’s Blackwell cooling issues were resolved months ago, chip experts say
The rollout of Nvidia’s most advanced chip ever, Blackwell, has been met with both excitement and anxiety.
First, there were chip design issues, which CEO Jensen Huang has since said are fully resolved. Now cooling is top of mind after a report from The Information detailed worries from a few Nvidia customers that the larger configurations of Blackwell chips, the most advanced on the market, were overheating.
Dylan Patel, chief analyst at Semianalysis, told B-17 that though cooling will be a major concern as Blackwell ramps up, and for all the chips after, the Blackwell design issues related to cooling are resolved.
“I think the overheating issues have been present for months and they have largely been addressed,” Patel told Insider. Rumors of overheating chips go back to the summer, he said.
“When we track them down, these are overblown,” Patel posted on X in August.
In August, Semianalysis, which has a team of more than a dozen experts monitoring every aspect of the semiconductor industry, reported that issues had arisen with the cooling systems, triggering reworks from several suppliers. The team of five analysts on the August report called the changes “minor.”
The immense computing power that Nvidia advertises for Blackwell, or any chip, depends on the system design and how it’s set up and used. Mileage may vary based on how the chips are installed, networked together, cooled, and programmed.
Though it can be power-hungry and therefore costly, effective cooling has a huge impact on data center operators’ ability to turn a profit.
Overheated chips simply stop working until they cool down, and the ability to keep them cool impacts how much computing each chip can do in a given period of time. Downtime and cooling costs impact the total cost of ownership for the useful life of the chip.
‘Teething issues’
Much of the recent concern has been directed at the GB200 NVL72, which represents a new frontier for data centers. The ’72’ in the moniker comes from 72 Blackwell graphics processing units in the server, in addition to 36 traditional central processing units.
Because so many chips are grouped together so tightly to function as one superchip, the towers get very hot and require liquid cooling.
Liquid cooling isn’t new, but doing it at data center scale has been relatively rare to date. As hyperscalers and a select few other Nvidia customers start to receive their chips through the rest of 2024 and in the first half of 2025, adjustments must be made, and in some cases, new buildings brought online.
Meta has reportedly redone its data center design to account for the increased power density and cooling needs of future generations of AI chips.
“There are going to be teething issues,” Patel said. “People just don’t have best practices — everyone is learning how to do it all together,” he continued.
New data centers will be built for liquid cooling, but many existing facilities are being retrofitted. This is a fairly demanding task. In addition to all the components fitting perfectly to avoid a single drop of leakage, liquids must circulate at precise temperatures.
The transition is likely to have more awkward moments, Patel told B-17.
“Engineering iterations are normal and expected,” an Nvidia spokesperson told B-17.
In addition to engineering and operational challenges, liquid cooling at scale brings with it a list of environmental concerns. An internal document from Amazon, obtained by B-17 Chief Correspondent Eugene Kim, said Amazon is “straining local jurisdictions’ existing infrastructure” for water in some regions and is “dependent on long-term infrastructure upgrades or building our own solutions” to mitigate the issue. (An Amazon spokesperson told B-17 its data centers were more water efficient than the industry average.)
Despite the hard work and environmental strain of converting to liquid cooling, the incentives are strong.
“Any datacenter that is unwilling or unable to deliver on higher density liquid cooling will miss out on the tremendous performance TCO improvements for its customers and will be left behind in the Generative AI arm’s race,” Semianalysis analysts wrote in October.
Cooling companies are gearing up for the shift. Vertiv stock hit an all-time high Tuesday.
Patel said earlier issues with Blackwell’s chip design and the challenges of installing Blackwell systems have reduced the number of chips shipping this year. Semianalysis has estimated that 200,000 will ship this calendar year, mostly to hyperscalers. Smaller cloud computing companies are expecting to receive chips in the spring of 2025, though some are still uncertain as to how many they will be allocated and when they might receive them.
Despite multiple challenges, Blackwell is scaling fast, Patel said.
Huang promised billions in revenue from Blackwell this year on the company’s August earnings call, and that’s what the company is likely to focus on during Wednesday’s call.
Nvidia’s stock fell Monday following the overheating report. By midday Tuesday it had climbed above its price at Friday’s market close.