
Amazon invests US$2.75 billion in AI

Amazon is investing US$2.75 billion in AI. If Microsoft owns the half of OpenAI that didn't leave, then Amazon and its cloud division Amazon Web Services need something similar: the half of OpenAI that did leave, meaning Anthropic. That means Amazon has to invest a lot more money than Google, which also invested in Anthropic. But Google also has its own Gemini LLM and is presumed to have more leverage, and it gets GPU system rentals in return.

We live in strange times. Microsoft has invested US$13 billion in OpenAI, including a pledge of US$10 billion last year. Now Amazon is fulfilling its promise to invest US$4 billion in Anthropic by injecting the second installment of US$2.75 billion. It is a brilliant way to acquire a stake in an AI startup: you get access to the startup's models, you understand where they are headed, and you are the first to bring products built on them to market at scale.

Amazon and Microsoft investing heavily in AI

We would love to see how Microsoft and OpenAI, and Amazon and Anthropic, are booking these investments, as well as the licensing of LLMs and the rental of machines to train them and run them in production as part of products. There is a danger of this looking like "round-tripping," where money simply moves from the IT giant to the AI startup as an investment and then flows straight back to the IT giant. It would be enlightening to see how these agreements are actually structured.

If you could get away with it, you would do it too. If you are Amazon/AWS or Microsoft/Azure, you would give Anthropic or OpenAI large amounts of money knowing full well that the vast majority of it would come back as sales of reserved cloud GPU instances. In the case of OpenAI, some of that money can also go toward creating custom ASICs for AI inference and training, which is something Microsoft is already doing with its Maia 100 line of chips.

Time is money

How much money are we talking about to train these AI models? A lot. In his keynote at the GTC 2024 conference last week, Nvidia co-founder and CEO Jensen Huang said something interesting: he confirmed that it took 90 days and 15 megawatts of power to train OpenAI's 1.8 trillion parameter GPT-4 Mixture of Experts LLM. The system used was a cluster of SuperPODs based on H100 "Hopper" GPUs, with InfiniBand outside the node and NVLink inside the node to create an eight-GPU HGX compute complex.
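As a rough back-of-the-envelope check, here is a minimal sketch of what that run consumes, assuming the cluster draws the full 15 megawatts continuously for all 90 days:

```python
# Rough energy estimate for the GPT-4 1.8T MoE training run described above.
# Assumption (ours): the cluster draws the full 15 MW continuously for 90 days.
power_mw = 15          # megawatts, per Huang's GTC 2024 keynote
days = 90

hours = days * 24                  # 2,160 hours of training
energy_mwh = power_mw * hours      # 15 MW * 2,160 h = 32,400 MWh

print(f"Training run: {hours:,} hours, ~{energy_mwh:,} MWh of energy")
```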

Microsoft

Microsoft does not yet provide pricing for its ND H100 v5 instances, but it charges $27.20 per hour for the on-demand eight-GPU ND96asr A100 v4 instance and $22.62 per hour under a one-year reserved "savings" plan.

AWS

At AWS, the p4de.24xlarge A100 instance, which is based on the same eight-way HGX A100 complex as Microsoft's ND96asr A100 v4 instance, costs $40.96 per hour on demand and $24.01 per hour with one year of reserved capacity. The newer p5.48xlarge instance based on the H100s, launched last July and with essentially the same architecture, costs $98.32 per hour on demand for its eight-GPU HGX H100 compute complex, and we believe a one-year reserved instance costs $57.63 per hour.

At AWS, to train GPT-4 1.8T MoE in 90 days, you need 1,000 of these HGX H100 nodes; with 3,000 nodes, or about 24,000 H100 GPUs, it would take about 30 days. Time is definitely money here. With 1,000 nodes, a 90-day run on a cluster of 8,000 H100 GPUs using AWS p5.48xlarge instances at the one-year reserved rate would cost $124.5 million. If Microsoft maintains its pricing advantage over AWS, its instances will be 5.8 percent cheaper, or $117.3 million. To train the GPT-4 1.8T MoE in 30 days instead of 90 days, you are talking about $351.8 million.
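Here is a minimal sketch of that arithmetic, using the p5.48xlarge prices quoted above; the $124.5 million figure lines up with the estimated one-year reserved rate of $57.63 per node-hour:

```python
# Back-of-the-envelope training cost on AWS p5.48xlarge (8x H100 per node),
# using the hourly prices quoted above.
ON_DEMAND_PER_NODE_HR = 98.32   # USD per node-hour, on demand
RESERVED_PER_NODE_HR  = 57.63   # USD per node-hour, one-year reserved (estimate)

nodes = 1_000                   # 8,000 H100 GPUs
hours = 90 * 24                 # 90-day training run

reserved_cost  = nodes * hours * RESERVED_PER_NODE_HR   # ~$124.5 million
on_demand_cost = nodes * hours * ON_DEMAND_PER_NODE_HR  # ~$212.4 million

# Microsoft's assumed 5.8 percent price advantage over AWS
microsoft_cost = reserved_cost * (1 - 0.058)             # ~$117.3 million

print(f"AWS reserved:  ${reserved_cost / 1e6:.1f} M")
print(f"AWS on demand: ${on_demand_cost / 1e6:.1f} M")
print(f"Azure (est.):  ${microsoft_cost / 1e6:.1f} M")
```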

This is the cost of one model, trained one time. OpenAI and Anthropic, and others, need to train many LLMs at large scale over a long period, and they need to test new ideas and new algorithms. At these prices, $4 billion only covers the cost of training about three dozen 2 trillion parameter LLMs over a 90-day period each. Over time, as Anthropic's customer base grows and as Amazon's use of the Claude LLMs grows, inference will become a larger part of that budget.
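As a quick sanity check on that "three dozen" figure, here is a sketch that treats each 2 trillion parameter model as costing roughly the same as the $124.5 million 90-day run computed above:

```python
# How many ~90-day training runs does $4 billion buy at the reserved rate above?
# Assumption (ours): each run costs roughly the $124.5M computed for GPT-4 1.8T MoE.
investment = 4e9            # Amazon's total pledge to Anthropic, USD
cost_per_run = 124.5e6      # one 90-day run on 1,000 reserved p5.48xlarge nodes

runs = investment / cost_per_run
print(f"~{runs:.0f} training runs")   # ~32, in the ballpark of three dozen
```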

Partners with deep pockets

You can see why anyone training generative AI at scale needs a partner with lots of hardware and deep pockets. This explains why Google, Microsoft, AWS, and Meta Platforms will dominate the cutting edge of models, and why Cerebras Systems needs a deep-pocketed partner (G42) to help it build $1 billion worth of systems to validate its architecture. As you can see, $1 billion or $4 billion or $13 billion does not go very far in the cloud. And that is why Microsoft and AWS are investing in OpenAI and in Anthropic.

Here is something else to consider: it only costs about $1.2 billion to own a machine with 24,000 H100 GPUs, so if you train the GPT-4 1.8T MoE model four times in the cloud, you might as well buy the hardware yourself. Oh wait, you can't. Hyperscalers and cloud builders have a stranglehold on GPU supply; they are at the front of Nvidia's GPU allocation queue, and you certainly are not.

If you do the math on the cost of a 24,000-GPU H100 system, amortized over one year, that works out to about $46 per eight-GPU HGX H100 node per hour to own it. Of course, this hourly price does not include the cost of the data center around the machine, the power and cooling for that machine, the management of the cluster, or the system software needed to prepare it to run LLMs.
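A minimal sketch of that ownership math, under the assumption that the $1.2 billion buys 3,000 eight-GPU nodes and is amortized straight-line over one year (the assumption that makes the ~$46 figure work out):

```python
# Ownership cost per HGX H100 node-hour.
# Assumptions (ours): $1.2B buys 24,000 H100s = 3,000 eight-GPU nodes,
# amortized straight-line over one year; excludes datacenter, power,
# cooling, cluster management, and system software.
system_cost = 1.2e9
nodes = 24_000 // 8            # 3,000 eight-GPU nodes
hours_per_year = 365 * 24      # 8,760 hours

cost_per_node_hour = system_cost / nodes / hours_per_year
print(f"~${cost_per_node_hour:.2f} per node-hour")   # ~$45.66, call it $46
```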

Blackwell enters the scene

Now, with the "Blackwell" GPUs recently launched, Huang told CNBC that these devices will cost between $30,000 and $40,000, which we assume is an Nvidia list price, not the street price, which will certainly be higher. We think the H100's list price was about $21,500 and that the higher-end B200 will offer about 2.5 times the performance at the same precision. That is 2.5X the firepower for 1.9X the price, which is just a 26 percent improvement in price/performance at the same precision on the tensor cores. Obviously, if you switch from FP8 to FP4 precision for inference, that becomes a 63 percent price/performance increase.
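Here is that price/performance arithmetic as a sketch, taking the top of Huang's quoted range ($40,000) against our assumed $21,500 H100 list price:

```python
# Blackwell vs Hopper price/performance, using the figures above.
# Assumptions (ours): B200 at the top of Huang's $30K-$40K range,
# H100 list price ~$21,500, and FP4 doubling FP8 throughput.
h100_price, b200_price = 21_500, 40_000
perf_same_precision = 2.5                      # B200 vs H100, same precision
perf_fp4_vs_fp8 = 2 * perf_same_precision      # 5X when dropping to FP4

price_ratio = b200_price / h100_price          # ~1.86X, call it 1.9X

# Improvement in cost per unit of performance
same_precision_gain = 1 - price_ratio / perf_same_precision   # ~26%
fp4_gain            = 1 - price_ratio / perf_fp4_vs_fp8       # ~63%

print(f"Price ratio: {price_ratio:.2f}X")
print(f"Same-precision price/perf gain: {same_precision_gain:.0%}")
print(f"FP4-vs-FP8 price/perf gain: {fp4_gain:.0%}")
```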

All of this explains why cloud builders and dominant AI startups are not-so-strange bedfellows. As part of the agreement, Anthropic will port its Claude LLM family to the Trainium and Inferentia custom ASICs, which AWS designed for internal use and for cloud customers, and which AWS claims can reduce the cost of running generative AI by 50 percent or more compared to GPUs. This will clearly help drive the adoption of generative AI in the Bedrock service, which just went into production in October 2023 and now has over 10,000 customers. It also gives parent company Amazon an LLM – Claude 3 and later generations – and a hardware platform that it controls and can use to add GenAI to many of its applications and lines of business.

Strategic AI Partnerships

As we think about cloud builders and their AI LLM startup partners, we wonder if there isn’t another kind of financial back-and-forth going on.

If we were OpenAI or Anthropic, not only would we want to license our LLMs to Microsoft and AWS, but we would also ask for a revenue sharing agreement as GenAI functions are added to their applications. This can be difficult to manage and police, of course, but we think that ultimately token generation will have to be counted so that revenue can scale with usage. We do not think that OpenAI or Anthropic would be so foolish as to give their cloud partners cheap, perpetual licenses to their LLMs. But then again, with GPU hardware so hard to get at scale, maybe they did not have a choice in these special cases.
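To make the idea concrete, here is a minimal sketch of what usage-metered revenue sharing could look like; the per-token rate and revenue-share percentage are invented for illustration and do not reflect any real agreement:

```python
# Hypothetical usage-metered revenue share between a cloud and an LLM maker.
# All figures are invented for illustration; nothing here reflects real terms.
PRICE_PER_1K_TOKENS = 0.015   # USD charged to the end customer (assumed)
LLM_MAKER_SHARE = 0.20        # fraction of token revenue shared back (assumed)

def monthly_revenue_share(tokens_generated: int) -> float:
    """Return the LLM maker's cut of one month's token revenue."""
    revenue = tokens_generated / 1_000 * PRICE_PER_1K_TOKENS
    return revenue * LLM_MAKER_SHARE

# Example: 500 billion tokens generated across the cloud's GenAI features
print(f"${monthly_revenue_share(500_000_000_000):,.0f}")  # ~$1.5 million
```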

Source: Atrevida
