Firms like Meta and A16z admit having to pay billions for training data would ruin their generative-AI plans as they fight new copyright rules
- The US Copyright Office is considering new rules to deal directly with generative AI.
- Meta, Microsoft, OpenAI, and others with a stake in AI have pushed back hard against any changes.
- Paying for data would mean “tens or hundreds of billions” in yearly royalty fees, A16z said.
The world’s largest technology companies do not want to be forced to pay for the massive amounts of copyrighted data required to train the models at the heart of their generative artificial intelligence tools.
Companies such as Meta, Microsoft, Google, Apple, OpenAI, and Andreessen Horowitz were among the nearly 11,000 commenters during the ongoing comment period opened by the United States Copyright Office as it considers new rules for generative artificial intelligence. Other commenters included news organizations, media agencies, and industry professionals. Among several questions in its notice, the Copyright Office requested feedback on the development of a licensing regime or another process that would “remunerate copyright owners and/or creators for the use of their works in training AI models.”
The vast majority of tech companies appeared to believe that requiring them to pay for massive amounts of copyrighted content scraped from the internet and used to train large language models behind AI tools like Meta’s Llama, Google’s Bard, and OpenAI’s ChatGPT would make development impossible.
According to Meta’s comment, “generative AI models require not only a massive quantity of content, but also a large diversity of content.” The company continued: “Without a doubt, those responsible for AI development may reach agreements with specific rights holders in order to establish larger partnerships or simply to buy peace from the threat of legal action.” However, Meta argued, deals of that kind would cover only a small portion of the data AI developers need to train their models. Furthermore, it said, those working on AI would be unable to obtain licenses for other important categories of works.
Google, Microsoft, and OpenAI presented a similar line of reasoning, namely that the amount of data used to train their models is so massive that the companies will never be able to pay for it. None of the companies denied using material protected by intellectual property rights without the owners’ permission. Instead, they generally argued that copyrighted material posted on the internet is “publicly available” and thus free for anyone to use however they see fit. Using that data to train an LLM, according to the companies, qualifies as “fair use” and is legal under current copyright law.
Google called the process of using copyrighted material to train artificial intelligence tools like Bard “knowledge harvesting,” and it argued that current copyright law allows for such harvesting. Holding a developer like Google liable for the use of copyrighted material in training “would impose crushing liability on AI developers,” the company argued, adding that generative AI was about the “free flow of ideas.”
Furthermore, Andreessen Horowitz, also known as A16z, believes that the billions of dollars that it and other investors have poured into the AI craze should be reason enough not to create any new rules aimed at benefiting copyright holders.
This investment is “premised on the understanding that, under current copyright law, any copying necessary to extract statistical facts is permitted,” according to A16z. Undermining that understanding, according to the company, “will jeopardize future investment” in AI. Furthermore, it argued that any licensing regime for the use of copyrighted work in AI would be unworkable because of the potentially enormous amount of money that would be owed to content owners, particularly as AI becomes more common.
“Under any licensing framework that provided for more than negligible payment to individual rights holders,” A16z said, “AI developers would be liable for tens or hundreds of billions of dollars a year in royalty payments.”
Meanwhile, the majority of organizations and individuals who create the content used to train AI models, such as News Corp., Getty, and WME, as well as “Breaking Bad” creator Vince Gilligan, have advocated for updated copyright rules that would protect their work from AI tools and provide payment for its use.
Because copyright law does not address the issue, there is currently little that can be done to prevent copyrighted content from being crawled from the internet and used to create an LLM. Authors, visual artists, and even developers have started suing companies like OpenAI, Microsoft, and Meta, claiming that their original work was used without their permission to train the AI tools developed by those companies.