Reddit’s CEO says Microsoft, Anthropic, and Perplexity scraping content is ‘a real pain in the ass’
Steve Huffman called out Microsoft, Anthropic, and Perplexity for using Reddit’s data to train their AI models without permission.
AI companies have been combing Reddit to train their models — and Reddit isn’t happy about it.
Steve Huffman, the CEO of Reddit, has called out Microsoft, Anthropic, and Perplexity for using Reddit’s data to train their AI models without paying.
“We’ve had Microsoft, Anthropic, and Perplexity act as though all of the content on the internet is free for them to use,” Huffman told The Verge, saying that blocking these companies has been “a real pain in the ass.”
AI companies use web crawlers — bots designed to download information from the internet — which Reddit has been trying to block, by changing its policies to stop companies that do not pay for collecting Reddit’s data, Bloomberg reported.
“When it was used for simple search, to create simple links that would send us traffic from search engines, that was fine,” Huffman told Bloomberg. “But now folks are using Reddit data for training, they’re reselling it, they’re doing search summaries instead of linking to us.”
Google is the only known major AI player to have an agreement with Reddit. The Alphabet-owned company signed a deal with Reddit’s data to train its AI model for $60 million annually, Reuters reported in February.
This isn’t the first time that Microsoft, Anthropic, and Perplexity have come under fire for training their models using data without permission.
In June, Anthropic and Microsoft-backed OpenAI were found to have violated a rule known as robots.txt, which denies web crawlers permission to access and collect content on certain websites.
While it is an unofficial rule, OpenAI and Anthropic have publicly stated that they respect robots.txt and do not collect data from websites that block their web crawlers. A spokeswoman for OpenAI declined to comment, while a spokesperson for Anthropic did not respond to emails seeking comment.
Microsoft’s head of search posted on X earlier this week that Reddit blocked Bing, “favoring another engine.” Microsoft did not immediately respond to requests for comment.
Perplexity also got in hot water for plagiarizing several news outlets. In June, Forbes accused Perplexity of “ripping off” several articles from various publications in its own AI-generated podcast and stories without properly attributing its sources.
The AI search engine was also found to have violated the robots.txt rule and was “paraphrasing WIRED stories, and at times summarizing stories inaccurately and with minimal attribution,” wrote Wired in a June investigation.
Perplexity did not respond to a request for comment from B-17
Debates about copyright and paying for data to train AI models have been rife among major AI players. Leading venture-capital firm Andreessen Horowitz said last year that paying for data would cost developers “tens or hundreds of billions of dollars a year in royalty payments” and put a dent in AI investments.
While Meta has also been skeptical about striking deals for data, the company has considered deals with news publishers to access news and media content, B-17 reported in May.
B-17 parent company Axel Springer inked a deal last year with OpenAI to use content from brands including B-17 and Politico to train its AI models.