OpenAI and Anthropic are ignoring an established rule that prevents bots scraping online content

Sam Altman, CEO of OpenAI.

The world’s top two AI startups are ignoring requests by media publishers to stop scraping their web content for free model training data.

OpenAI and Anthropic have been found to be either ignoring or circumventing an established web rule, called robots.txt, that is meant to prevent automated scraping of websites.

TollBit, a startup aiming to broker paid licensing deals between publishers and AI companies, found several AI companies are acting in this way and informed certain large publishers in a Friday letter, which was reported earlier by Reuters. The letter did not include the names of any of the AI companies accused of skirting the rule.

OpenAI and Anthropic have stated publicly that they respect robots.txt and blocks to their specific web crawlers, GPTBot and ClaudeBot.

However, according to TollBit’s findings, such blocks are not being respected as claimed. AI companies, including OpenAI and Anthropic, are simply choosing to “bypass” robots.txt in order to retrieve or scrape all of the content from a given website or page.

Spokespeople for OpenAI and Anthropic didn’t respond to requests for comment on Friday.

Robots.txt is a short plain-text file that’s been used since the mid-1990s as a way for websites to tell bot crawlers they don’t want their data scraped and collected. Compliance is voluntary, but it was widely accepted as one of the unofficial rules supporting the web.
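For readers curious what that looks like in practice, here is a minimal sketch in Python of how a well-behaved crawler might check a robots.txt file before fetching a page. The rules, the example.com address and the "SomeOtherBot" name are hypothetical, chosen only for illustration; GPTBot and ClaudeBot are the crawler names mentioned above. This is not a description of how any company's crawler actually works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents a publisher might serve at
# https://example.com/robots.txt to opt out of two AI crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit this user agent to fetch the URL."""
    parser = RobotFileParser()
    # Parse the rules from memory instead of downloading them.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story.html"
    for bot in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        verdict = "allowed" if may_fetch(bot, page) else "blocked"
        print(f"{bot}: {verdict}")
```

Nothing in the file enforces the answer. A crawler that never runs a check like this, or that ignores the result, can still request the page, which is why the rule depends on voluntary compliance.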

With the rise of generative AI, startups and tech companies are racing to build the most powerful AI models. A key ingredient is high-quality data. The thirst for such training data has undermined robots.txt and the unofficial agreements supporting the use of this code.

OpenAI is behind the popular chatbot ChatGPT. The company’s largest investor is Microsoft. Anthropic is behind another relatively popular chatbot, Claude. Its largest investor is Amazon.

Both chatbots serve up answers to user questions in a humanlike tone. Such answers are only possible because the AI models they are built on were trained on massive amounts of written text and data scraped from the web, much of it under copyright or otherwise owned by creators.

Several tech companies last year argued to the US Copyright Office that nothing on the web should be considered under copyright when it comes to AI training data.

OpenAI has struck a few deals with publishers, including Axel Springer, for access to content. The US Copyright Office is set to update its guidance on AI and copyright later this year.