July 21, 2024

MediaBizNet


OpenAI, Anthropic ignore rule that prevents bots from scraping web content

The world’s two largest AI startups are ignoring media publishers’ requests to stop scraping their web content for free use as training data, Business Insider has learned.

It turns out that OpenAI and Anthropic are either ignoring or circumventing an established web rule, robots.txt, which is meant to prevent the automated scraping of websites.

TollBit, a startup that aims to broker paid licensing deals between publishers and AI companies, found that many AI companies were behaving this way and informed some major publishers in a letter on Friday, which Reuters reported earlier. The letter did not name the AI companies accused of circumventing the rule.

OpenAI and Anthropic have publicly stated that they respect robots.txt and the blocks websites place on their web crawlers, such as GPTBot and ClaudeBot.

However, according to TollBit’s findings, such blocks are not being respected as claimed. AI companies, including OpenAI and Anthropic, are choosing simply to “bypass” robots.txt in order to retrieve all of the content from a given website or page.

An OpenAI spokeswoman declined to comment beyond pointing BI to a company blog post from May, in which the company says it takes web crawler permissions “into account every time we train a new model.” An Anthropic spokesperson did not respond to emails seeking comment.

Robots.txt is a plain-text file that has been used since the late 1990s as a way for websites to tell automated crawlers that they do not want their data scraped and collected. It has been widely accepted as one of the unofficial rules supporting the web.
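To illustrate how this works in practice, here is a hypothetical robots.txt of the kind a publisher might serve to block the crawlers named in this article, checked with Python’s standard-library `urllib.robotparser` (the example site and rules are illustrative, not taken from any real publisher):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical publisher's robots.txt: block the AI crawlers
# named in this article, allow everything else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A crawler that honors robots.txt would check before fetching:
print(parser.can_fetch("GPTBot", "https://example.com/article"))    # False
print(parser.can_fetch("ClaudeBot", "https://example.com/article")) # False
print(parser.can_fetch("NewsBot", "https://example.com/article"))   # True
```

The key point is that compliance is entirely voluntary: the file only expresses the site owner’s wishes, and a crawler that skips this check can fetch the pages anyway, which is the “bypass” behavior TollBit describes.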


With the advent of generative AI, startups and technology companies are racing to build the most powerful AI models, and the key ingredient is high-quality data. The thirst for such training data has undermined robots.txt and the informal conventions that have supported its use.

OpenAI is behind the popular chatbot ChatGPT. The company’s largest investor is Microsoft. Anthropic is behind another relatively popular chatbot, Claude. Its largest investor is Amazon.

Both chatbots respond to user questions in a humanlike tone. Such answers are possible only because the AI models behind them were trained on vast amounts of written text and data pulled from the web, much of which is under copyright or otherwise owned by its creators.

Several tech companies argued last year before the US Copyright Office that nothing on the web should be considered subject to copyright when it comes to AI training data.

OpenAI has struck deals with some publishers for access to their content, including Axel Springer, which owns BI. The US Copyright Office is set to update its guidance on artificial intelligence and copyright later this year.

Are you a tech employee or someone else with a tip or insight to share? Contact Callie Hayes at [email protected] or on the secure messaging app Signal at +1-949-280-0267. Use a non-work device.