Artificial intelligence startup Anthropic is accused of aggressively harvesting data from websites to train its systems, potentially violating publishers’ terms of service in the process, according to people familiar with the matter.
AI developers rely on ingesting vast amounts of data from a wide range of sources to build large language models, the technology behind chatbots such as OpenAI’s ChatGPT and Anthropic’s rival Claude.
Anthropic was founded by a group of former OpenAI researchers with the promise of developing “responsible” AI systems.
But Freelancer.com CEO Matt Barry accused the San Francisco-based company of being the “most aggressive scraper ever” on the freelance portal, which receives millions of daily visitors.
Other web publishers have echoed Barry’s concerns that Anthropic is swarming their sites and ignoring their instructions to stop collecting content to train its models.
According to data obtained by the Financial Times, Freelancer.com received 3.5 million hits in a four-hour period from web “crawlers” linked to Anthropic. Barry said Anthropic has “probably about five times the volume of the next-largest AI crawler.”
He added that visits from the bot to Freelancer.com continued to grow even after the company tried to refuse its access requests using the standard web protocols that give instructions to crawlers. Barry then decided to block traffic from Anthropic’s internet addresses entirely.
“They don’t follow the rules of the internet, so we have to block them,” Barry said. “This is nasty scraping [which] slows down your site for everyone who interacts with it, which ultimately impacts your bottom line.”
Anthropic said it was investigating the matter, and that it respects publishers’ requests and tries not to “interfere or disrupt.”
Scraping publicly available data from across the web is generally legal, but the practice is controversial, may violate website terms of service, and may result in costs for site hosts.
Kyle Wiens, CEO of iFixit.com, said his electronics repair site received 1 million hits from Anthropic’s bots in a 24-hour period. “We’re getting a lot of warnings [for high traffic]. It’s waking people up at 3 a.m. It’s setting off every single alarm we have,” he said.
iFixit’s terms of service prohibit the use of its data for machine learning, Wiens said. “My first message to Anthropic is that if they’re using this to train models, that’s illegal. Secondly, this is not polite internet behavior. Crawling is a matter of etiquette.”
Websites use a file called “robots.txt” to tell crawlers and other web robots which parts of a site they should not access, but the protocol relies on voluntary compliance.
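To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might check robots.txt rules before fetching a page, using Python’s standard-library parser. The crawler name “ExampleAIBot”, the example.com URLs and the rules themselves are invented for illustration and do not come from any site named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only: it turns away one
# invented crawler entirely and asks all others to stay out of
# /private/ and to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set a crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A compliant crawler asks before each request and skips the page
# whenever the answer is False.
print(rules.can_fetch("ExampleAIBot", "https://example.com/guides/1"))
print(rules.can_fetch("OtherBot", "https://example.com/guides/1"))
print(rules.can_fetch("OtherBot", "https://example.com/private/x"))
print(rules.crawl_delay("OtherBot"))
```

Nothing in the file is enforced by the server. The check runs inside the crawler, so a bot that never performs it is only stopped by measures such as blocking its internet addresses.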
“We respect robots.txt, and when iFixit implemented it, our crawlers respected that signal,” Anthropic said. The company also said its crawlers respect “anti-circumvention techniques” such as CAPTCHAs, and that “our crawls must not be intrusive or disruptive. We aim to minimize disruption by carefully considering how quickly we crawl the same domains.”
Data scraping is not a new technique, but it has increased dramatically over the past two years as a result of the AI arms race, which has created new costs for websites.
“AI crawlers have cost us a lot of money in bandwidth fees and a lot of time in dealing with abuse,” Eric Holscher, co-founder of documentation hosting site Read the Docs, said in a blog post on Thursday. “AI crawlers are acting with disrespect for the sites they crawl and will spark a backlash against AI crawlers in general,” he added.
Anthropic has created some of the world’s most advanced chatbots, capable of responding to a range of prompts in natural language and comparable to OpenAI’s ChatGPT, and has positioned itself as a more ethical actor than some of its rivals. Anthropic’s stated purpose is to “responsibly develop and maintain advanced AI for the long-term benefit of humanity.”
As big AI companies race to create ever more capable models, they are pushing ever deeper into uncharted corners of the web, while also partnering with publishers and creating synthetic training data.
OpenAI has signed a number of deals with publishers and content providers in recent months, including Reddit, The Atlantic and The Financial Times. Anthropic has not publicly announced a similar partnership.
“Search engines have always done a lot of scraping,” Barry said, “but training generative AI has taken it to another level.”
iFixit’s mission is “to inform and inspire people to fix things themselves,” Wiens said. “We’re not opposed to them using our content to train their models. We just want to be part of the conversation.”
He added: “I’m not campaigning on this issue, I’m just trying to keep the website online.”