Freelancer has accused Anthropic, the AI startup behind the Claude large language model, of scraping data from its website in disregard of its "no-crawl" robots.txt protocol. Meanwhile, iFixit CEO Kyle Wiens said Anthropic ignored the site's policies prohibiting the use of its content to train AI models. Freelancer CEO Matt Barrie told The Information that Anthropic's ClaudeBot is the "most aggressive scraper we've seen." His website reportedly received 3.5 million hits from the company's crawler within four hours, "probably about five times more than the next-largest AI crawler." Similarly, Wiens posted on X/Twitter that Anthropic's bot hit iFixit's servers a million times in 24 hours. "Not only are you taking our content without paying for it, you are tying up our development resources," he wrote.
In June, Wired accused another AI company, Perplexity, of crawling its website despite the presence of the Robots Exclusion Protocol (robots.txt). A robots.txt file typically contains instructions telling web crawlers which pages they can and cannot access. Compliance is voluntary, but bad bots often ignore it. After the Wired article was published, TollBit, a startup that connects AI companies with content publishers, reported that Perplexity is not the only one circumventing robots.txt signals. While TollBit didn't name the companies involved, Business Insider reported that OpenAI and Anthropic were also ignoring the protocol.
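To illustrate how voluntary compliance works in practice, here is a minimal sketch of how a well-behaved crawler can honor robots.txt using Python's standard-library `urllib.robotparser`. The user-agent name and rules below are made up for illustration; they are not Anthropic's or any real crawler's values, and a real crawler would fetch the live file with `set_url()` and `read()` rather than parsing an in-memory policy.

```python
from urllib import robotparser

# Parse an in-memory robots.txt policy for a hypothetical "ExampleBot".
# A real crawler would instead call rp.set_url("https://example.com/robots.txt")
# followed by rp.read() to fetch the site's actual rules.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: ExampleBot",
    "Disallow: /private/",
])

# can_fetch() returns True only when the rules allow this agent on this path.
print(rp.can_fetch("ExampleBot", "https://example.com/guides/"))    # True
print(rp.can_fetch("ExampleBot", "https://example.com/private/x"))  # False
```

Nothing enforces this check: the parser only reports what the file asks for, which is exactly why compliance comes down to the crawler operator's choice.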
Barrie said Freelancer initially tried to refuse the bot's access requests, but eventually had to block Anthropic's crawlers altogether. "This is nasty scraping," he said. "It slows down the site for everyone who interacts with it, which ultimately impacts revenue." As for iFixit, the site sets alarms for high traffic, and Anthropic's activity woke staff up at 3 a.m. The company's crawlers stopped scraping iFixit after it added a line to its robots.txt file specifically banning Anthropic's bots.
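A robots.txt rule of the kind iFixit describes might look like the following. This is a hypothetical reconstruction: the article names Anthropic's crawler as ClaudeBot, but the exact contents of iFixit's file are not quoted in it.

```text
# Hypothetical robots.txt entry banning a specific crawler by user-agent.
# "Disallow: /" blocks the named agent from every path on the site.
User-agent: ClaudeBot
Disallow: /
```

Because the protocol is advisory, a rule like this only works when the crawler's operator chooses to respect it, which is what makes the reported violations notable.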
The AI startup told The Information that it respects robots.txt and that its crawlers "honored iFixit's signals when they implemented it." It also said it aims to be thoughtful about "how quickly [it crawls]" the same domains, and that it is currently investigating the incident.
AI companies use crawlers to collect content from websites and use it to train their generative AI models. As a result, they have been accused of copyright infringement by publishers and have been the target of multiple lawsuits. Companies such as OpenAI are signing deals with publishers and websites to avoid further lawsuits. So far, OpenAI's content partners include News Corp, Vox Media, the Financial Times, Reddit, and others. iFixit's Wiens also seems open to striking deals for articles on the how-to-repair website, saying in a tweet directed at Anthropic that he is willing to discuss licensing the content for commercial use.
If any of these requests had led you to our Terms of Use, you would have been informed that use of our content is expressly prohibited. But don't ask me, ask Claude.
If you would like to discuss licensing any of our content for commercial use, please contact us here.
— Kyle Wiens (@kwiens) July 24, 2024