The Internet Archive’s Wayback Machine is facing an awkward consequence of the AI boom: the same public record it has spent decades preserving is increasingly being treated by publishers as a potential source of training data.

According to Nieman Lab, 241 news sites in nine countries now block at least one of the Internet Archive’s four crawling bots, among them The New York Times and Reddit. The Guardian has taken a different approach: rather than blocking the crawlers outright, it limits what appears through the Archive’s interface and API, making archived versions harder for ordinary readers to find and use.

The shift reflects a broader fear among publishers that large language models are being trained on archived material without permission. Similar concerns about AI scraping have driven restrictions at other sites, and a separate U.S. court ruling against Anna’s Archive has underlined how aggressively copyright and anti-circumvention claims are now being tested in cases involving scraped digital material. The Internet Archive, for its part, says its systems are built for preservation and public access, not industrial-scale data harvesting.

Mark Graham, who directs the Wayback Machine, has argued that the Archive already uses controls to curb abuse and block large-scale extraction. In comments reported by Nieman Lab and elsewhere, he and other defenders of the Archive have warned that punishing preservation tools for the behaviour of AI firms risks damaging journalism, research and historical accountability instead.

That concern is not abstract. The Internet Archive’s backers point out that many publishers rely on it when their own content disappears, changes or is removed entirely. If major news organisations continue to withdraw access, the result could be a thinner historical record of the web, with fewer traces of events, reporting and public debate available to researchers and the public.

Source: Noah Wire Services