The Internet Archive's Wayback Machine has long functioned as the web’s institutional memory, preserving pages that might otherwise disappear through deletion, redesign or pressure from powerful interests. In the Philippines, the archive helped preserve material linked to the controversy over articles about Senator Tito Sotto and Pepsi Paloma after those pages were taken down, underlining how easily the online record can be altered once a publisher removes content.
That role is now being strained by a dispute that has nothing to do with old links and everything to do with artificial intelligence. According to reporting by WIRED, a growing number of major news organisations are blocking the Internet Archive’s crawler because they fear archived material is being used by AI companies to train models without permission. The New York Times has said its archived content is being used in ways that violate copyright law and compete with its business.
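In practice, this kind of blocking is usually done through a site’s robots.txt file, which tells compliant crawlers which user agents may fetch which paths. A minimal illustrative entry is sketched below; the user-agent names ia_archiver and archive.org_bot are commonly reported identifiers for the archive’s crawlers, not details confirmed by the sources here.

    # Illustrative only: user-agent names are assumptions, not confirmed by the source
    User-agent: ia_archiver
    Disallow: /

    User-agent: archive.org_bot
    Disallow: /

A robots.txt rule is advisory rather than enforceable: it works only if the crawler chooses to comply, which is why some publishers reportedly pair it with server-side blocking.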
The scale of the shift is significant. Early this year, Nieman Lab found that 241 outlets across nine countries were restricting at least one of the archive’s bots, while other reports indicate the blocks now extend to major publishers including Gannett, the parent company of USA Today, as well as The New York Times, Reddit and, in more limited form, The Guardian. By early 2026, the Wayback Machine had passed one trillion archived pages, according to reporting cited by Rappler, yet that expanding library is increasingly being fenced off even as demand for historical web records remains strong.
The contradiction is sharpest where publishers themselves rely on the archive. USA Today has used the Wayback Machine in reporting on changes to US Immigration and Customs Enforcement detention statistics, even as its parent company blocks the archive from preserving the paper’s own pages. Mark Graham, who leads the Wayback Machine, told WIRED that publishers can benefit from the archive’s records while limiting access to them, describing the archive as collateral damage in the broader standoff between publishers and AI firms.
There is, however, a real reason publishers are nervous. A Washington Post analysis found that Internet Archive data has appeared in major AI training datasets, reinforcing fears that publicly archived pages can be repurposed downstream for commercial AI systems. Industry observers say the problem is not simply technical but economic: as search summaries and chatbots reshape how audiences reach news, publishers are looking for ways to stop their content from being reused without compensation.
Against that backdrop, digital-rights groups are urging newsrooms not to punish the archive for disputes it did not create. Fight for the Future, the Electronic Frontier Foundation and Public Knowledge backed an open letter praising the Internet Archive’s role in preserving the public record, saying it has helped keep citations alive on millions of Wikipedia entries and does not engage in paywall circumvention or irresponsible scraping. The EFF warned that if major publishers keep locking the archive out, large parts of the historical record may simply vanish, even as courts continue to sort out the separate legal fight over AI training.
Source Reference Map
Inspired by headline at: [1]
Sources by paragraph:
- Paragraph 1: [2], [3]
- Paragraph 2: [4], [5]
- Paragraph 3: [6], [7]
- Paragraph 4: [2], [6]
- Paragraph 5: [4], [5]
- Paragraph 6: [1], [6]
Source: Noah Wire Services