Blocking the Internet Archive Won’t Stop AI Training — It Will Erase the Web’s Memory

Major publishers have begun blocking the Internet Archive’s crawlers, most prominently The New York Times, which has moved beyond traditional robots.txt controls to prevent the Wayback Machine from capturing and preserving its pages. The Internet Archive — operator of the Wayback Machine and the web’s largest digital library with more than a trillion archived pages — is a daily research tool for journalists, historians, courts, and the public. By cutting access to nonprofit archival crawlers, publishers risk dismantling the only consistent public record of how news appeared online, including edits, corrections, and retractions that otherwise disappear from the live web.
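For context, the "traditional robots.txt controls" mentioned above are a voluntary opt-out that well-behaved crawlers honor. The sketch below shows how a publisher might disallow archival crawling while leaving the rest of the web's crawlers unaffected; "ia_archiver" is the user-agent long associated with the Internet Archive's crawler, though this is an assumption and crawler names can change. Moving "beyond" robots.txt typically means enforcing blocks at the server or CDN level instead.

```
# robots.txt — a minimal sketch of the traditional, voluntary opt-out.
# "ia_archiver" is the user-agent commonly associated with the
# Internet Archive's crawler (an assumption; names can change).
User-agent: ia_archiver
Disallow: /

# All other crawlers remain permitted.
User-agent: *
Allow: /
```

Because compliance is voluntary, this mechanism distinguishes cooperative archival crawlers from scrapers that simply ignore the file, which is part of why publishers have escalated to harder technical blocks.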
Publishers say their actions respond to alarm over AI companies scraping news content to train large language models and other systems, and several news organizations have pursued litigation over the use of copyrighted material in AI training. Even if courts ultimately side with publishers, removing archival access from institutions that preserve history is a disproportionate response. The Internet Archive is not a commercial AI trainer; it functions more like a library making copies for preservation and discovery. Libraries and archives have long been treated differently in copyright law when their copying serves research, access, and public-interest purposes. The collateral damage of blanket technical blocks is the progressive loss of an evidentiary record that many depend on to understand how reporting and public discourse evolved.
Removing the Archive’s ability to preserve news creates long-term costs that extend far beyond the immediate dispute over AI training data. Journalists will lose a stable reference for verifying historical claims, researchers will face gaps in longitudinal studies of media, and courts may lack contemporaneous records of online publications. Rather than erecting technical barriers, publishers and archives should explore targeted agreements that protect commercial interests without erasing the public record.
Key implications
- Historical record at risk: Blocking archival crawlers removes the only independent snapshots of many news pages.
- Misplaced remedy: Technical blocks target nonprofit preservation, not commercial model builders.
- Legal and policy friction: Ongoing lawsuits over AI training could decide access norms, but interim archival loss is irreversible.
- Need for solutions: Negotiated access, selective embargoes, and legal clarifications could balance rights and preservation.

