Reddit Blocks The Internet Archive From Crawling Its Data - Here's Why

ZDNET's key takeaways

  • The Internet Archive can now only crawl Reddit's homepage.
  • Reddit's goal is to block AI firms from scraping Reddit user data.
  • Publishers (and others) are suing AI companies for copyright infringement.

Reddit is defending its privacy from AI companies that are taking roundabout approaches to scraping its content.

The social media platform, known as a resource where users can post anonymously and find information about virtually any subject, will block the Internet Archive's Wayback Machine from indexing its online data, according to a Monday report from The Verge. The move is in response to the discovery that AI firms, unable to scrape data from Reddit directly due to the platform's prohibitive policies, have instead been retrieving its data from indexed content on the Internet Archive and using it to train models.

The Wayback Machine will now only be able to scrape data from Reddit's homepage, according to The Verge, while access to user profiles, comments, and post detail pages will be blocked.

Launched in 1996, the Internet Archive is a non-profit that operates an enormous digital database of web content. The archive is maintained in part by the Wayback Machine, a piece of web-crawling software that gathers web pages and preserves them as they appeared when they were collected, like digital flies in amber. This serves as a resource for researchers studying the evolution of online culture and as digital forensic evidence for law enforcement, among other uses.

What Reddit's move means

Reddit has previously flagged concerns related to the scraping of its content with the Internet Archive, according to The Verge. The non-profit was also reportedly notified before the web-crawling restrictions started going into effect yesterday.

The Internet Archive has yet to make an official statement about how it plans to respond to Reddit's new restrictions, and at the time of writing, it has not responded to ZDNET's request for comment. Wayback Machine director Mark Graham, however, has told multiple publications that the Internet Archive will "continue to have ongoing discussions about this matter" with Reddit.

Growing tension

Reddit's reported decision to block the Wayback Machine from scraping the majority of its content arrives during a moment of mounting tension between AI companies and digital publishers, though Reddit is the first tech company to wade into the debate. The company sued Anthropic in June after discovering that the AI company was illegally scraping its data, but it has also previously signed licensing deals with both Google and OpenAI.

(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

AI developers require access to gargantuan troves of information to train generative AI models, which are designed to identify and replicate subtle mathematical patterns gleaned from those training datasets.

Many of those companies have scraped training data from publicly available websites, including social media sites and news outlets, claiming legal immunity under a concept known in copyright law as fair use. (The courts are still untangling the legitimacy of that argument, and will likely be doing so for some time.)

Many of the organizations whose content has been copiously scraped -- along with a cohort of authors and other artists -- have responded with lawsuits.

Others, meanwhile, have signed content licensing agreements with the likes of OpenAI, Anthropic, and Google, consenting to the use of their organizations' data in exchange for increased visibility in the responses generated by chatbots, or other benefits.
