AI’s Desperate Hunger For News Training Data Has Publishers Fighting Back. Here’s How.

As desperate AI companies face training data shortages, news organizations are finding new ways to fight back against the AI scraping of their content without permission (Jon Accarrino/MidJourney). This is the debut of TVNewsCheck’s new AI & TV column from veteran executive Jon Accarrino, founder of the media technology and AI strategy firm Ordo Digital.

The AI race has reached a new fever pitch, with companies like Google, Meta, Anthropic and OpenAI locked in an all-out sprint to develop advanced AI models. But as these companies burn through the internet’s high-quality training data at an astonishing rate, some AI companies are resorting to controversial tactics — including the alleged mass scraping of news articles.

Now, the journalism world is beginning to fight back against what some are calling the “largest theft in the United States.” Here’s where things stand in the escalating battle between Big Tech’s data-hungry AIs and a news industry determined to protect its content.

Throwing Up Walls Against Bots

Concerned news organizations, including Graham Media Group, The New York Times, The Guardian, Hearst, and Hubbard Broadcasting, have already blocked AI chatbots like OpenAI’s ChatGPT and Google’s Gemini from scraping their sites. That list keeps growing by the day.

Why the sudden alarm? Many publishers, analysts and press freedom advocates see the rise of AI scraping as an existential threat — not only to their business models, but to the fundamental integrity of journalism itself.

They worry that training chatbots on news articles, without oversight, could turbocharge the already challenging problems of misinformation and synthetic content online.


“It is clearly possible that some groups or organizations use and fine-tune models to create tailored disinformation that suits their projects or their purpose,” warned Vincent Berthier of Reporters Without Borders in an interview with VOA News.

Newsrooms Look For Ways To Defend Their Content Against AI Data Scraping

As media organizations confront the challenge of AI scraping, newsrooms have several ways to help defend their content:

Updating terms of service to ban AI scraping. Among the first steps many newsrooms are taking to protect their content is updating their terms of service to disallow AI scraping. In August 2023, The New York Times updated its terms of service and banned scraping its text, photos, audio and other content for machine learning purposes. While not a foolproof deterrent, strongly defined terms against unauthorized AI data scraping can provide important legal footing.

Blocking AI data scraping bots. A growing number of news sites now block web crawlers associated with AI chatbots in their robots.txt files. This approach has limitations: it requires constant monitoring for newly launched bots, and a crawler will only honor a paywall or robots.txt rules if it was programmed to do so. OpenAI, for example, ignored robots.txt rules until August 2023. Even so, blocking crawlers is a crucial first line of defense.
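For illustration, a minimal robots.txt sketch using the publicly documented user-agent strings for OpenAI’s GPTBot, Google’s AI training crawler (Google-Extended) and Common Crawl’s CCBot might look like this (an illustrative subset, not a complete bot list):

```
# robots.txt -- disallow known AI training crawlers.
# Note: the bot list must be kept current, and compliance is voluntary.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```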

Example of a website blocking OpenAI’s GPTBot.

Licensing Training Content To AI Companies. If you can’t beat ’em, charge ’em? Some news organizations are exploring licensing their content to AI companies for training data. Axel Springer, owner of Business Insider, sold access to its content to OpenAI. The Financial Times recently announced a similar deal. Also, startups like Dappier are building marketplaces for publishers to license access to their content for AI training purposes, but on the publishers’ terms. As Dappier explains it: “We’re doing what iTunes did to Napster with pirated music. Build a marketplace that makes it easier to buy data than steal it.” Dappier’s solution would allow both large and small AI companies to pull from an endless supply of content for training purposes from a large variety of sources.

News Orgs Are Creating Their Own LLMs. Newsrooms are also exploring ways to play offense by training their own LLMs on their high-value content archives. At TVNewsCheck’s Programming Everywhere panel last month on using AI for content production, Tegna CTO Kurt Rao discussed TegnaGPT, an internal AI tool available to Tegna employees. The Financial Times is testing its own LLM to leverage its archive for generating stories. Likewise, the Southeast Missourian is enabling its audiences to directly query its archive of local reporting and using AI to quickly add historical context to article updates. “These approaches both preserve and leverage the high value of a newsroom’s IP,” added Frank Mungeam, CIO of the Local Media Association. “It helps put AI to work for the benefit of both the audience and the publisher.”
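The “ask the archive” features described above all rest on a retrieval step: find the archive stories most relevant to a reader’s question, then hand them to a language model. Here is a minimal sketch of that retrieval step in Python; a production system would use embeddings and an LLM, and the archive entries and scoring here are purely illustrative stand-ins:

```python
# Minimal sketch of the retrieval step behind an "ask the archive" feature.
# Plain word overlap stands in for embedding-based search; all data below
# is invented for illustration.

def score(query, document):
    """Count distinct query words that appear in the document (case-insensitive)."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def retrieve(query, archive, k=1):
    """Return the k archive entries that best match the query."""
    ranked = sorted(archive, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]

archive = [
    "City council approves new downtown flood wall funding",
    "High school robotics team wins state championship",
    "Flood wall construction delayed by supply shortages",
]

print(retrieve("flood wall construction", archive, k=2))
```

The retrieved stories would then be passed to an LLM as context, keeping the model’s answers grounded in the newsroom’s own reporting.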

Turning To The Courts

News organizations whose intellectual property has already been scraped may need to take a different approach.

The New York Times and several other organizations have filed lawsuits accusing OpenAI and Google of illegally harvesting “massive amounts of personal data” to train their AI chatbots. The Times’ suit against OpenAI alleges that the company behind ChatGPT is “secretly stealing everything ever created and shared on the internet by hundreds of millions of Americans.” A similar complaint aimed at Google claims the search giant hoovered up content from subscription-based sites and even “websites known for pirated collections of books.”

Both cases seek to force the AI companies to implement stronger safeguards and to let people opt out of data collection before development continues. OpenAI and Google have pushed back, arguing their practices fall under fair use.

Exploring Options With Synthetic Data 

As more publishers put up barriers to web scraping, AI companies are hunting for alternative paths forward. One experimental strategy is to lean more heavily on so-called “synthetic data” — essentially using AI to generate its own feedstock for machine learning.
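To make the idea concrete, here is a toy illustration of synthetic data generation in Python. A real pipeline would use a generative model to write the examples; simple templates stand in for one here, and the topics and angles are invented:

```python
# Toy illustration of "synthetic data": programmatically generating
# training examples instead of scraping real articles. Templates stand
# in for the generative model a real pipeline would use.
import itertools

TOPICS = ["city budget", "school board", "road repairs"]
ANGLES = ["vote delayed", "funding approved", "residents respond"]

def synthetic_headlines():
    """Yield every topic/angle combination as a fake training headline."""
    for topic, angle in itertools.product(TOPICS, ANGLES):
        yield f"{topic.title()}: {angle}"

headlines = list(synthetic_headlines())
print(len(headlines))  # 3 topics x 3 angles = 9 examples
```

The skeptics’ point is visible even in this toy: the output can only recombine what was already put in, which is why critics doubt synthetic data can replace original reporting.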

If the thought of AI generating content to train AI how to write news content makes you raise an eyebrow, you aren’t alone. Many experts are skeptical that such an approach can create real value. Journalism, they argue, remains an irreplaceable resource for teaching AI about the real world.

Striking A Truce Through Collaboration

While courtroom battles over AI scraping are likely to escalate, a growing consensus believes the wiser path is collaboration. By proactively licensing access to their content as training data, either through direct deals or data marketplaces like Dappier, news organizations can safeguard their intellectual property, support high-quality AI development and even open up new badly needed revenue streams.

Ultimately, two things are becoming clear: AI needs journalism to thrive, and journalism needs to find a sustainable way to coexist with AI. The future of both could depend on it.

