The first major crack in the wall came in July 2023, when The Associated Press struck a deal allowing OpenAI to license part of its text archive. The agreement was modest in scope and its financial terms were not disclosed, but it marked an important shift: instead of relying only on scraped material and courtroom arguments, AI developers began cutting cheques to publishers whose content had helped train their systems. According to the AP, the partnership was framed as an effort to explore generative AI in news products while maintaining standards for responsible use.

That move did not happen in isolation. OpenAI went on to sign arrangements with Axel Springer, Condé Nast, News Corp, Hearst, The Atlantic, Reddit and others, while Google pursued similar publishing partnerships of its own. In December 2023, TechCrunch reported that OpenAI’s deal with Axel Springer covered both training and the inclusion of recent articles in ChatGPT, and in August 2024 OpenAI said it had reached another agreement with Condé Nast to surface stories from titles including Vogue, The New Yorker and Wired in ChatGPT and SearchGPT. Taken together, those agreements helped turn licensing into a fast-growing market rather than a one-off compromise.

The rush has been driven by legal uncertainty. More than a hundred copyright cases are now moving through US courts, and the central issue remains whether training models on scraped online material falls within fair use or amounts to infringement. For companies building large language models, the risk is not only theoretical: if a court later finds the underlying data was unlawfully used, the development history of a model can become a liability. That makes licensing look less like philanthropy and more like risk management.

The deals also serve purposes beyond copyright clearance. OpenAI’s arrangement with Reddit, for example, was not simply about text rights, because much of what appears on Reddit is owned by users rather than the platform itself. Licensing can also secure reliable API access and reduce the risk of contract disputes. At the same time, the rise of retrieval-augmented generation has complicated the picture further, because models increasingly pull from live web sources before answering prompts, creating fresh questions about copying at inference time rather than just during training.

Even so, the market remains uneven. The biggest publishers are able to negotiate, monitor usage and enforce terms, but independent writers, local outlets and smaller websites usually cannot. That leaves the new licensing economy looking less like a broad solution than a settlement among institutions with enough scale to participate. In the UK, Getty Images’ case against Stability AI has added another layer of uncertainty: Getty initially included claims that the system reproduced protected images in its outputs, before later dropping that part of the case. The result is a legal landscape in which training, retrieval and output are all being tested at once, while investors and dealmakers try to value AI businesses that may be built on contested data.


Source: Noah Wire Services