In 2023, nine books by an unnamed writer were reportedly among approximately 180,000 titles unlawfully scraped from a pirate website into the Books3 dataset, which major tech companies have used to train large language models (LLMs). The issue gained renewed attention following revelations involving Meta, which allegedly accessed some 7 million books from the pirate site LibGen, where the author's work, including foreign editions, was also found.

The author says they have reached the unsettling conclusion that the copyright in their life's work has been undermined within just a few years, primarily through tech companies' AI training activities. They point to Alex Reisner's investigation published in The Atlantic, which suggests that employees at these firms knew their actions were illegal, with "deception preceding theft and... following theft."

Despite the distress these violations have caused, the author has sought out potential remedies. They engaged a licensing company and set up a record specifying that no books authored by them may be exploited for AI purposes, entering details for 216 titles and applying exclusions across two categories of AI use. Even so, the author described the task as daunting, "like spitting into a strong wind," given the collusion between companies and pirate sites that enabled such breaches in the first place.

The author reiterated a central concern: once content has been scraped to train an AI model, it cannot be "unlearned," which complicates any effort to reclaim rights because the material is already embedded in the models. They expressed frustration with the British government's recent proposal that training AI on copyrighted works should largely fall under "fair usage." Under the proposal, authors who want to keep their work out of AI training must "opt out" every edition in every territory, which could prove unmanageable and impractical for many.

Moreover, the proposed legislation appears to allow businesses to copyright derivatives produced from the training data, a point the author finds particularly troubling. In their view, this effectively endorses infringement of existing works while protecting the derivatives born of that infringement.

Instead of concentrating on new fiction, the author has shifted their focus to campaigning against the proposed legislation, emphasising its inequity and its potential repercussions for the literary community and cultural heritage. They contend that more than money is at stake: crucial elements of humanity's capacity for abstract thought and storytelling are also under threat.

In a positive development, the UK House of Lords has reportedly called for amendments to the legislation that would require AI crawlers to respect UK copyright law and to inform creators when their work has been scraped. Yet the author lamented that the debate seems centred predominantly on economic concerns rather than the intrinsic value of creative works.

Ultimately, the author expressed the belief that the stakes are higher than mere revenue, arguing for the preservation of cultural narratives that carry wisdom across generations, while also raising concerns about technology's broader societal implications for truth and knowledge.

Source: Noah Wire Services