Opinion: AI tools like ChatGPT are built on mass copyright infringement

Zainab Choudhry is a startup founder who has worked in law, technology and media in New York and Toronto.

The old world, wherein only human minds had evolved to create stories, art, music and poetry, is no more.

We have now entered the era of generative artificial intelligence, a type of AI that can create new content based on datasets of existing content it has been “fed” and trained on. In this new era, generative AI developers are building the minds of these machines by training them on content created by humans of the past and present, from Shakespeare to Atwood, Caravaggio to Koons. Thus far, we have marvelled at the creations that generative AI tools such as ChatGPT have produced, but this use of AI raises crucial ethical and legal questions.

It takes enormous amounts of data to train a generative AI program like ChatGPT, and in order to build these tools cheaply and quickly, developers are committing mass copyright infringement. These datasets are largely created by combing and scraping the internet for every type of content, from articles, books and artwork to our photos and tweets. These methods give rise to some big questions: Is the use of our copyright-protected content for training generative AI models legal? Does the use of copyrighted content for training AI fall under fair-use exceptions in the United States and fair dealing in Canada? Do we have a right to compensation when our work is being fed to the machines?

As a former copyright startup founder equipped with a law degree and a long-standing career at the intersection of intellectual property (IP) law, media and tech, I know the rules broadly boil down to one central tenet: To use someone else’s original content, you must get their permission, barring some exceptions. In my opinion, using copyrighted content to train a generative AI, without permission, easily falls under copyright infringement. If you train a generative AI model on the content of a particular painter or poet’s work, or even a singer’s voice, the AI can do a pretty good job of replicating the exact content and style of those paintings, poems or vocals in the new works it creates. At its lightning speed, generative AI can train on and write a new book based on an author’s work long before the human author ever could.

Clearly, we need legal protections for authors, artists and content creators who do not want their work to be used in training AI. For now, however, it is a Wild West. Generative AI developers are showing no regard for copyrighted content, nor are they seeking consent from authors and artists to use their content, and there has certainly been no compensation offered. As a slew of recent copyright-infringement lawsuits against generative AI developers has emerged, my hope is that the protections in our copyright laws are upheld. However, relying purely on precedent in this area is risky, with two major pitfalls. First, it takes a long time for lawsuits and appeals to cycle all the way to the Supreme Court. Second, our collective IP is at stake: A weak case can set us up for failure if a precedent is not established in favour of our intellectual property protections.

Now that we have been thrust into this new terrain, the best way to resolve any AI legal concerns is to develop new laws and regulations rather than interpreting existing ones. Our current copyright laws were simply not created with AI and its capabilities in mind. Time is of the essence, and trying to regulate the content generated by AI is like chopping off the heads of a hydra. Instead of playing catch-up, we need to regulate generative AI at the source: its training data. We need to protect our intellectual property, and we need to limit the ability of generative AI developers to freely plunder our minds without permission or compensation.

In many cases, people want generative AI – it can be a very helpful tool. But we can’t forget that generative AI needs human-generated content to train on – it is not a symbiotic relationship. Regulating AI training data by only allowing for the input of ethically and legally sourced content is imperative to a healthy technological future.

Generative AI is a genie that has granted us endless wishes in our quest for new content, but it’s also a genie that we can never put back in the bottle. We are in a new world now, and as nostalgic as we may be for the old way of creative output – the human way – there is no going back. But we still have an opportunity to properly regulate generative AI tools and the data they are trained on. Until then, I will continue to label my own work as copyright protected, while crossing my fingers that this designation remains relevant in the future.
