Sam Leith

Should AI be allowed to train itself off this column?


If you’re a writer, should AI companies be allowed to use your work to train their models without your permission? This is a matter of concern for many writers – as it is for artists, musicians, and anyone whose work is being harvested by the industry and spewed out as AI glop. It’s not just that it’s a bit galling to think that our work is being ripped off for free: it’s that, in the medium to long term, these large language models, if they can, will put writers out of jobs. (It’s Christmas – feel free to make the obvious jokes in the comments.)

At present, AI companies have shown few scruples in hoovering up any number of copyright works – be they books, paintings, or pieces of music – and using them to train their generative models. They do this – apparently on the basis that it’s better to ask forgiveness than permission – without offering any compensation to the people who made or hold rights in those works. Then they sell the product that their computers mulch out of all these inputs at considerable profit.

Lots of people, me included, think that that is wrong. We think that writers and artists are there to do more than simply provide training data for someone else’s money machine. We think that those of us who might be happy to supply training data to generative AI should be asked our permission first. And we think it’s a basic principle that a baker doesn’t get to make and sell cakes without paying the miller for flour and the farmer for butter and eggs.

The government, sensing a showdown brewing, has launched an online consultation ahead of putting its proposals on the issue before parliament. I urge anyone who relies on copyright for some or all of their living – that is, writers, journalists, publishers, artists, and so on – to take the time to fill the questionnaire in. It can be found at https://www.gov.uk/government/consultations/copyright-and-artificial-intelligence, and the consultation closes on 25 February, so you have two months.

You might need those two months. The consultation includes more than 50 questions – many of which invite a small essay – spread out over 17 different sections. It took me well over an hour to fill out – and that was at a gallop. Some of the questions are repetitive; some are somewhat leading; many ask the respondent to be more familiar than most can be expected to be with the ins and outs of the technical requirements of large language models or the vagaries of existing copyright exceptions.

‘What are the implications of the use of synthetic data to train AI models and how could this develop over time?’ one asks. ‘Does the current approach to liability in AI-generated outputs allow effective enforcement of copyright?’ asks another. ‘What steps can the government take to encourage AI developers to train their models in the UK and in accordance with UK law to ensure that the rights of right holders are respected?’ asks a third – wordy to the point of incoherence, and yoking two quite unconnected issues in a single sentence.

You feel a bit of a thickie answering ‘dunno’ so often – but does a writer now have to be an expert in IP law to be entitled to a view on whether someone else should be allowed to use their work without asking? This seems to be of a piece with the basic plan: to put the burden on writers and artists to contribute considerable unpaid labour before the protection of their intellectual property rights can be considered. I don’t think it’s too much to see it as an attempt to bore people into not bothering to respond.

It is clear that the government’s broadly preferred outcome is ‘opt-out’; i.e., unless you specifically refuse permission, big tech companies can use your work without either consulting you or paying you a red cent. That would be a big win for the goliaths of the tech world, and its attraction is obvious to a government desperate to attract them to our Silicon Roundabouts and Silicon Fens. But if it goes through – and this is why I think it should be a topic of interest to more than just those directly affected – it will be a pretty big shift in the way we think about property rights.

I don’t claim to have a position of absolute certainty on this when it comes to calls on the margins. Copyright in language and images is a complex area of law and of moral inquiry. Enforcing copyright too aggressively can be stifling to creativity just as failing to enforce it can be. Parody, quotation, allusion, and sampling in music: all these things exist in a realm of six-of-one-and-half-a-dozen-of-the-other.

Clearly, training an AI model on a chunk of text is not quite the same thing as republishing a copyright work directly for profit – which is what copyright laws were originally intended to prevent. I expect you could make the case, and some people will attempt to, that it’s more like allowing a computer to borrow a library book than printing a fresh copy, or that the contributions of any individual work to a generative AI’s output will be too tiny to deserve protection. You can complain, if you’re a tech bro, that it’s difficult and fiddly to clear rights with individual copyright holders, and that it will be too expensive to compensate them at the scale required. (These two objections feel a bit less convincing and can be filed, I think, under ‘Your problem’.)


But you can also counter that the computer does make a copy of the work in question in its own memory banks, and that in doing so to millions of works, and recombining them at scale for profit, it infringes the rights of those individual creators. And a million tiny infringements add up to one big one. You could point out, too, that though it doesn’t pose the direct short-term threat to a creator’s profit that, say, pirating a book or record does, it lays the groundwork for an existential threat in the medium term. You’re using the artist’s work to train a machine to do something very like what the artist does, so you can compete at a huge advantage in the marketplace with that very artist.

This is, certainly, a topic that deserves a very thorough and well-informed discussion – and in which legal, technical, and political expertise at a high level is required. But the drafting of complex law and technical regulations wants to be downstream of, and informed by, the basic moral intuition that the labourer is worthy of his hire. Writers and artists should not have to master and explain the technical details of enforcement, or get into the weeds of what the robots.txt protocol can and can’t do. Rather, we’d like to be able to say, simply: ‘No tech company should train AI models on copyright material without express permission, and it’s the job of lawmakers, rather than copyright holders, to ensure that they don’t.’ Put it this way: there aren’t very many other sorts of property rights that operate on an ‘opt-out’ rather than an ‘opt-in’ basis.
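For readers unfamiliar with it, robots.txt is the plain-text file a website places at its root to ask crawlers to stay away from some or all of its pages. An opt-out under the proposed regime would, in practice, look something like the sketch below (GPTBot and Google-Extended are the training-crawler tokens published by OpenAI and Google respectively; the full, ever-changing list a rights holder would need to track is precisely the sort of weeds this column objects to):

```text
# robots.txt at https://example.com/robots.txt
# Refuse AI-training crawlers while still allowing ordinary search indexing.

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
```

Note that compliance with robots.txt is voluntary: it is a convention, not an enforcement mechanism – which is rather the point.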
