In late March 2024, the State of Tennessee enacted the Ensuring Likeness, Voice, and Image Security Act of 2024 (the ELVIS Act), which is designed to protect performers from misuse of their voice and image, including by preventing the creation of ‘soundalikes’ and ‘deepfakes’ using AI. It is the first of its kind in the USA, and there is no equivalent legislation in Australia.

Just how good is AI at audio generation?

While generative AI’s ability to produce whole texts, startling images and short videos from simple prompts has captured most public attention, Stanford University’s 2024 AI Index says that “2023 marked a significant year in the field of audio generation, which involves creating synthetic audio content, ranging from human speech to music files.”

The step-up in AI audio generation capability is exemplified by UniAudio, a language-model-style technique for creating audio content. Borrowing its architecture from large language models, UniAudio uniformly tokenizes all audio types and uses next-token prediction to generate high-quality audio. The model has over 1 billion parameters and was trained on 165,000 hours of audio.
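
To make that architecture concrete, the toy sketch below (in Python with PyTorch, and not UniAudio’s actual code) shows the two ingredients described above: a crude quantiser that turns a waveform into discrete tokens, and a small transformer trained by next-token prediction over those tokens. The vocabulary size, context length, model dimensions and the quantiser itself are illustrative assumptions only; real systems use a learned neural audio codec and models orders of magnitude larger.

    # Toy illustration of "audio as tokens" + next-token prediction.
    # Everything here (sizes, quantiser, model) is a simplified stand-in,
    # not the real UniAudio system.
    import torch
    import torch.nn as nn

    VOCAB_SIZE = 1024   # assumed codebook size for quantised audio tokens
    CONTEXT = 256       # assumed context window, in tokens

    def quantise(waveform: torch.Tensor, levels: int = VOCAB_SIZE) -> torch.Tensor:
        """Crudely map samples in [-1, 1] to integer tokens (stand-in for a neural codec)."""
        clipped = waveform.clamp(-1.0, 1.0)
        return ((clipped + 1.0) / 2.0 * (levels - 1)).long()

    class ToyAudioLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB_SIZE, 128)
            layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(128, VOCAB_SIZE)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # Causal mask: each position may only attend to earlier tokens.
            mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
            hidden = self.encoder(self.embed(tokens), mask=mask)
            return self.head(hidden)  # logits over the next audio token

    # Training objective: shift the token sequence by one and minimise
    # cross-entropy, the same next-token-prediction framing used for text LLMs.
    model = ToyAudioLM()
    tokens = quantise(torch.rand(1, CONTEXT) * 2 - 1)  # random stand-in for a short waveform
    logits = model(tokens[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1)
    )

Generation then works by sampling one token at a time from the model’s predicted distribution and decoding the resulting token sequence back into a waveform.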

Text-to-audio generation has become so good that the 2024 AI Index raises concerns about the potential for voice generation to impact the political process, by making it significantly easier to create large-scale disinformation using a range of deepfake methods including face-swapping, lip-syncing and voice conversion. The report illustrates the risk with the following real-world example:

Shortly before the election, a contentious audio clip emerged on Facebook purportedly capturing Michal Šimečka, the leader of the Progressive Slovakia party, and journalist Monika Tódová from the newspaper Denník N discussing illicit election strategies, including acquiring voters from the Roma community. The authenticity of the audio was immediately challenged by Šimečka and Denník N. An independent fact-checking team suggested that AI manipulation was likely at play. Because the clip was released during a pre-election quiet period, when media and politicians’ commentary is restricted, the clip’s dissemination was not easily contested. Ultimately, the affected party, Progressive Slovakia, lost by a slim margin to SMER, one of the opposition parties.

The 2024 AI Index notes research showing that humans are not very good at detecting deepfake audio:

  • Listeners only correctly detected deepfakes 73% of the time.

  • Training humans to detect deepfakes only helps slightly.

  • Listening to clips more frequently does not aid detection.

  • Spending more time does not improve detection.

  • AI seems to be no better than humans at detecting deepfake audio. US broadcaster National Public Radio (NPR) fed a mix of real on-air news stories and audio fakes into three leading detection AIs. While one detection AI got all but three correct, another model got 25% incorrect and the other got about half wrong, failing to catch most of the AI-generated clips. NPR has published some real and fake news clips - see if you can tell the difference here.

Just to make things more complicated, the 2024 AI Index notes that growing scepticism about ‘audio authenticity’, fuelled by the spread of audio deepfakes, also gives unscrupulous politicians a “liar’s dividend”: the ability to dismiss damaging audio clips as fabrications.

The ELVIS Act

A number of US jurisdictions are taking action against AI imitating a real person’s voice.

The ELVIS Act in fact modifies an existing piece of Tennessee legislation, the Personal Rights Protection Act of 1984 (the PRP Act). The PRP Act was introduced several years after the death of Elvis Presley and was intended to extend his publicity rights by prohibiting the use of an individual’s name, image, or photograph “for the purposes of advertising”. It did not include protection for an individual’s voice.

The ELVIS Act expands the PRP Act by including voice in the regime (to specifically address soundalikes), extending its scope beyond advertising alone, and adding a form of secondary liability for platforms that disseminate infringing works, as well as for AI and tech companies who make available technology designed primarily to create soundalikes and deepfakes.

The amendments that the ELVIS Act introduces are summarised as follows:

  • Every individual has a property right in the use of their name, photograph, voice, or likeness in any medium in any manner;

  • ‘Voice’ is defined as a sound in a medium that is readily identifiable and attributable to a particular individual, regardless of whether the sound contains the actual voice or a simulation of the voice of the individual;

  • Any person who uses an individual’s name, photograph, voice or likeness in any medium, without the individual’s consent, will be liable for breach; and

  • A person will also be liable where they:

    • Publish, perform, distribute, transmit, or otherwise make available to the public an individual’s voice or likeness, with knowledge that such use was not authorized by the individual; or

    • Distribute, transmit, or otherwise make available an algorithm, software, tool, or other technology, service, or device, where the primary purpose or function is the production of an individual’s photograph, voice, or likeness with knowledge that doing so was not authorised by the individual.

The last two provisions appear to be aimed, firstly, at online platforms which allow the communication of the work to the public (including intermediaries such as social media companies) and, secondly, at companies that create or make available technology whose primary purpose or function is making soundalikes and/or deepfakes.

In addition to the obvious misleading nature of soundalikes, record companies are particularly concerned about the large number of soundalike recordings being made available via platforms such as Spotify, because they dilute the pool of royalties being paid to legitimate artists’ recordings on those platforms.

In other regulatory interventions against soundalikes in the US:

  • The Federal Communications Commission has banned AI-generated robocalls.

  • The Federal Trade Commission, in an innovative response for government, is offering a $25,000 prize for the best solution to tackle voice cloning scams. The FTC observed in the prize announcement that voice cloning was not only a scourge of celebrities and politicians but also of ordinary folk:

While voice cloning technology holds out hope for some people - for example, those who have lost their voices to accident or illness - the FTC has called attention to the ways that fraudsters are adding AI to their arsenal of artifice. You’ve probably heard about family emergency scams where a person gets a call supposedly from a panicked relative who’s been jailed or hospitalized and needs money immediately. Until recently, scammers had to come up with excuses for why the voice might not sound quite right. But enter artificial intelligence and the crook on the other end could use voice cloning technology to impersonate the family member.

Australia

In Australia, unlike the USA, there is no express protection for soundalikes under the Copyright Act 1968 (Cth). This is because, for a copy of a sound recording or film to infringe copyright, the case law has established that it needs to be a facsimile copy of the original recording or film (or a facsimile copy of a substantial part of it).

Similarly, individuals in Australia do not have an express right of publicity or protection of likeness.

Whilst it may be possible for copyright owners and individuals to bring an action for misleading or deceptive conduct under the Australian Consumer Law (ACL) or in the common law tort of passing off (and there have been some successful actions where, for example, images of a high-profile individual have been used in advertising without authorisation), there are some challenges to bringing such an action. These include that claims under the ACL depend on the conduct being in ‘trade or commerce’, and the difficulties that can arise in proving the necessary misrepresentations to the standard required by the Courts and to the relevant audience.

The ELVIS Act, on the other hand, avoids these issues and will have more general application.

Conclusion

Given the advances in AI voice generation and other deepfake technologies, and the disarming simplicity of the ELVIS Act as an example, it may be an appropriate time for further consultation between government and industry on legislative solutions to some of the challenges posed by soundalikes, likenesses and deepfakes in Australia and other jurisdictions.

But even if we are able to rein in soundalikes using legislative tools along the lines of the ELVIS Act, sophisticated AI-generated text-to-audio content is still likely to have a transformative impact on our public life. India’s election, currently underway, may show what’s to come when AI audio generation is deployed in political campaigns by ‘genuine’ players. The New York Times reports that in the current Indian election campaign politicians are sending videos by WhatsApp in which their AI avatars deliver highly personalised messages to individual voters (“Hello Akash, it’s nice to meet you..”) about the government benefits they have received, and asking for their vote. With about 5 minutes of video and audio material from a political client, the AI developer, using a mix of open-source software, developer code and 6-8 hours of ‘tweaking’, is able to build an avatar which convincingly co-ordinates gaze, facial expression and voice (see here).

*This is a summary of a longer article for the Communications Law Bulletin published by the Communications and Media Law Association (CAMLA).

 

Read more: Using AI to detect AI-generated deepfakes can work for audio - but not always