TECHNOLOGICAL POSSIBILITIES FOR THE MANIPULATION OF TEXT, IMAGES, AUDIO AND VIDEO
Over the past two years, the term deepfake has become increasingly widespread. But what exactly are deepfakes, and how are they different from other manipulated content?
Although the first scientific AI-based experiments on video manipulation go back to the late 1990s, the general public only became aware of the technical possibilities towards the end of 2017.
The term itself was coined at this time by a Reddit user named “Deepfakes”, who, together with other members of the Reddit community “r/deepfakes”, published content they had created.
Unsurprisingly, in many cases, this has been used to make pornographic videos where the faces of the actresses are replaced by celebrities such as Scarlett Johansson or Taylor Swift. A more harmless example involved taking film scenes and replacing the face of each actor with Nicolas Cage.
How deepfakes work
Deepfakes (a portmanteau of deep learning and fake) are the product of two AI algorithms working together in a so-called Generative Adversarial Network (GAN). GANs are best described as a way of algorithmically generating new, synthetic data that resembles an existing data set.
For example, a GAN could analyse thousands of pictures of Donald Trump and then generate a new picture that is similar to the analysed images but not an exact copy of any of them. This technology can be applied to various types of content – images, moving images, sound and text. The term deepfake is primarily used for audio and video content.
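The adversarial training loop can be sketched in miniature. The toy example below is an illustrative sketch, not a production GAN: a one-parameter linear “generator” tries to fool a logistic “discriminator” into accepting its samples as draws from a one-dimensional “real” data distribution. All names, hyperparameters and the data distribution are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Generator: turns random noise z into a sample x = wg*z + bg.
wg, bg = 1.0, 0.0
# Discriminator: D(x) = sigmoid(wd*x + bd), the estimated probability that x is real.
wd, bd = 0.1, 0.0

lr, batch = 0.01, 64
real_mean, real_std = 4.0, 1.25   # toy "real data" distribution

for step in range(2000):
    real = rng.normal(real_mean, real_std, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = wg * z + bg

    # --- Discriminator update: push D(real) -> 1 and D(fake) -> 0 ---
    d_real, d_fake = sigmoid(wd * real + bd), sigmoid(wd * fake + bd)
    g_real, g_fake = -(1 - d_real), d_fake   # gradients of -log D(r) - log(1 - D(f)) w.r.t. the logits
    wd -= lr * np.mean(g_real * real + g_fake * fake)
    bd -= lr * np.mean(g_real + g_fake)

    # --- Generator update: push D(fake) -> 1, i.e. fool the discriminator ---
    d_fake = sigmoid(wd * fake + bd)
    g_logit = -(1 - d_fake)                  # gradient of -log D(fake)
    g_x = g_logit * wd                       # backpropagated through the discriminator
    wg -= lr * np.mean(g_x * z)
    bg -= lr * np.mean(g_x)

# After training, the generator produces new samples that were never in the data set.
samples = wg * rng.normal(0.0, 1.0, 1000) + bg
```

The same adversarial principle, scaled up to deep networks and image data, is what produces a “new” Trump picture that resembles the training photos without copying any of them.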
Today, only a few hundred pictures or audio recordings are required as training data to achieve credible results. For just under $3, anybody can order a fake video of a person of their choice, provided that they have at least 250 pictures of that person – but this is unlikely to be an obstacle for any person that uses Instagram or Facebook. Synthetic voice recordings can also be generated for just $10 per 50 words.
Deepfakes vs Cheapfakes
Although pornographic manipulations are undoubtedly the most common examples of deepfakes, they are not the primary motivation for the current societal debate. Interestingly, the video that sparked the debate was not a deepfake by any means, but simply a cheapfake (sometimes also called a shallowfake): a video of the speaker of the US House of Representatives, Nancy Pelosi, faked with very simple technical means. The recording was slowed to around 75% of its original speed, while the pitch was raised so that the voice still sounded natural. The result: viewers were given the plausible impression that Nancy Pelosi was drunk.
The video was shared millions of times on social media. This shows how even the simplest forgeries can distort reality and be exploited for political purposes. Historically, however, it was very difficult to falsify recordings so that the subject performs completely different movements or speaks completely different words than in the original video. Until now.
1.0 Manipulation of movement patterns
In 2018, an application by four Berkeley researchers attracted widespread attention, using artificial intelligence to transfer the dance routine of a source person (such as a professional dancer) to a target person. 2)
The movements are transferred from the source video to a “stick figure”. The neural network then synthesizes the target video according to the “stick figure movements”. The result is a “faked” video where a third person dances like a professional. Of course, this type of algorithm could be used not only to imitate dance movements, but potentially to generate any other form of movement. This opens the door to portraying political opponents in compromising situations: What would, for instance, be the ramifications of a video showing a politician performing a Nazi salute or even just giving the middle finger?
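The retargeting step behind the “stick figure” transfer can be illustrated with a small sketch. The function below is a simplified illustration, not the Berkeley system (which uses a neural network for the actual video synthesis); the function name, the keypoint layout and the numbers are all invented. It maps 2-D pose keypoints from the source dancer onto the target person’s body proportions by rescaling the skeleton and aligning the ankles with the target’s ground line.

```python
import numpy as np

def retarget_pose(src_kpts, src_ankle_y, src_head_y, tgt_ankle_y, tgt_head_y):
    """Scale source keypoints by the ratio of the two body heights, then
    translate them so the source's ankles land on the target's ground line.
    src_kpts is an (N, 2) array of (x, y) pixel coordinates."""
    scale = (tgt_ankle_y - tgt_head_y) / (src_ankle_y - src_head_y)
    out = src_kpts.astype(float).copy()
    out[:, 1] = (out[:, 1] - src_ankle_y) * scale + tgt_ankle_y
    # keep horizontal proportions consistent by scaling around the pose centre
    cx = out[:, 0].mean()
    out[:, 0] = (out[:, 0] - cx) * scale + cx
    return out

# A source dancer whose head is at y=100 and ankles at y=500 ...
src = np.array([[300.0, 100.0],   # head
                [300.0, 300.0],   # hips
                [280.0, 500.0],   # left ankle
                [320.0, 500.0]])  # right ankle
# ... mapped onto a shorter target person (head y=200, ankles y=520).
tgt = retarget_pose(src, src_ankle_y=500, src_head_y=100, tgt_ankle_y=520, tgt_head_y=200)
```

The neural network then only has to answer one question per frame: “what would the target person look like in exactly this pose?”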
2.0 Voice and facial expressions
Forgeries can have even further-reaching consequences by making individuals appear to speak words that were never said, accompanied by gestures, facial expressions and voice impressions that seem incredibly realistic. A series of such videos were created, including examples of Barack Obama and Mark Zuckerberg, not to deceive the audience, but to demonstrate the possibilities and risks of this technology. Since then, there has been an instance where a deepfake was created and distributed by a political party, the Belgian Socialistische Partij Anders (sp.a.).
In May 2018, the party posted a video on Facebook in which Trump mocked Belgium for observing the Paris climate agreement.3)
Despite obviously poor quality and unnatural mouth movements that should have roused the suspicion of any attentive viewer, the video triggered hundreds of comments, many of them expressing outrage that the American president would dare to meddle in Belgian climate policy. Here, too, the creators were trying to promote understanding of an issue: the video was a targeted provocation intended to draw people’s attention to an online petition calling for the Belgian government to take more urgent action on climate issues. But what if someone created a video in which Trump talked about a topic other than Belgian climate policy, for example his intent to attack Iran?
Artificial Neural Networks
Artificial Neural Networks (ANNs) are computer systems loosely inspired by the biological neural networks found in the brains of humans and animals. ANNs “learn” how to perform tasks based on examples without being programmed with any task-specific rules. They can, for example, learn to identify images containing cats by analysing sample images that have been manually labelled as “cat” or “no cat” and use the results to identify cats in other images.
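This learning-from-examples principle can be shown with a deliberately tiny sketch. The model below is a single artificial neuron (logistic regression, not a deep network) trained by gradient descent on four hand-made “images”, each reduced to two invented numeric features; the features, labels and hyperparameters are toy assumptions for illustration only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Four toy "images", each described by two made-up features
# (say, "pointy ears?" and "whiskers?"), with manual labels: 1 = cat, 0 = no cat.
X = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, 0.0])   # labelled by hand, as described in the text

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(2000):                 # "learning" = many small corrections
    p = sigmoid(X @ w + b)            # current predictions
    err = p - y                       # gradient of the cross-entropy loss
    w -= lr * (X.T @ err) / len(y)    # no task-specific rules, only examples
    b -= lr * err.mean()

pred = (sigmoid(X @ w + b) > 0.5).astype(int)
```

After training, the neuron has derived its own rule from the labelled examples and classifies all four correctly; no one ever told it what a “cat” is.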
3.0 Image manipulation: DeepNude and artificial faces
Image and text content are often not categorised as deepfakes, although they can be generated with very similar technology. There is a simple reason for this: both images and texts can be manipulated so easily without requiring complex technology that the “benefit” (or harm, depending on the perspective) of doing so is much smaller than for manipulations of audio and video content. Furthermore, video recordings are much more effective than text and static images at triggering emotions such as fear, anger or hate.
Nevertheless, some examples of AI-manipulated picture and text content have also attracted attention. As with videos, the primary purpose of image manipulation algorithms is to create fake pornographic content. Applications like DeepNude can convert a bikini photo into a very realistic nude image in a matter of seconds.
Unsurprisingly, the app only works with women (any attempt to select a male image simply generates female genitalia). But this makes each and every woman a potential victim of “revenge porn”, even if no real naked pictures ever existed.
These neural networks are not restricted to the manipulation of images of real people. They can also “create” completely new people – or at least completely new faces.
The commercial applications of this technology are obvious: image databases can be populated more cost-efficiently using AI rather than real people. But this also means that creating fake social media profiles, for example with the purpose of spreading political content, becomes significantly easier.
There have also been suspected espionage attempts using computer-generated profile pictures, for example the LinkedIn profile of one “Katie Jones”, an alleged researcher at a US think tank.
Before expert analysis identified several visual anomalies suggesting that the image was synthetic, the profile successfully connected with 52 political figures in Washington, including a deputy assistant secretary of state, a senior adviser to a senator and a prominent economist.4)
The account was quickly removed by LinkedIn but is thought to have belonged to a network of phantom profiles, some of which may still exist, that could be used for phishing attacks.
4.0 AI-generated texts
The applications described above become particularly effective in combination with AI-driven text generation. Many people may already have heard of this possibility thanks to the GPT-2 text generator created by the research company OpenAI. Due to the potential for abuse, GPT-2 was originally considered too dangerous to be made available to the general public.5) The company later decided to publish GPT-2 in several stages, since its creators have so far been unable to find any clear evidence of misuse.6)
Even if no misuse has yet been observed, the creators acknowledge that people largely find text generated by GPT-2 credible, that the generator could be fine-tuned to produce extremist content, and that identifying generated text is challenging. With the “Talk to Transformer” application, anybody can try out GPT-2 for themselves.
When a user enters one or more sentences, the generator outputs a piece of text beginning with the submitted input. The results are often – but not always – surprisingly coherent. They strike the same tone as the input and simulate credibility by inventing experts, statistics and quotes.
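The underlying autoregressive principle (predict what comes next given the text so far, append it, repeat) can be illustrated with a deliberately simple character-level model. The sketch below is a Markov chain, not a neural network like GPT-2; the training text and all names are invented for illustration.

```python
import random
from collections import defaultdict

ORDER = 3  # how many characters of context the model sees

def train(text, order=ORDER):
    """Record which character follows each `order`-character context."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, prompt, length=60, order=ORDER, seed=0):
    """Autoregressive loop: sample the next character, append it, repeat."""
    rng = random.Random(seed)
    out = prompt
    for _ in range(length):
        continuations = model.get(out[-order:])
        if not continuations:
            break  # this context was never seen during training
        out += rng.choice(continuations)
    return out

corpus = "the quick brown fox jumps over the lazy dog. " * 10
model = train(corpus)
continuation = generate(model, "the ")
```

GPT-2 replaces the lookup table with a large transformer network and predicts whole subword tokens instead of single characters, but the generation loop is the same: the output always begins with the submitted input and is continued one prediction at a time.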