For our research into building reach on the internet, we are working with modern rendering techniques and machine learning for (automated) content creation, among other things. One aspect is generative AI that automatically produces finished videos with sound from minimal input (text). In the following, so-called deepfake videos of my boss Michael Mlynarski are created from text using free tools. This works locally on my computer, for example (an RTX-capable graphics card is required, i.e. a GeForce RTX 2060 or higher). The approach consists of the following steps:

For cloning his real voice, I followed a short online course [1] that teaches the use of the open-source ML framework SV2TTS with special adaptations for the German language [2-3] (commercial web alternatives include resemble.ai, play.ht and coqui.ai). The framework consists of three components: encoder, synthesizer and vocoder. For the vocoder, the standard model from [1], pre-trained on English, is used. The encoder and synthesizer are trained on a multi-speaker German data set, consisting of many short 5-30 s speech segments in *.wav format together with the text spoken in each segment. For this purpose, the data sets from M-AILABS [4], the HUI Audio Corpus German (clean version) [5] and Thorsten Voice [6] were merged and used together.
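The merging of the three corpora into one training folder can be sketched as a small script. This is a minimal sketch under assumed conventions: each corpus is treated as a directory of `*.wav` files with a same-named `*.txt` transcript next to each (the real corpora ship in different layouts and need their own parsers), and the function name `merge_corpora` is my own, not from the tools above.

```python
import csv
import shutil
from pathlib import Path

def merge_corpora(corpus_dirs, out_dir):
    """Copy wav/txt pairs from several corpora into one flat training folder.

    Assumed (hypothetical) layout: every corpus directory contains *.wav
    files with a same-named *.txt transcript next to each. File names are
    prefixed with the corpus name to avoid collisions, and a combined
    metadata.csv maps each wav to its transcript (LJSpeech-style index).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = []
    for corpus in map(Path, corpus_dirs):
        for wav in sorted(corpus.rglob("*.wav")):
            txt = wav.with_suffix(".txt")
            if not txt.exists():
                continue  # skip segments without a transcript
            new_name = f"{corpus.name}_{wav.stem}.wav"
            shutil.copy(wav, out / new_name)
            rows.append((new_name, txt.read_text(encoding="utf-8").strip()))
    with open(out / "metadata.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="|").writerows(rows)
    return len(rows)
```

Prefixing with the corpus name keeps speaker/segment IDs from the different data sets from colliding in the flat output folder.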
Since cloning from only a short speech sample (~5 seconds) did not yet yield a recognizable German voice, the synthesizer was additionally fine-tuned on the target speaker. A short data set of ~15 minutes of voice segments with matching text can be created with [7], for example, but the boss's calendar is always very full (and I'm too impatient).

We therefore used a recent podcast recording with the voice to be cloned. Many thanks go to Tobias Fleming for spontaneously creating short pre-annotated speech segments with OpenAI’s Whisper [8], which I processed a little with Audacity [9] to create a final 12.5-minute data set. Trained in this way, the model actually sounds like the boss:
Text to speech: “I think it’s pretty cheeky that you stole my voice. I mean, what’s next? Are you stealing my glamorous looks and making videos of me?” (Spoiler: yes)
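Whisper's transcription output contains a `segments` list with `start`, `end` and `text` fields, so cutting the podcast recording into 5-30 s training snippets from those timestamps can be sketched with the standard library alone. The function name and output layout here are my own choices, not part of Whisper:

```python
import wave
from pathlib import Path

def slice_segments(wav_path, segments, out_dir, min_s=1.0, max_s=30.0):
    """Cut a PCM wav into per-segment snippets plus transcript files.

    `segments` follows the shape of Whisper's result["segments"]: dicts
    with "start"/"end" in seconds and the spoken "text". Segments whose
    duration falls outside min_s..max_s are skipped.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(wav_path), "rb") as src:
        params = src.getparams()
        rate = params.framerate
        for i, seg in enumerate(segments):
            dur = seg["end"] - seg["start"]
            if not (min_s <= dur <= max_s):
                continue  # too short/long for SV2TTS training
            src.setpos(int(seg["start"] * rate))
            frames = src.readframes(int(dur * rate))
            clip = out / f"segment_{i:04d}.wav"
            with wave.open(str(clip), "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            clip.with_suffix(".txt").write_text(seg["text"].strip(),
                                                encoding="utf-8")
            written.append(clip)
    return written
```

The remaining Audacity pass is then mostly about trimming noise and fixing transcripts, not about the cutting itself.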
To create a video from the speech, a facial animation is first generated from the sound file using NVIDIA Audio2Face [10] and exported as *.usd for use with Unreal Engine 5 [12] MetaHumans [13], following the YouTube tutorial [11]. The choice of MetaHuman plays a minor role, as the video rendered with the Movie Render Queue [14] serves only as a driver for the head movement, which is then applied to a photo. For this last step, the Depth-Aware Generative Adversarial Network for Talking Head Video Generation (DaGAN, CVPR 2022) [15] was used. The result looks like this:
AI-Michael: “Hello, I am the clone of Michael, who has come to life thanks to the latest technology. I always strive to use my skills for the benefit of all and to provide help and advice. Just like Michael, I have a great passion for technology and innovation. My mission is to improve society’s understanding of artificial intelligence and thus make the world a better place.”
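DaGAN [15] is driven from the command line with a source photo and a driving video. As a sketch, the call could be assembled like this; note that the flag names and default paths below are assumptions on my part, modelled on the first-order-motion family of repos that DaGAN derives from, so check the repository's README before running:

```python
def dagan_demo_command(source_image, driving_video, result_video,
                       config="config/vox-adv-256.yaml",
                       checkpoint="path/to/checkpoint.pth.tar"):
    """Build the argv list for DaGAN's demo script.

    All flag names and the default config/checkpoint paths are
    assumptions based on first-order-motion style repos; verify them
    against https://github.com/harlanhong/CVPR2022-DaGAN before use.
    """
    return ["python", "demo.py",
            "--config", config,
            "--checkpoint", checkpoint,
            "--source_image", str(source_image),
            "--driving_video", str(driving_video),
            "--result_video", str(result_video),
            "--relative", "--adapt_scale"]
```

The list can then be passed to `subprocess.run(cmd, check=True)` from an orchestration script once the checkpoint has been downloaded.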
The photo can also be stylized, e.g. with Stable Diffusion (Inkpunk-Diffusion-v2):
The opportunities offered by the technology were immediately recognized by the management:
Fortunately, this technology is in best hands.
[1] https://www.udemy.com/course/voice-cloning/
[2] https://github.com/CorentinJ/Real-Time-Voice-Cloning
[3] https://github.com/padmalcom/Real-Time-Voice-Cloning-German
[4] https://github.com/imdatsolak/m-ailabs-dataset
[5] https://github.com/iisys-hof/HUI-Audio-Corpus-German
[6] https://www.thorsten-voice.de/
[7] https://github.com/padmalcom/ttsdatasetcreator
[8] https://github.com/openai/whisper
[9] https://www.audacityteam.org/
[10] https://www.nvidia.com/en-us/omniverse/apps/audio2face/
[11] https://www.youtube.com/watch?v=x9POZqGO5B0
[12] https://www.unrealengine.com/en-US/unreal-engine-5
[13] https://www.unrealengine.com/en-US/metahuman
[14] https://docs.unrealengine.com/5.1/en-US/render-cinematics-in-unreal-engine/
[15] https://github.com/harlanhong/CVPR2022-DaGAN
Would you like to discuss trends in corporate learning or agile leadership skills? Inez is here for you! Write to us at hello@qualityminds.de