Microsoft’s VALL-E can Clone any Human Voice in Just 3 Seconds


Yes, you heard it right!

Now you can clone a voice from a sample just 3 seconds long with the help of VALL-E, Microsoft’s new AI-powered text-to-speech (TTS) model.


I publish videos on YouTube and also run online courses. As a content creator, I have to work with both audio and video. Sometimes the two need to be synced, and syncing them is tough and time-consuming.

I am always looking for a tool that makes audio and video syncing easy. Hopefully, Microsoft VALL-E is the solution. However, Microsoft has not made the model public because of the ethical issues it raises, even though it could be applied in many beneficial ways.

What is Microsoft’s VALL-E?

VALL-E is an AI-based text-to-speech model developed by researchers at Microsoft.

Microsoft describes VALL-E as a “neural codec language model.” Compared to other voice generators available on the internet, VALL-E takes a new approach, which helps it attain far higher accuracy. One difference is that the TTS training data was expanded to 60,000 hours of English speech, which according to Microsoft is hundreds of times more than existing systems use.

The TTS system can now create “high-quality customized speech” using nothing more than a 3-second audio recording of any individual as an “acoustic prompt.”

How does VALL-E work?

The magic happens in the “Neural Codec Language Model”.



The speaker provides a 3-second recording as input to the model. An audio codec encoder converts this input audio stream into a coded signal.

Conditioned on the text to be spoken and on this acoustic prompt, the neural codec language model generates discrete audio codec codes for the new speech.

An audio codec decoder then converts these discrete codes into an audio waveform, which is the final output.
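
To make the encode, generate, decode flow easier to picture, here is a toy Python sketch. It is not Microsoft's code and contains no trained model: the function names, the 256-entry codebook, the toy sample rate, and the random "language model" stand-in are all assumptions made purely for illustration. The real system uses a learned neural audio codec and a trained language model in these roles.

```python
import math
import random

NUM_CODES = 256    # toy codebook size (an assumption, not VALL-E's real setting)
SAMPLE_RATE = 100  # toy sample rate so the example stays tiny (assumption)


def encode(waveform):
    """Toy 'codec encoder': map samples in [-1, 1] to integer codes."""
    codes = []
    for sample in waveform:
        sample = max(-1.0, min(1.0, sample))
        code = int((sample + 1.0) / 2.0 * NUM_CODES)
        codes.append(min(NUM_CODES - 1, code))
    return codes


def decode(codes):
    """Toy 'codec decoder': map each code back to the center of its bin."""
    return [((code + 0.5) / NUM_CODES) * 2.0 - 1.0 for code in codes]


def generate_codes(phonemes, prompt_codes, codes_per_phoneme=4, seed=0):
    """Hypothetical stand-in for the language model (NOT a trained model).

    It only shows the interface: text plus prompt codes go in, new codes
    come out. Here the new codes are simply sampled from the prompt's codes.
    """
    rng = random.Random(seed)
    count = len(phonemes) * codes_per_phoneme
    return [rng.choice(prompt_codes) for _ in range(count)]


def synthesize(phonemes, prompt_waveform):
    prompt_codes = encode(prompt_waveform)               # step 1: encode the prompt
    new_codes = generate_codes(phonemes, prompt_codes)   # step 2: generate codes
    return decode(new_codes)                             # step 3: decode to audio


# A fake "3-second recording": a sine wave at the toy sample rate.
prompt = [math.sin(2 * math.pi * 5 * t / SAMPLE_RATE)
          for t in range(3 * SAMPLE_RATE)]
audio = synthesize(["HH", "AH", "L", "OW"], prompt)
print(len(audio), "samples generated")
```

The point of the sketch is the shape of the pipeline: audio is turned into discrete codes, new codes are produced from the text and the prompt, and the codes are turned back into audio. Everything that makes the output sound like a real voice lives in the trained components that the stand-ins replace.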

Many speech samples are available on GitHub.

In combination with other generative AI models such as GPT-3, VALL-E directly enables speech synthesis applications, including zero-shot TTS, voice editing, and content creation.

VALL-E can also preserve the emotion of the speaker in the 3-second sample, so the generated speech carries the same emotional tone.

Wrapping Up

AI-based technology like VALL-E can have huge benefits. It can make content creators’ jobs very easy and quick.

But technology has a dark side too.

The same is the case with VALL-E. It can be misused in a variety of ways. For example, a person could spoof someone else’s voice and use it to generate foul language. If the model is generalized to unseen speakers in the real world, it should come with a process that guarantees the speaker has authorized the use of their voice, along with a model for detecting synthesized speech.


