Microsoft’s New AI Makes Convincing Deepfakes Worryingly Easy

VASA-1's potential for misuse feels dangerous.

by Jackson Chen
Microsoft's VASA-1 AI deepfakes the Mona Lisa singing.
Microsoft

Here’s another terrifying look into the future of AI, courtesy of Microsoft.

Microsoft introduced VASA-1, a research project that can take a single image and an audio clip and transform them into a high-quality video of a talking head that looks eerily similar to the real thing. We have to stress that it's just a research project at the moment, meaning it isn't publicly accessible, but that doesn't make it any less disconcerting.

There are innocuous examples of VASA-1 at work, like infusing the Mona Lisa with Anne Hathaway's rap skills, but we're more concerned about the likelihood that it will be used to create deepfakes with a more nefarious purpose, like spreading misinformation or carrying out identity theft.

It’s unsettling how easy it is to churn out a video avatar with VASA-1.

Microsoft

A Simple Recipe

Microsoft explains that you simply upload an image and an audio recording, and VASA-1 spits out a 512 x 512 resolution video at up to 40 fps with barely any latency. Looking at the demos, VASA-1 does a convincing job of syncing the audio to the lip movements and can even convey emotions and expressions through subtle facial cues like eyebrow raises and head nods.

To fine-tune the result, VASA-1 lets you control where the generated avatar is looking, how close it appears to the camera, and the emotion you want it to convey. You can go with a standard neutral expression or inject some happiness, anger, or surprise into your AI-generated video.

VASA-1 can also handle out-of-the-ordinary source material, like paintings or singing audio. As convincing as these examples are, you can still spot slight irregularities, like some rippling around the ears or an unnatural warping effect during big head movements.

VASA-1 can even be tweaked to convey certain emotions like happiness, anger, and surprise.

Microsoft

Just a Taste of Version 1

As the name hints, VASA-1 is only the first model in Microsoft's overall VASA framework, meaning it could (and likely will) be improved upon. The initial example videos are only demonstrations of the research project's capabilities thus far, and again, there are no plans to put this into the public's hands yet.

“We have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations,” Microsoft noted on its website.

As concerning as this tech is, and while Microsoft does acknowledge its potential for misuse, the research team argues that there are plenty of upsides. For example, VASA-1 could be used to give everyone an equal opportunity at education, assist those with communication difficulties, or simply offer a friendly face to people who need one. Still, if we were placing bets, we'd lean toward tech of this caliber being used for the wrong purposes.
