April 20, 2024

Open Source AI Voice Cloning Arrives with MyShell OpenVoice

Startups, including the increasingly well-known ElevenLabs, have raised millions of dollars to develop their own proprietary algorithms and artificial intelligence software to create voice clones: audio programs that imitate users’ voices.

But a new solution is here: OpenVoice, developed by researchers at the Massachusetts Institute of Technology (MIT), Tsinghua University in Beijing, China, and members of Canadian AI startup MyShell, offers open source voice cloning that is nearly instantaneous and provides granular controls not found on other voice cloning platforms.

“Clone voices with unparalleled precision, with granular control of pitch, from emotion to accent, rhythm, pauses and intonation, using just a small audio clip,” MyShell wrote in a post today on its official company account on X.

The company also included a link to its previously reviewed research paper describing how it developed OpenVoice, and links to several places where users can access and test it, including the MyShell web application interface (which requires a user account to access) and HuggingFace (which can be accessed publicly without an account).

VentureBeat reached out via email to one of the lead researchers, Zengyi Qin of MIT and MyShell, who wrote to say: “MyShell wants to benefit the entire research community. OpenVoice is just a start. In the future, we will even provide grants, data sets, and computing power to support the open source research community. The core ethos of MyShell is ‘AI for everyone.’”

As for why MyShell started with an open source voice cloning AI model, Qin wrote: “Language, vision and voice are three main modalities of future Artificial General Intelligence (AGI). In the research field, although language and vision already have some good open source models, there is still a lack of a good model for voice, especially a powerful instant voice cloning model that allows everyone to customize the generated voice. So we decided to do this.”

Using OpenVoice

In my unscientific testing of the new voice cloning model on HuggingFace, I was able to generate a relatively convincing (if somewhat robotic) clone of my own voice in a matter of seconds, using completely random speech.

Unlike other voice cloning apps, I was not forced to read a specific piece of text in order for OpenVoice to clone my voice. I simply spoke extemporaneously for a few seconds and the model generated a voice clone that I was able to play almost immediately, reading the text I provided.

I was also able to adjust the “style” between several presets (happy, sad, friendly, angry, etc.) using a drop-down menu, and heard the noticeable change in tone to match these different emotions.

Here is a sample of my voice clone made by OpenVoice via HuggingFace set to the “friendly” style tone.

How OpenVoice was created

In their scientific paper, the four named creators of OpenVoice (Qin, Wenliang Zhao and Xumin Yu of Tsinghua University, and Xin Sun of MyShell) describe their approach to creating voice cloning AI.

OpenVoice comprises two different AI models: a text-to-speech (TTS) model and a “tone converter.”

The first model controls “style parameters and languages” and was trained with 30,000 sentences of “audio samples from two English speakers (with American and British accents), a Chinese speaker, and a Japanese speaker,” each labeled according to the emotion being expressed in them. It also learned intonation, rhythm, and pauses from these clips.

Meanwhile, the tone converter model was trained with more than 300,000 audio samples from more than 20,000 different speakers.

In both cases, human speech audio was converted into phonemes (specific sounds that differentiate words from each other) and represented using vector embeddings.
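To make the idea of phonemes represented as vector embeddings concrete, here is a minimal toy sketch. It is not taken from the OpenVoice code or paper: the two-word lexicon, the phoneme symbols, the embedding size and every function name are hypothetical, and the vectors are random where a real model would learn them during training.

```python
import random

# Hypothetical toy pronunciation lexicon (ARPAbet-style symbols).
# A real system would use a full grapheme-to-phoneme front end.
LEXICON = {
    "bat": ["B", "AE", "T"],
    "pat": ["P", "AE", "T"],
}

EMBED_DIM = 4  # assumed size, chosen only to keep the example small


def build_embedding_table(phonemes, dim=EMBED_DIM, seed=0):
    """Give each phoneme a fixed vector (random here, learned in a real model)."""
    rng = random.Random(seed)
    return {p: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for p in sorted(phonemes)}


def embed_word(word, table):
    """Turn a word into a sequence of phoneme vectors, one vector per phoneme."""
    return [table[p] for p in LEXICON[word]]


inventory = {p for phones in LEXICON.values() for p in phones}
table = build_embedding_table(inventory)
bat = embed_word("bat", table)
pat = embed_word("pat", table)
```

In this sketch, “bat” and “pat” share the vectors for their last two phonemes and differ only in the first, which is the sense in which phonemes differentiate words from each other.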

By using a “base speaker” for the TTS model and then combining it with the tone derived from the recorded audio provided by the user, the two models together can reproduce the user’s voice, as well as change the “tone color” or emotional expression of the text that is spoken. The OpenVoice team’s paper includes a diagram that illustrates how these two models work together.

The team notes that their approach is conceptually quite simple. Still, it works well and can clone voices using far fewer computing resources than other methods, including Meta’s rival AI voice cloning model, Voicebox.
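The two-stage split described above can be illustrated with a deliberately simplified sketch. This is not the OpenVoice implementation or its API: the `Frame` type, the scalar “tone color,” the averaging step and all function names are assumptions made up for illustration, standing in for neural networks that operate on real audio.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Frame:
    content: str       # what is said (a phoneme symbol)
    style: str         # how it is said (an emotion preset)
    tone_color: float  # who seems to be saying it (toy scalar)


BASE_SPEAKER_TONE = 0.0  # hypothetical tone color of the base speaker


def base_tts(phonemes, style):
    """Stage 1 (toy): render text in the base speaker's voice with a chosen style."""
    return [Frame(p, style, BASE_SPEAKER_TONE) for p in phonemes]


def extract_tone_color(reference_samples):
    """Toy stand-in for deriving a tone color from the user's short recording."""
    return fmean(reference_samples)


def convert_tone(frames, target_tone):
    """Stage 2 (toy): swap in the target tone color, leaving content and style alone."""
    return [Frame(f.content, f.style, target_tone) for f in frames]


def clone_voice(phonemes, style, reference_samples):
    """Chain the two stages: styled base speech first, then tone conversion."""
    target = extract_tone_color(reference_samples)
    return convert_tone(base_tts(phonemes, style), target)


cloned = clone_voice(["HH", "AY"], "friendly", [0.2, 0.4, 0.6])
```

The point of the sketch is the separation of concerns: style is fixed in the first stage and the speaker identity is applied in the second, so either one can be changed without touching the other.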

“We wanted to develop the most flexible instant voice cloning model to date,” Qin said in an email to VentureBeat. “Flexibility here means flexible control over styles/emotions/accent etc., and can be adapted to any language. No one could do this before because it is too difficult. I lead a group of experienced AI scientists and spent several months finding the solution. We found that there is a very elegant way to decouple the difficult task into some feasible subtasks to achieve what seems too difficult as a whole. The decoupled process is very efficient but also very simple.”

Who is behind OpenVoice?

MyShell, founded in 2023 in Calgary, Alberta, Canada, with a $5.6 million seed round led by INCE Capital with additional investments from Folius Ventures, Hashkey Capital, SevenX Ventures, TSVC and OP Crypto, now has more than 400,000 users, according to The SaaS News. I counted more than 61,000 users on its Discord server when I checked while writing this article.

The startup describes itself as a “comprehensive, decentralized platform for discovering, building, and staking native AI applications.”

In addition to offering OpenVoice, the company’s web app includes a host of different text-based AI characters and bots with different “personalities,” similar to Character.AI, including some that are NSFW. It also includes an animated GIF maker and user-generated text-based RPGs, some based on copyrighted properties such as the Harry Potter and Marvel franchises.

How does MyShell plan to make money while making OpenVoice open source? The company charges a monthly subscription to users of its web app, as well as to third-party bot creators who want to promote their products within the app. It also charges for AI training data.
