Introduction
In this post, I will briefly describe how I trained a voice conversion model using this so-vits-svc 4.0 repository. As of now, so-vits-svc is a black box to me; I do not understand what happens under the hood. In the future, I might read the related research papers to understand the theory behind this project.
Raw dataset
The input should be a zip folder containing one folder per speaker, with each folder containing many audio files of around 10 seconds or less. The audio files should all be preprocessed so that there is no noise or background music; only the target speaker’s voice should remain. All audio files of a speaker should sum up to 1 to 2 hours of data in order to obtain a decent model. I used a dataset that was already prepared, from this HuggingFace dataset repository, which saved me a lot of time. In Azusa.zip, there is only one speaker, whom I named Azusa. FYI, Azusa in fact refers to a Chinese VTuber named 阿梓从小就很可爱.
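The dataset I used was already sliced into short clips, but if you start from longer cleaned recordings, a small script can cut them into roughly 10-second clips arranged in the per-speaker folder layout described above. Below is a minimal sketch assuming librosa and soundfile are installed; all folder and file names are placeholders, not part of so-vits-svc itself.

```python
import os
import librosa
import soundfile as sf

RAW_DIR = "raw_recordings/Azusa"   # long, already-cleaned recordings of one speaker
OUT_DIR = "dataset/Azusa"          # per-speaker folder that goes inside the zip
CLIP_SECONDS = 10                  # target length of each training clip

os.makedirs(OUT_DIR, exist_ok=True)

for name in sorted(os.listdir(RAW_DIR)):
    if not name.lower().endswith((".wav", ".flac", ".mp3")):
        continue
    # load as mono, keeping the file's original sample rate
    audio, sr = librosa.load(os.path.join(RAW_DIR, name), sr=None, mono=True)
    samples_per_clip = CLIP_SECONDS * sr
    base = os.path.splitext(name)[0]
    for i in range(0, len(audio), samples_per_clip):
        clip = audio[i:i + samples_per_clip]
        if len(clip) < sr:         # skip fragments shorter than one second
            continue
        out_name = f"{base}_{i // samples_per_clip:04d}.wav"
        sf.write(os.path.join(OUT_DIR, out_name), clip, sr)
```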
Pretrained models D_0 and G_0
From the same repository, there are D_0.pth and G_0.pth; these are the pretrained models that serve as the starting point of the training process. I did not use D_0.pth and G_0.pth from the repository above, though; I downloaded them from some random places, which soon showed me the significance of choosing the correct D_0 and G_0.
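One way to sanity-check downloaded checkpoints before training is to peek inside them. This is a rough sketch assuming the files follow the usual PyTorch/VITS-style layout (a dict with keys such as "model" and "iteration"); the exact keys and the logs/44k/ location are assumptions that may differ between sources and repository versions.

```python
import torch

# placeholder paths; point them at the downloaded pretrained checkpoints
for path in ["logs/44k/G_0.pth", "logs/44k/D_0.pth"]:
    ckpt = torch.load(path, map_location="cpu")
    print(path)
    print("  top-level keys:", list(ckpt.keys()))
    # VITS-style checkpoints usually record the training step under "iteration"
    print("  iteration:", ckpt.get("iteration", "n/a"))
    if "model" in ckpt:
        n_params = sum(p.numel() for p in ckpt["model"].values())
        print("  parameters:", n_params)
```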
For my first try, when I had the inferior D_0 and G_0, the output audio at 24800 steps is as follows:
I wasn’t particularly happy with this result: 阿梓’s voice quality sounds very bad even though I was already at around 150 epochs, and I was seeing diminishing progress. So, I obtained much better pretrained base models, which are essentially voice conversion models for a neutral female voice. At 4000 steps, it sounds like this:
Then, at 16000 steps, it sounds like this:
First of all, at 4000 steps, the new model already beats the old model in terms of clarity. Secondly, from 4000 steps to 16000 steps, our model’s voice slowly shifts away from the base model’s voice toward our desired target voice. So, it makes sense to choose a base model that already outputs a girl’s voice: since its voice is already very close to our target, it takes less time to obtain good results.
Audio for inference
After the model is trained well enough, we can use it for inference. In other words, we feed in a vocal-only audio file, and the output is an audio file of our target speaker saying/singing the same words at the same pitch. For example, the input audio for inference in the examples in the previous section is a segment from the song Nico by advantage Lucy:
However, it is painful trying to make the audio for inference clean. I use Ultimate Vocal Remover 5, or UVR5, a powerful AI vocal separation tool, to separate the music from the human voice. But songs often have multiple voices singing at the same time, or the vocals have already been processed with equalization. In those cases, good luck; I haven’t found a way to deal with them yet.
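Once a clean vocal file is ready, running inference is a single script invocation in so-vits-svc 4.0. The sketch below shows roughly how I understand that invocation works; the flag names, paths, and speaker name are assumptions that can vary between versions of the repository, so check inference_main.py's help output and the README before running it.

```python
import subprocess

# All paths, the speaker name, and the flags below are assumptions based on the
# so-vits-svc 4.0 README; verify them against your checkout before running.
subprocess.run([
    "python", "inference_main.py",
    "-m", "logs/44k/G_16000.pth",   # trained generator checkpoint
    "-c", "configs/config.json",    # config produced during preprocessing
    "-n", "nico_vocal.wav",         # clean vocal file placed in the raw/ folder
    "-t", "0",                      # pitch transpose in semitones
    "-s", "Azusa",                  # target speaker name from the dataset
], check=True)
```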
Final product
Here is 阿梓 singing advantage Lucy’s chic:
For reference, the original song is like this:
Conclusion
That was fun. Although I didn’t write much code, I learned a lot about audio processing. Furthermore, this may be the beginning of my study of AI in the field of audio processing; I might start by learning VITS, a text-to-speech technique.