ExVC: Leveraging Mixture of Experts Models for Efficient Zero-shot Voice Conversion

Obed Irihose and Le Zhang

School of Information and Communication Engineering,
University of Electronic Science and Technology of China (UESTC), Chengdu, 611731, China

Abstract

Zero-shot voice conversion (VC) aims to alter the speaker identity in a voice to resemble that of a target speaker using only a short reference speech. While existing methods have achieved notable success in generating intelligible speech, balancing the trade-off between the quality and the speaker similarity of the converted voice remains a challenge, especially when only a short target reference is available. To address this, we propose ExVC, a zero-shot VC model that leverages mixture-of-experts (MoE) layers and Conformer modules to enhance expressiveness and overall performance. Additionally, to condition the model efficiently on the speaker embedding, we employ feature-wise linear modulation (FiLM), which modulates the network based on the input speaker embedding, thereby improving the ability to adapt to various unseen speakers. Objective and subjective evaluations demonstrate that the proposed model outperforms the baseline models in terms of naturalness and quality. Audio samples are provided at: https://tksavy.github.io/exvc/.
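The abstract describes FiLM conditioning: a small network maps the speaker embedding to per-channel scale (gamma) and shift (beta) parameters that modulate hidden features. The sketch below illustrates this mechanism in numpy; the shapes, the single linear projection, and all names are illustrative assumptions, not the authors' exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, speaker_emb, W, b):
    """Feature-wise linear modulation: features * gamma + beta, where
    (gamma, beta) are predicted from the speaker embedding.
    A single linear layer is assumed here for the conditioning network."""
    params = speaker_emb @ W + b            # shape: (2 * channels,)
    channels = features.shape[-1]
    gamma, beta = params[:channels], params[channels:]
    return features * gamma + beta          # broadcast over the time axis

# Illustrative dimensions: T time steps, C feature channels, E embedding dims.
T, C, E = 50, 8, 16
features = rng.standard_normal((T, C))      # hidden features of one utterance
speaker_emb = rng.standard_normal(E)        # speaker embedding from reference
W = rng.standard_normal((E, 2 * C)) * 0.1   # conditioning projection (assumed)
b = np.zeros(2 * C)

out = film(features, speaker_emb, W, b)
print(out.shape)  # (50, 8)
```

Because gamma and beta depend on the reference speaker, the same feature extractor can be steered toward different unseen speakers without retraining.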

Code is available here.

Audio Files

[Audio sample table. Columns: Index, Source, Reference, YourTTS, FreeVC, kNN-VC, Ours (w/o MoE), ExVC.]

Experiment on the impact of reference speech duration, with the source speech duration kept constant at 3 seconds.

Audio Files

[Audio sample table. Columns: Duration, Source, Reference, YourTTS, FreeVC, kNN-VC, Ours (w/o MoE), ExVC.]