Abstract
Zero-shot voice conversion (VC) aims to alter the speaker identity in a voice to resemble that of the target speaker
using only a short reference utterance. While existing methods have achieved notable success in generating intelligible speech, balancing the trade-off between quality and similarity of the
converted voice remains a challenge, especially when using a short target reference. To address this, we propose ExVC, a zero-shot VC model that leverages mixture-of-experts (MoE)
layers and Conformer modules to enhance expressiveness and overall performance. Additionally, to efficiently condition the model on the speaker embedding, we employ feature-wise linear
modulation (FiLM), which modulates the network based on the input speaker embedding, thereby improving its ability to adapt to various unseen speakers. Objective and subjective evaluations
demonstrate that the proposed model outperforms the baseline models in terms of naturalness and quality. Audio samples are provided at: https://tksavy.github.io/exvc/.
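
As a rough illustration of the FiLM conditioning mentioned above, the following minimal NumPy sketch scales and shifts each feature channel using values predicted from a speaker embedding. It is not the ExVC implementation: the function name `film`, the single linear projections, and all dimensions are assumptions chosen only for illustration.

```python
import numpy as np


def film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Apply feature-wise linear modulation (illustrative sketch).

    features:          (T, C) frame-level hidden features
    speaker_embedding: (D,)   utterance-level speaker vector
    w_gamma, w_beta:   (D, C) hypothetical projection weights
    b_gamma, b_beta:   (C,)   hypothetical projection biases
    """
    # Predict one scale and one shift per channel from the speaker embedding.
    gamma = speaker_embedding @ w_gamma + b_gamma  # shape (C,)
    beta = speaker_embedding @ w_beta + b_beta     # shape (C,)
    # Broadcast over time: every frame gets the same per-channel modulation.
    return gamma * features + beta


# Toy dimensions (assumed): 5 frames, 4 channels, 3-dim speaker embedding.
rng = np.random.default_rng(0)
T, C, D = 5, 4, 3
features = rng.standard_normal((T, C))
speaker_embedding = rng.standard_normal(D)
w_gamma = rng.standard_normal((D, C))
w_beta = rng.standard_normal((D, C))
b_gamma = np.ones(C)   # start near identity scaling
b_beta = np.zeros(C)   # start near zero shift

modulated = film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta)

# With zero projection weights the layer reduces to the identity mapping.
identity = film(features, speaker_embedding,
                np.zeros((D, C)), b_gamma, np.zeros((D, C)), b_beta)
```

Because the scale and shift depend only on the speaker embedding, the same layer weights can be reused for any speaker, including ones not seen during training; only the embedding changes.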