Improving Recognition-Synthesis Based Any-to-One Voice Conversion with Cyclic Training

Authors:

Yan-Nian Chen, Li-Juan Liu, Ya-Jun Hu, Yuan Jiang, Zhen-Hua Ling

Abstract:

In recognition-synthesis based any-to-one voice conversion (VC), an automatic speech recognition (ASR) model is employed to extract content-related features, and a synthesizer is built to predict the acoustic features of the target speaker from the content-related features of any source speaker at the conversion stage. Since source speakers are unknown at the training stage, the parameters of the synthesizer have to be estimated using the content-related features of the target speaker. This inconsistency between the conversion and training stages constrains the speaker similarity of converted speech. To address this issue, a cyclic training method is proposed in this paper. The method constructs pseudo-source acoustic features by converting the training data of the target speaker towards multiple speakers in a reference corpus. These pseudo-source acoustic features are then used as the input of the synthesizer at the training stage to predict the acoustic features of the target speaker, and a cyclic reconstruction loss is derived. Experimental results show that the proposed method achieves more consistent accuracy of acoustic feature prediction across various source speakers than the baseline method. It also achieves better similarity of converted speech, especially for pairs of source and target speakers with distant speaker characteristics.
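The training loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration only: the ASR content extractor, the conversion towards reference speakers, and the synthesizer are replaced by simple linear stand-ins, the function names are hypothetical, and a mean squared error is assumed for the cyclic reconstruction loss. It is not the paper's implementation.

```python
import numpy as np


def extract_content(acoustic, asr_weights):
    """Hypothetical stand-in for the ASR model: a fixed linear projection."""
    return acoustic @ asr_weights


def convert_to_reference(acoustic, ref_offset):
    """Hypothetical stand-in for converting target speech towards one
    reference speaker: shifts every frame by a speaker-specific offset."""
    return acoustic + ref_offset


def synthesize(content, synth_weights):
    """Hypothetical stand-in for the synthesizer: a linear map from
    content-related features to target acoustic features."""
    return content @ synth_weights


def cyclic_reconstruction_loss(target_acoustic, ref_offsets,
                               asr_weights, synth_weights):
    """Average MSE (assumed loss) between the original target features and
    the features predicted from each pseudo-source version of them."""
    losses = []
    for offset in ref_offsets:
        # 1. Build pseudo-source features from the target speaker's data.
        pseudo_source = convert_to_reference(target_acoustic, offset)
        # 2. Extract content-related features from the pseudo-source.
        content = extract_content(pseudo_source, asr_weights)
        # 3. Predict target acoustic features and compare with the originals.
        predicted = synthesize(content, synth_weights)
        losses.append(np.mean((predicted - target_acoustic) ** 2))
    return float(np.mean(losses))


# Toy usage: 5 frames of 4-dimensional features, two reference speakers.
rng = np.random.default_rng(0)
target = rng.normal(size=(5, 4))
offsets = [rng.normal(size=4), rng.normal(size=4)]
loss = cyclic_reconstruction_loss(target, offsets, np.eye(4), np.eye(4))
```

In an actual system this loss would be minimized with respect to the synthesizer parameters, so that the synthesizer learns to recover the target speaker's features from inputs that resemble other speakers rather than only from the target's own speech.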

Note:

Because the data of some source speakers used in the subjective evaluation of the paper are confidential, the source speakers presented here differ from those used in that evaluation.

 


1.1. Natural speech of the female target speaker (TF)

 

1.2. Comparison when the speaker characteristics of the source and target speakers are not distant, with two female and one male source speakers

 

source
baseline
proposed

 

1.3. Comparison when the speaker characteristics of the source and target speakers are distant, with three male source speakers

 

source
baseline
baseline+adv
proposed
proposed-adv

 


2.1. Natural speech of the male target speaker (TM)

 

2.2. Comparison when the speaker characteristics of the source and target speakers are not distant, with three male source speakers

 

source
baseline
proposed

 

2.3. Comparison when the speaker characteristics of the source and target speakers are distant, with two female and one male source speakers

 

source
baseline
baseline+adv
proposed
proposed-adv