Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis

Songxiang Liu, Shan Yang, Dan Su, Dong Yu
Tencent AI Lab

Introduction

Cross-speaker style transfer (CSST) in neural text-to-speech (TTS) synthesis aims to transfer a speaking style to synthesised speech in a target speaker's voice. Most previous CSST methods require a reference utterance conveying the desired style, from which speaking style descriptors are extracted to condition the generation of a new sentence; this requirement hinders their practical use in real-world applications. This work presents Referee, a robust reference-free CSST approach for expressive TTS. A phonetic posteriorgram (PPG) together with phoneme-level pitch and energy contours are adopted as fine-grained speaking style descriptors, which are predicted from text by a text-to-style (T2S) model. We design a novel pretrain-refinement method to learn a robust T2S model by leveraging readily accessible low-quality speech data. A style-to-wave (S2W) model, trained on high-quality target-speaker data, effectively aggregates the style descriptors and generates high-fidelity speech in the target speaker's voice. Referee is then built by cascading the T2S model with the S2W model. Experimental results show that Referee outperforms a global-style-token (GST)-based baseline system in CSST.
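To make the cascade concrete, here is a minimal PyTorch-style sketch of how the two models could be wired together at inference time. The class, method, and argument names (Referee, t2s, s2w, synthesize, phoneme_ids) are illustrative assumptions for this page, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class Referee(nn.Module):
    """Sketch of the T2S -> S2W cascade described in the abstract.

    The T2S model maps phoneme IDs to fine-grained style descriptors
    (a PPG plus phoneme-level pitch and energy contours); the S2W model
    aggregates those descriptors into a waveform in the target voice.
    Module interfaces and tensor shapes are assumptions.
    """

    def __init__(self, t2s: nn.Module, s2w: nn.Module):
        super().__init__()
        # T2S: pretrained on readily accessible low-quality data,
        # then refined; predicts style descriptors from text.
        self.t2s = t2s
        # S2W: trained on high-quality target-speaker data; renders
        # the descriptors as high-fidelity speech in the target voice.
        self.s2w = s2w

    @torch.no_grad()
    def synthesize(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        ppg, pitch, energy = self.t2s(phoneme_ids)  # style descriptors
        return self.s2w(ppg, pitch, energy)         # output waveform
```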

Proposed Approach Overview

[Figure: overview of the proposed approach — the text-to-style (T2S) model cascaded with the style-to-wave (S2W) model]
1. Samples from the target female speaker

[Audio: samples 1, 2, 3, 4]

2. In-domain text

[Audio table; systems compared: Target Style, Style-GWG, Prop-1, Referee, GST-DurIAN]

3. Out-of-domain text

[Audio table; systems compared: Prop-1, Referee, GST-DurIAN]