StyleSinger: Style Transfer for Out-Of-Domain Singing Voice Synthesis

Abstract

Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, the endeavor to model the intricate nuances of singing voice styles is an arduous task, as singing voices possess a remarkable degree of expressiveness. Moreover, existing SVS methods encounter a decline in the quality of synthesized singing voices in OOD scenarios, as they rest upon the assumption that the target vocal attributes are discernible during the training phase. To overcome these challenges, we propose StyleSinger, the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. StyleSinger incorporates two critical approaches for enhanced effectiveness: 1) the Residual Style Adaptor (RSA) which employs a residual quantization module to capture diverse style characteristics in singing voices, and 2) the Uncertainty Modeling Layer Normalization (UMLN) to perturb the style attributes within the content representation during the training phase and thus improve the model generalization. Our extensive evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
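
As a rough illustration of the UMLN idea described above, the sketch below shows a style-conditional layer normalization whose style-derived scale and shift are perturbed with Gaussian noise during training, so the content representation cannot rely on fixed style statistics. The class name, the linear projections, and the batch-statistics noise scale are illustrative assumptions for this page, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class UncertaintyModelingLayerNorm(nn.Module):
    """Minimal sketch of UMLN (hypothetical implementation): a conditional
    layer norm whose style-derived scale/shift are perturbed with Gaussian
    noise during training, with the noise spread estimated from batch-level
    statistics. Details may differ from StyleSinger's actual module."""

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Hypothetical projections from the style embedding to per-channel
        # scale and shift.
        self.to_scale = nn.Linear(style_dim, hidden_dim)
        self.to_shift = nn.Linear(style_dim, hidden_dim)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: [B, T, H] content representation; style: [B, style_dim]
        x = self.norm(content)
        scale = self.to_scale(style)  # [B, H]
        shift = self.to_shift(style)  # [B, H]
        if self.training:
            # Uncertainty modeling: perturb the style statistics with noise
            # whose spread follows their batch-level standard deviation
            # (assumes batch size > 1).
            scale = scale + torch.randn_like(scale) * scale.std(dim=0, keepdim=True)
            shift = shift + torch.randn_like(shift) * shift.std(dim=0, keepdim=True)
        return x * (1.0 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Example usage:
#   umln = UncertaintyModelingLayerNorm(hidden_dim=256, style_dim=128)
#   out = umln(torch.randn(8, 100, 256), torch.randn(8, 128))
```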

Parallel Style Transfer Samples

In out-of-domain (OOD) scenarios, parallel style transfer keeps the content of the reference voice unchanged: each system synthesizes the same lyrics and notes in the style of the unseen reference.

1. Reference/Target: 风花雪月的诗句里我在年年的成长 AP

global style class: tenor, happy

Audio samples: Reference, Ref(vocoder), Styler, GenerSpeech, YourTTS, MS, RMSSinger, StyleSinger

2. Reference/Target: 为春风吹落 AP SP AP 只是简简单单的爱过我还

global style class: alto, sad

Audio samples: Reference, Ref(vocoder), Styler, GenerSpeech, YourTTS, MS, RMSSinger, StyleSinger

3. Reference/Target: 我还在逞强 AP 说着谎 AP 也没能力遮挡你去的方向

global style class: alto, sad

Audio samples: Reference, Ref(vocoder), Styler, GenerSpeech, YourTTS, MS, RMSSinger, StyleSinger

Non-Parallel Style Transfer Samples

In out-of-domain (OOD) scenarios, non-parallel style transfer pairs an unseen reference audio sample with different target notes and lyrics to synthesize the target singing voice.

1. Reference: 风花雪月的诗句里我在年年的成长 AP

global style class: tenor, happy

StyleSinger effectively captures the timbre, emotion, pitch transitions, vocal techniques, and delicate articulation skills of the reference.

Reference audio: Ref(vocoder)

Target: 吹向我脸庞 AP 想起你轻柔的话语 AP 曾打湿我眼眶 AP

Audio samples: Styler, GenerSpeech, YourTTS, MS, RMSSinger, StyleSinger

2. Reference: 为春风吹落 AP SP AP 只是简简单单的爱过我还

global style class: alto, sad

StyleSinger successfully transfers the timbre, emotion, vocal techniques, and the subtle elongation in articulation from the reference.

Reference audio: Ref(vocoder)

Target: 幸福没有那么容易 AP 才会特别让人着迷 AP 什么都不懂的年纪 AP

Audio samples: Styler, GenerSpeech, YourTTS, MS, RMSSinger, StyleSinger

3. Reference: 圈圈圆圆圈圈天天年年天天的我深深看你的脸 AP 生气的温柔埋怨的温柔 SP

global style class: alto, sad

StyleSinger successfully transfers the timbre, emotion, vocal techniques, vibrato, and the subtle elongation in articulation from the reference.

Reference audio: Ref(vocoder)

Target: 修炼爱情的心酸 AP 学会放好以前的渴望 AP 我们那些信仰要忘记多难

Audio samples: Styler, GenerSpeech, YourTTS, MS, RMSSinger, StyleSinger

Ablation Studies

We conduct ablation studies to demonstrate the efficacy of the designs incorporated in StyleSinger. UMLN and RSA denote the Uncertainty Modeling Layer Normalization and the Residual Style Adaptor, while Pitch and Decoder denote the pitch diffusion predictor and the diffusion decoder. VQ denotes replacing residual quantization (RQ) with plain vector quantization, and MSLN denotes replacing UMLN with Mix-Style Layer Normalization. In the sample labels below, w/o X denotes the model trained without component X.
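
To make the RQ-versus-VQ comparison concrete, the sketch below shows a residual quantizer in which each codebook quantizes the residual left by the previous one, so later layers capture progressively finer style detail; with num_layers=1 it degenerates to plain VQ. The codebook size, layer count, and straight-through estimator are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    """Minimal sketch of multi-layer residual quantization (hypothetical
    implementation). Each codebook quantizes what the previous layers
    missed; num_layers=1 reduces to plain VQ. Commitment/codebook losses
    are omitted for brevity."""

    def __init__(self, dim: int = 256, codebook_size: int = 512, num_layers: int = 4):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_layers)]
        )

    def forward(self, style: torch.Tensor):
        # style: [B, T, dim] frame-level style representation
        residual = style
        quantized = torch.zeros_like(style)
        indices = []
        for codebook in self.codebooks:
            # Nearest codeword for the current residual (batch-broadcast distances).
            dist = torch.cdist(residual, codebook.weight.unsqueeze(0))  # [B, T, K]
            idx = dist.argmin(dim=-1)                                   # [B, T]
            chosen = codebook(idx)                                      # [B, T, dim]
            quantized = quantized + chosen
            residual = residual - chosen
            indices.append(idx)
        # Straight-through estimator so gradients reach the style encoder.
        quantized = style + (quantized - style).detach()
        return quantized, indices
```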

1. Reference/Target: 而鲜血如红唇 AP 前朝记忆渡红尘 AP 伤人的不是刀刃 AP

global style class: alto, sad

Audio samples: Reference, Ref(vocoder), StyleSinger, w/o UMLN, w/o RSA, w/o Pitch, w/o Decoder, VQ, MSLN, MS, RMSSinger
