Given spectrographic representations of the source and target speakers’ voices, the model learns to mimic the target speaker’s voice quality and style, regardless of the linguistic content of either speaker’s speech. It generates a synthetic spectrogram, from which the time-domain signal is reconstructed using the Griffin-Lim method.
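
As a rough illustration of that final reconstruction step, the following is a minimal NumPy sketch of Griffin-Lim phase retrieval. The frame size, hop size, Hann window, iteration count, test tone, and the helper names `stft`, `istft`, and `griffin_lim` are all assumptions made for illustration; the source does not state which settings were actually used.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Least-squares inverse of `stft` via windowed overlap-add."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        start = i * hop
        signal[start:start + n_fft] += frames[i] * window
        norm[start:start + n_fft] += window ** 2
    # Guard against division by zero where the window is exactly zero.
    return signal / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=50, n_fft=256, hop=64, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from random phase (an assumption; other initializations are possible).
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop)   # resynthesize
        spec = stft(signal, n_fft, hop)                 # re-analyze
        phase = np.exp(1j * np.angle(spec))             # keep phase, discard magnitude
    return istft(magnitude * phase, n_fft, hop)

# Example: rebuild a waveform from the magnitude spectrogram of a 440 Hz tone
# sampled at 8 kHz (both values are illustrative).
t = np.arange(4096) / 8000.0
target_mag = np.abs(stft(np.sin(2 * np.pi * 440 * t)))
waveform = griffin_lim(target_mag)
```

Each iteration keeps only the phase of the re-analyzed signal and restores the given magnitudes, so the spectrogram of the estimate moves toward the target magnitude. In practice a library routine such as `librosa.griffinlim` would normally be used instead of hand-written transforms.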

In the absence of mechanisms to isolate, quantify and measure these, generative approaches such as source-filter models within which explicit components of the source or filter can be modified, spectral-transformation models such as PSOLA, etc.

The basic GAN model

The Generative Adversarial Network (GAN) is, at its foundation, a generative model for a data variable.

These models differ from conventional generative models in a fundamental way: the manner in which they are learned.

Conventional generative models are trained through likelihood maximization criteria, which minimize some divergence measure between the synthetic distribution encoded by the generative model and the true distribution of the data.
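
The link between likelihood maximization and divergence minimization can be checked numerically. The sketch below uses a toy categorical model; the data, the two candidate models, and all function names are illustrative assumptions rather than anything taken from the source. It relies on the identity that the average log-likelihood of a model equals the negative entropy of the empirical distribution minus the KL divergence from the empirical distribution to the model.

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """KL(p || q) for categorical distributions given as dicts of probabilities."""
    return sum(pk * math.log(pk / q[k]) for k, pk in p.items() if pk > 0)

def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(pk * math.log(pk) for pk in p.values() if pk > 0)

def avg_log_likelihood(data, q):
    """Mean log-probability that model q assigns to the observed samples."""
    return sum(math.log(q[x]) for x in data) / len(data)

# Toy data set (illustrative only).
data = list("aaabbc")
empirical = {k: n / len(data) for k, n in Counter(data).items()}

# For a categorical model, the maximum-likelihood fit is the empirical frequencies.
mle_model = dict(empirical)
uniform_model = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

for name, q in [("mle", mle_model), ("uniform", uniform_model)]:
    print(name, avg_log_likelihood(data, q), kl_divergence(empirical, q))
```

Because the entropy term does not depend on the model, the candidate with the higher average log-likelihood is always the one with the lower KL divergence to the data distribution.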

Style Embedding Model

In addition to the discriminator that distinguishes generated data from real data, we add a second type of discriminator to our model. It further extracts the target style information from the input data and ensures that the generated data still has this style information embedded in it.
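
One way to picture how two discriminators shape the generator is as a sum of two penalty terms. The sketch below is a simplified assumption, not the loss used in the original work: it supposes the real/fake discriminator outputs a probability that a sample is real, the style discriminator outputs a probability per style class, and a hypothetical `style_weight` balances the two terms.

```python
import math

EPS = 1e-12  # numerical floor to avoid log(0)

def binary_cross_entropy(prob, target):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    prob = min(max(prob, EPS), 1 - EPS)
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

def cross_entropy(probs, label):
    """Cross-entropy of a predicted class distribution against one true label."""
    return -math.log(max(probs[label], EPS))

def generator_loss(p_real, style_probs, target_style, style_weight=1.0):
    """Hypothetical generator objective combining two discriminators.

    p_real:       real/fake discriminator's probability that the output is real
    style_probs:  style discriminator's class probabilities for the output
    target_style: index of the style the output is supposed to carry
    """
    adversarial = binary_cross_entropy(p_real, 1)      # look real
    style = cross_entropy(style_probs, target_style)   # carry the target style
    return adversarial + style_weight * style

# A sample that fools the real/fake discriminator but misses the target style
# is still penalized through the style term.
loss_wrong_style = generator_loss(0.9, [0.8, 0.2], target_style=1)
loss_right_style = generator_loss(0.9, [0.2, 0.8], target_style=1)
```

The design point is that realism alone does not minimize the objective: the generator is also pushed toward outputs that the style discriminator attributes to the target.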

Model implementation

The model architecture is that of the VoiceGAN described above.
