Ever wish you could automatically dub foreign film dialogue into another tongue? Amazon’s on the case.
In a paper published this week on the preprint server arXiv.org, researchers from the tech giant detailed a novel speech-to-speech pipeline that taps AI to align translated speech with the original speech, fine-tune speech duration, and then add back background noise and reverberation.
As the paper’s coauthors note, automatic dubbing involves transcribing speech to text, translating that text into another language, and then generating speech from the translated text.
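The three-stage chain described above can be sketched as a simple function composition. This is a minimal illustration only: the function names and the toy lookup table are stand-ins, not Amazon's actual models or API.

```python
# Illustrative sketch of the automatic dubbing chain: ASR -> MT -> TTS.
# Every function here is a hypothetical stand-in for a trained model.

def transcribe(audio):
    """Stand-in ASR step: a real system would run a speech recognizer."""
    return "hello world"

def translate(text):
    """Stand-in MT step: a toy English-to-Italian lookup in place of a
    Transformer translation model."""
    lookup = {"hello world": "ciao mondo"}
    return lookup.get(text, text)

def synthesize(text):
    """Stand-in TTS step: returns a dummy waveform, one sample per character."""
    return [0.0] * len(text)

def dub(audio):
    """Compose the three stages into a speech-to-speech pipeline."""
    return synthesize(translate(transcribe(audio)))
```

Each stage would be replaced by a trained model in practice; the point is only the order of operations.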
It comprises several parts, including a Transformer-based machine translation component trained on over 150 million English-Italian pairs and a prosodic alignment module that computes the relative match in duration between speech segments while measuring the linguistic plausibility of pauses and breaks.
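One way to score the relative duration match the alignment module computes is a per-segment ratio averaged over the utterance. This is a hedged sketch of the general idea, not the paper's exact scoring function.

```python
def duration_match(src_durations, tgt_durations):
    """Average relative duration agreement between aligned segments.

    src_durations, tgt_durations: lists of segment lengths in seconds,
    paired by alignment. Returns a score in (0, 1]; 1.0 means every
    translated segment lasts exactly as long as its source segment.
    This scoring rule is illustrative, not the paper's formula.
    """
    scores = []
    for s, t in zip(src_durations, tgt_durations):
        scores.append(min(s, t) / max(s, t))  # ratio is symmetric in s and t
    return sum(scores) / len(scores)
```

A translation whose segments run twice as long as the original would score 0.5 on those segments, signaling that the dubbed speech needs duration fine-tuning.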
A model in the text-to-speech phase trained on 47 hours of speech recordings generates a context sequence from text that’s fed into a pretrained vocoder, which converts the sequence into a speech waveform.
To make the dubbed speech sound more “real” and similar to the original, the team incorporated a foreground-background separation step that extracts background noise from the original audio and adds it to the dubbed speech.
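In its simplest form, the separation step above amounts to estimating the background as what remains of the original mixture once the speech is removed, then mixing that residual into the dubbed track. The helpers below are an illustrative sketch assuming the signals are plain sample lists; a real separator would be a trained model.

```python
def extract_background(mixture, speech):
    """Estimate background as the residual after subtracting the
    separated speech from the original mixture (toy approximation)."""
    return [m - s for m, s in zip(mixture, speech)]

def add_background(dubbed, background):
    """Mix the extracted background into the dubbed speech,
    zero-padding the background if the dubbed track runs longer."""
    out = []
    for i, d in enumerate(dubbed):
        b = background[i] if i < len(background) else 0.0
        out.append(d + b)
    return out
```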
A separate re-reverberation step estimates the reverberation of the original recording environment and applies it to the dubbed audio.
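Applying an estimated room response to a dry signal is conventionally done by convolution with an impulse response. The sketch below shows that operation in pure Python; how the impulse response is estimated from the original audio is the hard part, and is not shown here.

```python
def apply_reverb(dry, impulse_response):
    """Convolve a dry (dubbed) signal with an impulse response
    estimated from the original recording environment.

    Plain FIR convolution: each input sample triggers a scaled,
    delayed copy of the impulse response in the output.
    """
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for n, x in enumerate(dry):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out
```

With a unit impulse followed by a half-amplitude echo as the response, a single click produces the click plus its echo, which is the effect the dubbing pipeline wants to match.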