Wei Ping - Deep Voice 3: 2000-Speaker Neural Text-to-Speech (2017)

History / Edit / PDF / EPUB / BIB /
Created: December 7, 2017 / Updated: February 6, 2021 / Status: finished / 2 min read (~285 words)

  • Attention Error Modes: mispronunciations, skipped words, and repeated words

  • A fully-convolutional character-to-spectrogram architecture, which enables fully paralleled computation over elements in a sequence and trains an order of magnitude faster than analogous architectures using recurrent cells
  • In contrast to Deep Voice 1 & 2, Deep Voice 3 employs an attention-based sequence-to-sequence model, yielding a more compact architecture
  • Deep Voice 3 avoids RNNs to speed up training and alleviates several challenging error modes that attention models fall into

  • Deep Voice 3 architecture consists of three components:
    • Encoder: A fully-convolutional encoder, which converts textual features to an internal learned representation
    • Decoder: A fully-convolutional causal decoder, which decodes the learned representation with a multi-hop convolutional attention mechanism into a low-dimensional audio representation (mel-band spectrograms) in an auto-regressive manner
    • Converter: A fully-convolutional post-processing network, which predicts final output features (depending on the waveform synthesis method) from the decoder hidden states. Unlike the decoder, the converter is non-causal and can thus depend on future context information

  • Uppercase all characters in the input text
  • Remove all intermediate punctuation marks
  • End every utterance with a period or question mark
  • Replace spaces between words with special separator characters which indicate the duration of pauses inserted by the speaker between words

  • Fast Training
    • The average training iteration time is 0.06 seconds using one GPU as opposed to 0.59 seconds for Tacotron
    • Converges after ~500K iterations for all three datasets in our experiment, while Tacotron requires ~2M iterations
    • This significant speedup is due to the fully-convolutional architecture of Deep Voice 3, which highly exploits the parallelism of a GPU during training

  • Ping, Wei, et al. "Deep Voice 3: 2000-Speaker Neural Text-to-Speech." arXiv preprint arXiv:1710.07654 (2017).