High-Fidelity Music Vocoder using Neural Audio Codecs

Luca A. Lanzendörfer, Florian Grötschla, Michael Ungersböck, and Roger Wattenhofer

ETH Zurich

Mel Spectrogram Difference

Comparison of MTG-Jamendo [1] samples. The difference between the original and the reconstructed mel spectrograms is shown.

[Figure: mel spectrogram panels for four music samples (0 to 3), one column per system: Ground Truth (original), HiFi-GAN, BigVGAN, BigVGAN-v2, DisCoder.]
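A difference image of this kind can be approximated as follows. This is a minimal sketch and not necessarily the procedure used to produce the figures: the sample rate, FFT size, hop length, number of mel bands, the log-magnitude scaling, and the function names `mel_filterbank`, `log_mel`, and `mel_difference` are all assumptions introduced for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 band edges, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs[None, :] - lower) / (center - lower)
    falling = (upper - fft_freqs[None, :]) / (upper - center)
    return np.clip(np.minimum(rising, falling), 0.0, None)

def log_mel(audio, sr=44100, n_fft=2048, hop=512, n_mels=128):
    """Log-magnitude mel spectrogram, shape (n_mels, n_frames).

    All default parameter values are assumed, not taken from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    mel = mel_filterbank(sr, n_fft, n_mels) @ magnitude.T
    return np.log(np.clip(mel, 1e-5, None))  # floor avoids log(0)

def mel_difference(reference, reconstruction, **kwargs):
    """Absolute log-mel difference between two mono waveforms."""
    n = min(len(reference), len(reconstruction))  # trim to a common length
    return np.abs(log_mel(reference[:n], **kwargs) - log_mel(reconstruction[:n], **kwargs))
```

Rendering the array returned by `mel_difference` as a heat map gives one panel per vocoder; values near zero indicate time-frequency regions where the reconstruction matches the original closely.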

Music Synthesis

Comparison of MTG-Jamendo [1] and MUSDB18-HQ [2] samples.

[Audio samples, one column per system: Ground Truth, HiFi-GAN, BigVGAN, BigVGAN-v2, DisCoder.]

Speech Synthesis

DisCoder shows competitive performance on speech samples taken from LibriTTS [3].

[Audio samples, one column per system: Ground Truth, HiFi-GAN, BigVGAN, BigVGAN-v2, DisCoder.]

[1] https://mtg.github.io/mtg-jamendo-dataset/
[2] https://sigsep.github.io/datasets/musdb.html
[3] https://www.openslr.org/60/