| SoundStream (Zeghidour et al. 2021, arXiv:2107.03312) | First neural audio codec with residual vector quantization (RVQ) |
| EnCodec (Défossez et al. 2022, arXiv:2210.13438) | SEANet + LSTM, MS-STFT discriminator, loss balancer, up to 32 codebooks × 1024 entries at 75 Hz, dead-code restart |
| MusicLM (Agostinelli et al. 2023, arXiv:2301.11325) | Three-stage pipeline MuLan → semantic → acoustic, trained on 280 K h of music |
| AudioLM (Borsos et al. 2022, arXiv:2209.03143) | Semantic tokens (w2v-BERT) + acoustic tokens (SoundStream), cascade of 3 Transformers |
| MusicGen / AudioCraft (Copet et al. 2023, arXiv:2306.05284) | Single-stage autoregressive model without semantic tokens, delay/parallel/flat codebook patterns, classifier-free guidance (CFG), 300 M / 1.5 B / 3.3 B parameters |
| Stack-and-Delay (Le Lan et al. 2023, arXiv:2309.08804) | Codebook pattern that improves on the delay pattern |
| Stable Audio (Evans et al. ICML 2024, arXiv:2402.04825) | Latent diffusion + timing embeddings, up to 95 s of 44.1 kHz stereo |
| Stable Audio 2 (Evans et al. 2024, arXiv:2404.10301) | DiT + highly compressive VAE → full songs of up to 4 min 45 s |
| Stable Audio Open (Evans et al. 2024, arXiv:2407.14358) | Open-weights DiT version |
| Jukebox (Dhariwal et al. 2020, arXiv:2005.00341) | First model to generate coherent singing voice, hierarchical VQ-VAE, conditioning on genre + artist + lyrics |
| AudioLDM / AudioLDM 2 (Liu et al. 2023) | Latent diffusion over mel spectrograms + CLAP (AudioLDM), universal AudioMAE representation (AudioLDM 2) |
| DiffSinger (Liu et al. AAAI 2022, arXiv:2105.02446) | Acoustic model for singing voice synthesis (SVS): DDPM over mel spectrograms + shallow diffusion mechanism |
| FastSpeech 2 (Ren et al. 2020) | Feed-forward Transformer (FFT) encoder + variance adaptor (duration/pitch/energy) |
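| HiFi-GAN (Kong et al. NeurIPS 2020) | Generator with multi-receptive field fusion (MRF) + multi-period discriminator (MPD) with periods 2, 3, 5, 7, 11 + multi-scale discriminator (MSD) |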
| VISinger 2 (Zhang et al. 2023, arXiv:2211.02903) | End-to-end, VITS-based + DSP synthesizer (sinusoidal/aperiodic components) |
| NaturalSpeech 2/3 (Shen et al. 2023; Ju et al. 2024) | Latent diffusion + factorized FACodec (NaturalSpeech 3) |
| DAC (Kumar et al. NeurIPS 2023) | Descript Audio Codec, an improvement over EnCodec |
| BigVGAN / Vocos | State-of-the-art vocoders as of 2024 |
| AudioSeal (San Roman et al. ICML 2024) | Proactive localized watermarking |
| SilentCipher (Singh et al. Interspeech 2024) | Deep audio watermarking |
| REMI / CP / MMM (Huang & Yang 2020; Hsiao et al. 2021) | Efficient MIDI tokenizers |
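
Several rows above refer to codebook interleaving patterns (MusicGen, Stack-and-Delay). As a rough illustration, the delay pattern offsets codebook k by k steps, so that each autoregressive step emits one token per codebook and a clip of T frames with K codebooks takes T + K − 1 steps instead of the K·T steps of the flat pattern. The sketch below works on plain Python lists; the function names and the `PAD` value are hypothetical choices for this example and are not taken from AudioCraft.

```python
from typing import List

PAD = -1  # hypothetical padding id; real systems reserve a special token


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps (codes is K rows of T frames each)."""
    if not codes:
        return []
    num_codebooks = len(codes)
    num_steps = len(codes[0])
    delayed = []
    for k, row in enumerate(codes):
        if len(row) != num_steps:
            raise ValueError("all codebooks must have the same number of frames")
        # k pads on the left, K - 1 - k on the right: every row has T + K - 1 steps
        delayed.append([pad] * k + list(row) + [pad] * (num_codebooks - 1 - k))
    return delayed


def undo_delay_pattern(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay_pattern, recovering the aligned K x T grid."""
    if not delayed:
        return []
    num_codebooks = len(delayed)
    num_steps = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_steps] for k, row in enumerate(delayed)]


if __name__ == "__main__":
    codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # K = 3 codebooks, T = 3 frames
    for row in apply_delay_pattern(codes):
        print(row)
```

Reading the delayed grid column by column gives the tokens a model would predict at each step; `undo_delay_pattern` realigns them before they are handed to the codec decoder.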