References

Foundational papers (architectural study only; do not copy weights)

| Paper | Key contribution |
| --- | --- |
| SoundStream (Zeghidour et al. 2021, arXiv:2107.03312) | First neural audio codec with residual vector quantization (RVQ) |
| EnCodec (Défossez et al. 2022, arXiv:2210.13438) | SEANet + LSTM, multi-scale STFT (MS-STFT) discriminator, loss balancer, 32 codebooks × 1024 entries @ 75 Hz, dead-code restart |
| MusicLM (Agostinelli et al. 2023, arXiv:2301.11325) | Three-stage pipeline MuLan → semantic → acoustic; 280 K hours of training audio |
| AudioLM (Borsos et al. 2022, arXiv:2209.03143) | Semantic tokens (w2v-BERT) + acoustic tokens (SoundStream); cascade of 3 Transformers |
| MusicGen / AudioCraft (Copet et al. 2023, arXiv:2306.05284) | Single-stage autoregressive model without semantic tokens; delay/parallel/flat codebook patterns; classifier-free guidance (CFG); 300 M / 1.5 B / 3.3 B parameters |
| Stack-and-Delay (Le Lan et al. 2023, arXiv:2309.08804) | Improvement on the delay pattern |
| Stable Audio (Evans et al. ICML 2024, arXiv:2402.04825) | Latent diffusion + timing embeddings; 95 s of 44.1 kHz stereo |
| Stable Audio 2 (Evans et al. 2024, arXiv:2404.10301) | DiT + highly compressive VAE → full songs of 4 min 45 s |
| Stable Audio Open (Evans et al. 2024, arXiv:2407.14358) | Open-weights DiT version |
| Jukebox (Dhariwal et al. 2020, arXiv:2005.00341) | Early raw-audio model to generate music with coherent singing; hierarchical VQ-VAE; conditioning on genre + artist + lyrics |
| AudioLDM / AudioLDM 2 (Liu et al. 2023) | Latent diffusion over mel-spectrograms + CLAP; AudioMAE as a universal audio representation |
| DiffSinger (Liu et al. AAAI 2022, arXiv:2105.02446) | Acoustic model for singing voice synthesis (SVS): DDPM over mel-spectrograms + shallow diffusion mechanism |
| FastSpeech 2 (Ren et al. 2020) | Feed-forward Transformer (FFT) encoder + variance adaptor (duration/pitch/energy) |
| HiFi-GAN (Kong et al. NeurIPS 2020) | Generator with multi-receptive-field fusion (MRF) + multi-period discriminator (MPD, periods 2, 3, 5, 7, 11) + multi-scale discriminator (MSD) |
| VISinger 2 (Zhang et al. 2023, arXiv:2211.02903) | End-to-end VITS-based model + DSP synthesizer (sinusoidal/aperiodic components) |
| NaturalSpeech 2/3 (Shen / Ju et al. 2023-2024) | Latent diffusion (NaturalSpeech 2) + factorized FACodec (NaturalSpeech 3) |
| DAC (Kumar et al. NeurIPS 2023) | Descript Audio Codec; improves on EnCodec |
| BigVGAN / Vocos | State-of-the-art vocoders as of 2024 |
| AudioSeal (San Roman et al. ICML 2024) | Proactive localized watermarking |
| SilentCipher (Singh et al. Interspeech 2024) | Deep audio watermarking |
| REMI / CP / MMM (Huang & Yang 2020; Hsiao et al. 2021) | Efficient MIDI tokenizations |
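
Two of the figures in the table lend themselves to a quick sanity check: the EnCodec row's "32 codebooks × 1024 @ 75 Hz" implies a bitrate, and the MusicGen row's delay pattern implies a specific token layout. The sketch below is illustrative only. The function names `rvq_bitrate_bps` and `delay_pattern` and the `-1` padding marker are assumptions made for this example, not the API of any repository listed on this page, and the delay layout shown is the basic idea (codebook k shifted right by k steps) rather than a reproduction of any particular implementation.

```python
import math
from typing import List


def rvq_bitrate_bps(num_codebooks: int, codebook_size: int, frame_rate_hz: float) -> float:
    """Bitrate of an RVQ codec that emits one index per codebook per frame."""
    bits_per_index = math.log2(codebook_size)  # 1024 entries -> 10 bits
    return num_codebooks * bits_per_index * frame_rate_hz


def delay_pattern(num_codebooks: int, num_frames: int, pad: int = -1) -> List[List[int]]:
    """For each codebook k, list which frame index it emits at each decoding step.

    Codebook k is shifted right by k steps; positions with no frame hold `pad`
    (a hypothetical marker chosen for this sketch).
    """
    steps = num_frames + num_codebooks - 1
    return [
        [(s - k) if 0 <= s - k < num_frames else pad for s in range(steps)]
        for k in range(num_codebooks)
    ]


if __name__ == "__main__":
    # Figures from the EnCodec row: 32 codebooks, 1024 entries, 75 Hz.
    print(rvq_bitrate_bps(32, 1024, 75))  # 24000.0, i.e. 24 kbps

    # Small delay layout: 3 codebooks, 4 frames.
    for row in delay_pattern(3, 4):
        print(row)
    # [0, 1, 2, 3, -1, -1]
    # [-1, 0, 1, 2, 3, -1]
    # [-1, -1, 0, 1, 2, 3]
```

With K codebooks and T frames, the delayed layout needs T + K - 1 decoding steps, compared with K × T steps for a fully flattened sequence.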

Open-source repos (study only)

  • facebookresearch/audiocraft (MusicGen, EnCodec, AudioGen; MIT code, CC-BY-NC model weights).
  • Stability-AI/stable-audio-tools.
  • openai/jukebox.
  • NVIDIA/NeMo.
  • coqui-ai/TTS.
  • MoonInTheRiver/DiffSinger and openvpi/DiffSinger (actively maintained version, 2024-25).
  • descript/audiotools.
  • csteinmetz1/pyloudnorm.
  • spotify/pedalboard.
  • facebookresearch/audioseal.
  • gudgud96/frechet-audio-distance.
  • microsoft/fadtk.
  • RVC-Project/RVC (architecture only; do not use the weights).

Communities / forums

  • r/MachineLearning, r/LocalLLaMA (audio threads).
  • Discord: AudioCraft / Stability Harmonai / AI Audio Research.
  • Papers With Code: Audio Generation and Singing Voice Synthesis tasks.
  • ISMIR / ICASSP / Interspeech proceedings (open access).