Obtaining strong, reproducible foundation language-audio models requires open datasets of sufficient scale and quality. To pre-train a contrastive language-audio model, we compose a large-scale sound-effects dataset with detailed text descriptions for each sample. Generating music, as a special type of audio, presents further challenges due to the limited availability of music-text pairs with sufficiently expressive captions. We show here how we combine various composed datasets to pre-train a large-scale contrastive language-audio model (CLAP). We then train, on music samples we collected, a state-of-the-art text-to-music model, MusicLDM, which adapts AudioLDM, based on the Stable Diffusion architecture, to the music domain by using the pre-trained CLAP model and the HiFi-GAN vocoder as components. The modelling work validates the composed text-audio and text-music datasets as a strong basis for further studies on language-rooted foundation models for audio at larger scales.
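To make the contrastive pre-training step concrete, below is a minimal sketch of the symmetric InfoNCE objective commonly used for CLAP-style models, which pulls matched audio-text pairs together and pushes mismatched pairs apart within a batch. The function name, tensor shapes, and the learnable log-scale temperature follow the open-source CLIP/CLAP convention and are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) outputs of the audio and text encoders.
    logit_scale: learnable temperature, stored in log space as in CLIP.
    """
    # L2-normalize so the dot product is a cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = logit_scale.exp() * audio_emb @ text_emb.t()
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: audio-to-text and text-to-audio.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2

# Hypothetical usage with random stand-ins for encoder outputs.
logit_scale = torch.nn.Parameter(torch.tensor(2.6593))  # log(1/0.07), CLIP's init
audio_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
loss = clap_contrastive_loss(audio_emb, text_emb, logit_scale)
```

In a full pipeline, the embeddings would come from the CLAP audio and text encoders; at generation time, MusicLDM conditions its latent diffusion on the CLAP text embedding and renders the decoded spectrogram to a waveform with the HiFi-GAN vocoder.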

Citation:

Marianna Nezhurina et al., "Composing and Validating Large-Scale Datasets for Training Open Foundation Models for Audio." [Online]. Available: https://mlforaudioworkshop.com/CompValDataFoundationModels.pdf