Zero-shot classifiers based on Contrastive Language-Audio Pretraining (CLAP) models enable classification of audio into classes defined at test time via text. However, these models are computationally and memory intensive. In this work, we propose to build a specialized low-resource classifier for classes pre-defined using text, using a two-stage procedure consisting of zero-shot dataset pruning and model compression. First, relevant in-domain data is selected from a source dataset using class label embeddings obtained from a pre-trained CLAP model. This data is then used to distill the audio encoder of the CLAP model. The proposed compression method produces compact audio encoders with only slightly reduced accuracy. Notably, neither labeled nor unlabeled in-domain audio data is required for its development. We verify through cross-dataset tests that the resulting classifiers are indeed specialized to their task.
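The first stage, zero-shot dataset pruning, amounts to ranking source-dataset clips by the similarity of their CLAP audio embeddings to the text embeddings of the target class labels and keeping the best matches. The sketch below illustrates this idea only; the embeddings are random placeholders standing in for actual CLAP encoder outputs, and the function name `select_in_domain` and the top-k selection rule are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between two embedding matrices.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def select_in_domain(audio_emb, label_emb, top_k):
    # Score each clip by its best match over the class label embeddings,
    # then keep the top_k highest-scoring clips as the in-domain subset.
    scores = cosine_sim(audio_emb, label_emb).max(axis=1)
    return np.argsort(scores)[::-1][:top_k]

# Placeholder embeddings standing in for CLAP encoder outputs:
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(1000, 512))   # source-dataset audio clips
label_emb = rng.normal(size=(5, 512))      # test-time class label texts
keep = select_in_domain(audio_emb, label_emb, top_k=100)
```

The selected subset would then serve as the distillation data for compressing the CLAP audio encoder.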

Citation:

Werning, A., & Häb-Umbach, R. (2025). A Fully Zero-Shot Approach to Obtaining Specialized and Compact Audio Tagging Models. In S. Möller, T. Gerkmann, & D. Kolossa (Eds.), Proceedings of the 16th ITG Conference on Speech Communication (pp. 78–82).