We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters)...
Efficient physics simulations are essential for numerous applications, ranging from realistic cloth animations or smoke effects in video games, to analyzing pollutant dispersion in environmental sciences, to calculating vehicle drag coefficients in engineering...
In this technical report, we describe our submission for Task 1 Data-Efficient Low-Complexity Acoustic Scene Classification [1]. We adopt the baseline model and add a specialised knowledge distillation process before proceeding with the baseline training process. Our...
Despite the rising popularity of message-passing neural networks (MPNNs), their ability to fit complex functions over graphs is limited as node representations become more similar with increasing depth—a phenomenon known as over-smoothing. Most approaches to mitigate...
Safe navigation of self-driving cars and robots requires a precise understanding of their environment. Training data for perception systems cannot cover the wide variety of objects that may appear during deployment. Thus, reliable identification of unknown objects,...
Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image’s visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for...