Real-time understanding of dynamic human presence is crucial for immersive Augmented Reality (AR), yet challenging on resource-constrained Head-Mounted Displays (HMDs). This paper introduces Rhino-AR, a pipeline for on-device 3D human pose estimation and dynamic scene...
The reconstruction of three-dimensional dynamic scenes is a well-established yet challenging task within the domain of computer vision. In this paper, we propose a novel approach that combines the domains of 3D geometry reconstruction and appearance estimation for...
Modern methods for fine-tuning a Vision Transformer (ViT), such as Low-Rank Adaptation (LoRA) and its variants, demonstrate impressive performance. However, these methods ignore the high-dimensional nature of Multi-Head Attention (MHA) weight tensors. To address this...
With cancer being a leading cause of death globally, epidemiological and clinical cancer registration is paramount for enhancing oncological care and facilitating scientific research. However, the heterogeneous landscape of medical data presents significant challenges...
State-space models (SSMs) have recently shown promise in capturing long-range dependencies with subquadratic computational complexity, making them attractive for various applications. However, purely SSM-based models face critical challenges related to stability and...
We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters)...