VIHand: Enhancing 3D Hand Pose Estimation with Visual-Inertial Benchmark

Xinyi Wang1     Pengfei Ren2     Haoyang Zhang3     Xin Sheng4   Da Li5     Liang Xie3     Yue Gao1   Erwei Yin1, 3
1Shanghai Jiao Tong University   2Beijing University of Posts and Telecommunications   3Defense Innovation Institute, Academy of Military Sciences   4Tianjin University   5Nankai University



Dataset Overview

VIHand is the first large-scale glove-worn dataset for visual-inertial hand pose estimation, containing over 1.4 million synchronized RGB-D and IMU frames from 15 subjects. It provides accurate frame-level 3D joint annotations across complex gestures, enabling comprehensive research on HPE tasks, including multimodal fusion, cross-modal knowledge transfer, and cross-modal data generation.
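
Conceptually, each annotated frame pairs an RGB-D observation with the glove's IMU readings and a frame-level 3D joint label. The container below is only an illustrative sketch; the field names and shapes are our assumptions, not the released data format.

            from dataclasses import dataclass
            import numpy as np

            @dataclass
            class VIHandSample:
                """Illustrative container for one synchronized VIHand frame.

                Field names and shapes are assumptions for illustration;
                consult the dataset documentation for the actual format.
                """
                rgb: np.ndarray        # (720, 1280, 3) color image
                depth: np.ndarray      # (720, 1280) depth map
                imu: np.ndarray        # (7, C) readings from the 7 glove IMUs
                joints_3d: np.ndarray  # (J, 3) frame-level 3D joint annotation
                mano: np.ndarray       # MANO pose/shape parameters for this frame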

Abstract

Accurate and robust 3D hand pose estimation (HPE) plays a crucial role in human-computer interaction. Existing 3D HPE solutions predominantly rely on vision-based or inertial measurement unit (IMU)-based methods. Vision-based methods benefit from rich appearance information and achieve high accuracy, but are sensitive to field of view (FoV), occlusion, motion blur, and lighting. IMU-based methods are unaffected by optical conditions and FoV constraints, but remain vulnerable to cumulative integration errors and drift. Given their complementary strengths, combining the two modalities offers a promising direction for HPE in complex environments. However, the lack of large-scale visual-inertial datasets has limited progress in this area. In this paper, we construct VIHand, the first large-scale glove-worn dataset for visual-inertial HPE, comprising over 1.4 million synchronized RGB-D and IMU frames from 15 subjects. It enables comprehensive research on HPE tasks, such as multimodal fusion and cross-modal knowledge transfer. Building on VIHand, we propose the visual-inertial fusion network (VIFNet) for dual-modality estimation, along with its distilled student model (VIFNet-S) for IMU-only evaluation. Experimental results reveal that integrating visual and inertial modalities significantly improves the accuracy and robustness of 3D HPE, particularly under occlusion and motion blur. In IMU-only inference, even with sparse IMU configurations, models distilled from visual-inertial supervision achieve substantial performance gains, enabling robust HPE in challenging, optically sensitive scenarios.

Capture System

VIHand was collected using a synchronized multi-sensor system consisting of 5 Intel RealSense D415i cameras and a data glove equipped with 7 IMU sensors. The five cameras were placed at diverse viewpoints around the hand to synchronously capture high-resolution RGB-D streams at 1280×720 resolution and 30 FPS. To capture the dynamic inertial data of hand motion, the 7 IMUs were embedded in the data glove at the wrist, palm, and the thumb, index, middle, ring, and little fingers.

capture system
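
The cameras and the glove are described as synchronized; when re-aligning or resampling such multi-rate streams yourself, a common approach is nearest-timestamp association, sketched below. This is a generic sketch, not VIHand's own synchronization procedure.

            import numpy as np

            def nearest_imu_indices(frame_ts: np.ndarray, imu_ts: np.ndarray) -> np.ndarray:
                """Pair each camera frame timestamp with the closest IMU reading.

                A generic nearest-timestamp association; the dataset's own
                hardware/software synchronization may differ.
                """
                idx = np.searchsorted(imu_ts, frame_ts)   # insertion points into sorted IMU times
                idx = np.clip(idx, 1, len(imu_ts) - 1)
                left, right = imu_ts[idx - 1], imu_ts[idx]
                return np.where(frame_ts - left <= right - frame_ts, idx - 1, idx)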

Benchmark

Overall framework: The fusion model (VIFNet) takes monocular RGB images and IMU data as input, extracts spatial and temporal features, and fuses them via a cross-attention mechanism. The fused features are then decoded by a MANO head to estimate the 3D hand pose. The student model (VIFNet-S), trained on IMU data alone, learns to mimic VIFNet's fused feature representation through knowledge distillation.

VIHand Architecture Diagram
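
As a rough, non-official sketch of this kind of pipeline, the toy model below combines an image encoder and an IMU encoder through cross-attention, regresses MANO-style parameters, and defines a feature-mimicking distillation loss for an IMU-only student. All module names, dimensions, and loss weights are placeholder assumptions, not the released VIFNet/VIFNet-S implementation.

            import torch
            import torch.nn as nn

            class CrossAttentionFusion(nn.Module):
                """Fuse visual tokens (queries) with IMU tokens (keys/values)."""
                def __init__(self, dim: int = 256, heads: int = 4):
                    super().__init__()
                    self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                    self.norm = nn.LayerNorm(dim)

                def forward(self, vis_tokens, imu_tokens):
                    fused, _ = self.attn(vis_tokens, imu_tokens, imu_tokens)
                    return self.norm(vis_tokens + fused)   # residual fusion

            class ToyVIFusionNet(nn.Module):
                """Minimal stand-in for a visual-inertial fusion model.

                The real VIFNet uses image backbones, temporal IMU encoding, and
                a MANO head; here everything is reduced to simple layers.
                """
                def __init__(self, imu_dim: int = 7 * 9, dim: int = 256, mano_params: int = 61):
                    super().__init__()
                    self.visual_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
                    self.imu_encoder = nn.Sequential(nn.Linear(imu_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                    self.fusion = CrossAttentionFusion(dim)
                    self.mano_head = nn.Linear(dim, mano_params)  # placeholder size for pose/shape parameters

                def forward(self, rgb, imu):
                    vis = self.visual_encoder(rgb).unsqueeze(1)   # (B, 1, dim)
                    inr = self.imu_encoder(imu).unsqueeze(1)      # (B, 1, dim)
                    fused = self.fusion(vis, inr).squeeze(1)      # (B, dim)
                    return self.mano_head(fused), fused

            # Feature-mimicking distillation: an IMU-only student regresses the
            # teacher's fused features in addition to the pose target.
            def distillation_loss(student_feat, teacher_feat, student_pose, gt_pose, alpha=0.5):
                mimic = nn.functional.mse_loss(student_feat, teacher_feat.detach())
                task = nn.functional.mse_loss(student_pose, gt_pose)
                return task + alpha * mimic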

Results

Qualitative results show that VIFNet clearly outperforms previous methods on VIHand, demonstrating that integrating visual and inertial modalities significantly improves the accuracy and robustness of 3D HPE, particularly under occlusion and motion blur.

Qualitative Result 1

Dataset Directory Structure

            ${ROOT}
            ├── multidata
            │   ├── subject1 ~ subject15           # 15 subjects
            │   │   ├── ROM01 ~ ROM15              # 15 gesture sequences
            │   │   │   ├── camera0 ~ camera4      # 5 camera views
            │   │   │   │   ├── rgb                # RGB sequences
            │   │   │   │   ├── depth              # Depth map sequences
            │   │   │   │   ├── bbox.json          # bounding box
            │   │   │   │   └── center.json        # Depth map center
            │   │   │   ├── IMU.json               # IMU sequences
            │   │   │   └── camera_paras.json      # camera parameters
            │   │   │...
            │   │...
            ├── annotations
            │   ├── subject1 ~ subject15
            │   │   ├── ROM01 ~ ROM15
            │   │   │   ├── joint.json
            │   │   │   ├── mano.json
            │   │   │   └── mesh.json
            │   │   │...
            │   │...
                

Note: The directory tree above shows how the dataset is organized; annotations are stored per subject and per gesture sequence, in parallel with the raw multimodal data.
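
Following this layout, a sequence can be located by subject, gesture sequence (ROM), and camera view. The sketch below only resolves paths and loads raw JSON; the internal JSON schemas are not documented here, so the helper and its return values are assumptions rather than an official loader.

            import json
            from pathlib import Path

            ROOT = Path("/path/to/VIHand")   # replace with your dataset root

            def load_sequence(subject: str, rom: str, camera: str = "camera0"):
                """Resolve the files for one (subject, gesture, camera) sequence.

                Follows the directory layout shown above; only paths and raw
                JSON are returned, since the file schemas are not shown here.
                """
                seq_dir = ROOT / "multidata" / subject / rom
                cam_dir = seq_dir / camera
                ann_dir = ROOT / "annotations" / subject / rom

                rgb_frames = sorted((cam_dir / "rgb").iterdir())
                depth_frames = sorted((cam_dir / "depth").iterdir())

                with open(seq_dir / "IMU.json") as f:
                    imu = json.load(f)
                with open(ann_dir / "joint.json") as f:
                    joints = json.load(f)

                return rgb_frames, depth_frames, imu, joints

            # Example: first gesture sequence of the first subject, default camera view.
            # rgb, depth, imu, joints = load_sequence("subject1", "ROM01")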

Dataset Details

Frame Distribution by Category

Category      Total Frames
Train Set     1.06M
Test Set      340K
Total         1.4M

Data Sample

Data Sample

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Nos. 62332019, 62406039), the National Key R&D Program of China (Nos. 2023YFF1203900, 2023YFF1203903), the China Postdoctoral Science Foundation (Nos. 2023TQ0039, 2024M750257, GZC20230320), and the Beijing Nova Program (No. 20240484513).