Inter-X: Towards Versatile Human-Human Interaction Analysis

Liang Xu1,2, Xintao Lv1, Yichao Yan1, Xin Jin2, Shuwen Wu1, Congsheng Xu1, Yifan Liu1, Yizhou Zhou3, Fengyun Rao3, Xingdong Sheng4, Yunhui Liu4, Wenjun Zeng2, Xiaokang Yang1
1Shanghai Jiao Tong University, 2Eastern Institute of Technology, Ningbo, 3WeChat, Tencent Inc., 4Lenovo.
CVPR 2024

Figure 1. Inter-X is a large-scale human-human interaction MoCap dataset with ~11K interaction sequences and more than 8.1M frames. The fine-grained textual descriptions, semantic action categories, interaction order, and relationship and personality annotations allow for 4 categories of downstream tasks.



Abstract

Analyzing ubiquitous human-human interactions is pivotal for understanding humans as social beings. Existing human-human interaction datasets typically suffer from inaccurate body motions, missing hand gestures, and a lack of fine-grained textual descriptions.

To better perceive and generate human-human interactions, we propose Inter-X, currently the largest human-human interaction dataset, featuring accurate body movements, diverse interaction patterns, and detailed hand gestures. The dataset includes ~11K interaction sequences and more than 8.1M frames. We also equip Inter-X with versatile annotations: more than 34K fine-grained, human part-level textual descriptions, semantic interaction categories, interaction order, and the relationship and personality of the subjects. Based on these elaborate annotations, we propose a unified benchmark composed of 4 categories of downstream tasks spanning both the perceptual and generative directions.

Extensive experiments and comprehensive analysis show that Inter-X serves as a testbed for promoting the development of versatile human-human interaction analysis.

Inter-X Dataset

Inter-X is a large-scale dataset containing ~11K interaction sequences, more than 8.1M frames, and 34K fine-grained, human part-level textual descriptions. Below are some characteristics and samples of the dataset.


Figure 2. Our proposed Inter-X dataset for human-human interaction analysis is highly accurate, incorporates hand gestures, and covers diverse actions and reactions.

SMPL-X Interaction Sequences
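
To make the sequence format concrete, below is a minimal sketch of forwarding one person's per-frame SMPL-X parameters through the open-source smplx Python package. The frame count T, the model path, and the parameter key names (seq['P1'], 'body_pose', etc.) are illustrative assumptions, not the dataset's documented schema.

import torch
import smplx

T = 64  # hypothetical number of frames in one clip

# Neutral SMPL-X model; use_pca=False keeps the full 45-D hand poses,
# which matters here since Inter-X records detailed hand gestures.
model = smplx.create(
    model_path='models',  # assumed directory holding the SMPL-X model files
    model_type='smplx',
    gender='neutral',
    use_pca=False,
    batch_size=T,
)

def forward_person(params):
    """Run one person's per-frame parameters through SMPL-X."""
    t = lambda a: torch.as_tensor(a, dtype=torch.float32)
    out = model(
        betas=t(params['betas']).expand(T, -1),       # (10,) shape coefficients
        global_orient=t(params['global_orient']),     # (T, 3)
        body_pose=t(params['body_pose']),             # (T, 63)
        left_hand_pose=t(params['left_hand_pose']),   # (T, 45)
        right_hand_pose=t(params['right_hand_pose']), # (T, 45)
        transl=t(params['transl']),                   # (T, 3)
    )
    return out.vertices.detach(), out.joints.detach()

# Two interacting people; the interaction-order annotation tells which
# person acts and which reacts:
#   verts_a, joints_a = forward_person(seq['P1'])
#   verts_b, joints_b = forward_person(seq['P2'])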

Textual Descriptions

Human part-level textual descriptions cover 1) coarse body movements, 2) finger movements, and 3) the relative orientations of the two subjects.
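
For illustration, here is a hypothetical record combining the annotation types described above into one Python dict; every field name is an assumption for exposition, not the dataset's actual schema.

annotation = {
    'action': 'handshake',              # semantic interaction category
    'interaction_order': ['P1', 'P2'],  # actor first, reactor second
    'relationship': 'colleagues',       # relationship of the two subjects
    'description': (
        'P1 steps forward and extends the right arm toward P2, '  # body movement
        "curling the fingers around P2's palm, "                  # finger movement
        'while the two subjects face each other.'                 # relative orientation
    ),
}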

BibTeX

@article{xu2023inter,
  title={Inter-X: Towards Versatile Human-Human Interaction Analysis},
  author={Xu, Liang and Lv, Xintao and Yan, Yichao and Jin, Xin and Wu, Shuwen and Xu, Congsheng and Liu, Yifan and Zhou, Yizhou and Rao, Fengyun and Sheng, Xingdong and Liu, Yunhui and Zeng, Wenjun and Yang, Xiaokang},
  journal={arXiv preprint arXiv:2312.16051},
  year={2023}
}