The analysis of ubiquitous human-human interactions is pivotal for understanding humans as social beings. Existing human-human interaction datasets typically suffer from inaccurate body motions and lack hand gestures and fine-grained textual descriptions.
To better perceive and generate human-human interactions, we propose Inter-X, currently the largest human-human interaction dataset, featuring accurate body movements, diverse interaction patterns, and detailed hand gestures. The dataset includes ~11K interaction sequences and more than 8.1M frames. We also equip Inter-X with versatile annotations: more than 34K fine-grained human part-level textual descriptions, semantic interaction categories, interaction order, and the relationship and personality of the subjects. Based on these elaborate annotations, we propose a unified benchmark composed of 4 categories of downstream tasks spanning both the perceptual and generative directions.
Extensive experiments and comprehensive analysis show that Inter-X serves as a testbed for advancing versatile human-human interaction analysis.
Inter-X is a large-scale dataset containing ~11K interaction sequences, more than 8.1M frames, and 34K fine-grained human part-level textual descriptions. Below are some characteristics and samples from the dataset.
“One person opens his/her arms and walks towards the other person, embracing him/her, while the other person reciprocates the hug by also opening his/her arms. After they embrace, both individuals step back.”
“Two individuals are positioned opposite each other and proceed to slowly lift their right hands towards one another. They seize hold of each other's right hands and proceed to shake them in an upward and downward motion a few times. Following this, they both simultaneously lower their hands.”
“Two people face each other, with one person using his/her left hand to pull the other person's right arm while stepping backward, guiding the other person forward.”
“One individual strikes the left leg of the other individual using his/her right foot, while the individual who was struck slightly flexes his/her left knee and rotates towards the right.”
“A single individual occupies a seat, with both hands placed on the armrests of the chair, while the second individual stands at the rear. The standing individual's head is slightly inclined towards the upper left of the seated individual's head, and his/her hands are elevated in a gesture symbolizing peace or joy.”
“One person walks up to another person and raises both of his/her hands. When the second person stands in front of the first person, he/she pushes the first person, causing him/her to move backward.”
“One person sits on the left side of the seated person, placing his/her weight on the seated person's left thigh, with both hands resting naturally on the thigh.”
“A single individual approaches another individual directly and collides his/her left shoulder with the left shoulder of the other person, after which they both pivot to confront each other.”
“One person forcefully steps on the right foot of the other person with his/her left foot. The person whose foot is stepped on bends down, lifts his/her right foot with both hands, and holds it in pain.”
“One person stands in front of the other person and starts jogging. The other person, upon seeing this, stands up and begins jogging counterclockwise with his/her hands hanging naturally. The first person continues to chase while raising his/her arms. Eventually, they both slow down, and the second person pats his/her own chest with his/her right hand.”
“One person places his/her right hand on the other person's shoulder and his/her left hand near his/her left ear, as if whispering something. The other person, surprised by what he/she hears, takes a step back and places both hands on his/her chest.”
“One individual aids the other by stretching out both arms, assisting him/her in moving ahead while he/she seems to be hobbling and requiring aid.”
Human part-level textual descriptions cover 1) the coarse body movements, 2) the finger movements, and 3) the relative orientations.
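To make the three-component structure of these descriptions concrete, here is a minimal illustrative sketch of how one might model such an annotation. The class and field names are hypothetical, chosen to mirror the three components listed above; they are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class PartLevelDescription:
    """Hypothetical container for one human part-level annotation.

    Field names are illustrative only; they mirror the three
    components of Inter-X's part-level descriptions, not an
    official data format.
    """
    body_movement: str         # coarse body movement, e.g. "walks forward"
    finger_movement: str       # fine-grained hand/finger motion
    relative_orientation: str  # orientation of one subject w.r.t. the other

    def to_sentence(self) -> str:
        # Join the three components into one readable description.
        return (f"{self.body_movement}; {self.finger_movement}; "
                f"{self.relative_orientation}.")


desc = PartLevelDescription(
    body_movement="One person opens his/her arms and walks towards the other",
    finger_movement="fingers spread while reaching out",
    relative_orientation="the two subjects face each other",
)
print(desc.to_sentence())
```

Structuring the annotation as separate fields rather than one free-form string makes it easy to query or supervise each granularity (body, hands, orientation) independently.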
@article{xu2023inter,
title={Inter-X: Towards Versatile Human-Human Interaction Analysis},
author={Xu, Liang and Lv, Xintao and Yan, Yichao and Jin, Xin and Wu, Shuwen and Xu, Congsheng and Liu, Yifan and Zhou, Yizhou and Rao, Fengyun and Sheng, Xingdong and Liu, Yunhui and Zeng, Wenjun and Yang, Xiaokang},
journal={arXiv preprint arXiv:2312.16051},
year={2023}
}