Spatio-Temporal Attention-Based LSTM Networks for
3D Action Recognition and Detection


Abstract

Human action analytics has attracted considerable attention in computer vision for decades. It is important to extract discriminative spatio-temporal features that model the spatial and temporal evolutions of different actions. In this work, we propose a spatial and temporal attention model to explore the spatial and temporal discriminative features for human action recognition and detection from skeleton data. We build our networks on Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. The learned model is capable of selectively focusing on discriminative joints of the skeleton within each input frame and paying different levels of attention to the outputs of different frames. To ensure effective training of the network for action recognition, we propose a regularized cross-entropy loss to drive the learning process and develop a joint training strategy accordingly. Moreover, based on the temporal attention, we develop a method to generate temporal action proposals for action detection. We evaluate the proposed method on the SBU Kinect Interaction Dataset, the NTU RGB+D Dataset and the PKU-MMD Dataset, respectively. Experiment results demonstrate the effectiveness of our proposed model on both action recognition and action detection.
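The two attention mechanisms described above can be sketched as follows. This is a minimal numpy illustration of the mechanism only, not the paper's implementation: the joint-scoring weights, the frame encoder standing in for the main LSTM, and all dimensions are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

T, J, D = 20, 25, 3   # frames, skeleton joints (NTU uses 25), 3D coordinates
H = 64                # hypothetical hidden size for the frame encoder

X = rng.normal(size=(T, J, D))        # one skeleton sequence

# --- spatial attention: a softmax weight per joint, per frame ---
Ws = rng.normal(size=(D, 1)) * 0.1    # hypothetical joint-scoring weights
scores = X @ Ws                       # (T, J, 1)
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
X_att = (alpha * X).reshape(T, J * D)  # joints re-weighted before the main network

# --- main LSTM elided; a linear-tanh frame encoder stands in for it ---
Wh = rng.normal(size=(J * D, H)) * 0.1
h = np.tanh(X_att @ Wh)               # (T, H) per-frame outputs

# --- temporal attention: a scalar gate on each frame's output ---
wt = rng.normal(size=(H,)) * 0.1
beta = 1.0 / (1.0 + np.exp(-(h @ wt)))            # (T,) gates in (0, 1)
z = (beta[:, None] * h).sum(axis=0) / beta.sum()  # attention-pooled feature

print(alpha.shape, beta.shape, z.shape)  # (20, 25, 1) (20,) (64,)
```

The spatial weights alpha sum to one across joints within each frame, while the temporal gates beta are unnormalized and re-weight whole frames, matching the two roles described in the abstract.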

Action Recognition

Method & Results

For the detailed method and results, please refer to Spatio-Temporal Attention Model for Action Recognition (AAAI 2017).
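The regularized cross-entropy loss mentioned in the abstract can be sketched as below. This follows the general form described in the paper (cross-entropy plus a term encouraging spatial attention to cover all joints over time, an l2 penalty on the temporal gates, and weight decay); the function name and coefficient values are hypothetical, and the exact formulation is in the paper.

```python
import numpy as np

def regularized_loss(y_true, y_prob, alpha, beta, weights,
                     lam1=1e-4, lam2=1e-4, lam3=5e-4):
    """Cross-entropy with attention regularizers (coefficients hypothetical).

    alpha: (T, J) spatial attention weights, beta: (T,) temporal gates,
    weights: list of parameter arrays for weight decay.
    """
    T = alpha.shape[0]
    ce = -np.sum(y_true * np.log(y_prob + 1e-12))
    # spatial: push every joint to receive attention at some point in time
    reg_s = lam1 * np.sum((1.0 - alpha.sum(axis=0) / T) ** 2)
    # temporal: l2 penalty keeps the unnormalized gates bounded
    reg_t = lam2 * np.sum(beta ** 2) / T
    # weight decay on all parameters
    reg_w = lam3 * sum(np.sum(w ** 2) for w in weights)
    return ce + reg_s + reg_t + reg_w

# toy usage with random attention maps
rng = np.random.default_rng(1)
alpha = rng.dirichlet(np.ones(25), size=20)        # (T=20, J=25), rows sum to 1
beta = rng.uniform(0.1, 1.0, size=20)
y_true = np.eye(10)[3]
y_prob = np.full(10, 0.1)
loss = regularized_loss(y_true, y_prob, alpha, beta, [rng.normal(size=(8, 8))])
```

Since every regularizer is a sum of squares, the total loss is always at least the plain cross-entropy term.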

More Analysis

Fig.1 Visualization of spatial attention (top subfigures) and temporal attention (bottom subfigures) on the actions of (a) sitting down, (b) kicking, (c) crossing hands, and (d) making a phone call (NTU-CS). The horizontal axis denotes the frame index (time).

Fig.2 Distribution of the most engaged joints for different actions under various parameter settings (NTU-CS).

Fig.3 Temporal attention curves for the action drinking water under different parameter settings.

Action Detection

Method

Teaser

Fig.4 (a) Illustration of action detection. (b) Architecture of the action detection framework.
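A natural way to turn the learned temporal attention into temporal action proposals, as the abstract describes, is to threshold the attention curve and group contiguous high-attention frames into candidate segments. The sketch below illustrates this idea; the function name, threshold, and minimum-length parameter are hypothetical, not the paper's exact procedure.

```python
import numpy as np

def temporal_proposals(beta, thresh=0.5, min_len=3):
    """Group contiguous frames whose temporal-attention weight exceeds a
    threshold into candidate action segments (hypothetical scheme)."""
    active = beta >= thresh
    proposals, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t                      # segment opens
        elif not a and start is not None:
            if t - start >= min_len:       # drop very short segments
                proposals.append((start, t - 1))
            start = None
    if start is not None and len(beta) - start >= min_len:
        proposals.append((start, len(beta) - 1))  # segment runs to the end
    return proposals

# toy attention curve with two high-attention bursts
beta = np.array([0.1, 0.2, 0.7, 0.8, 0.9, 0.6, 0.2, 0.1, 0.6, 0.7, 0.9, 0.3])
print(temporal_proposals(beta))  # → [(2, 5), (8, 10)]
```

Each proposal is a (start, end) frame pair; in a full detection pipeline these candidates would then be classified and refined, as in the framework of Fig.4(b).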

Results

Table 1. Comparison of different methods in mAP on the PKU-MMD.


Table 2. Comparison of different methods in F1-Score on the PKU-MMD.

Citation

@article{song2018spatio,
  title={Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection},
  author={Song, Sijie and Lan, Cuiling and Xing, Junliang and Zeng, Wenjun and Liu, Jiaying},
  journal={IEEE Transactions on Image Processing},
  volume={27},
  number={7},
  pages={3459--3471},
  year={2018}
}
