Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks

Fig. 1. Illustration of online action detection. The goal is to determine the action type and its temporal localization on the fly. It is also desirable to forecast the start and end points (e.g., T frames ahead).

Abstract

In this paper, we study the problem of online action detection from streaming skeleton data. We propose a multi-task, end-to-end Joint Classification-Regression Recurrent Neural Network to better exploit action type and temporal localization information. By employing a joint classification and regression optimization objective, the network automatically localizes the start and end points of actions more accurately. Specifically, by leveraging a deep Long Short-Term Memory (LSTM) subnetwork, the proposed model automatically captures complex long-range temporal dynamics, which naturally avoids the typical sliding window design and thus ensures high computational efficiency. Furthermore, the regression subtask provides the ability to forecast an action prior to its occurrence. Fig. 2 shows the architecture of the proposed joint classification-regression RNN framework. To evaluate the proposed model, we build a large streaming video dataset with annotations. Experimental results on our dataset and the public G3D dataset both demonstrate very promising performance of our scheme.

Fig. 2. Architecture of the proposed joint classification-regression RNN framework for online action detection and forecasting.
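
As a rough illustration of what a joint classification and regression objective can look like, the sketch below combines a per-frame cross-entropy term with a squared-error term on start and end confidence curves. It is a minimal sketch under stated assumptions, not the network described in the paper: the Gaussian-shaped targets, the squared-error regression term, and the names `gaussian_targets`, `joint_loss`, `sigma`, and `lam` are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def gaussian_targets(num_frames, points, sigma):
    # Assumed regression target: a confidence curve that equals 1.0 at each
    # annotated frame in `points` and decays smoothly around it.
    if len(points) == 0:
        return np.zeros(num_frames)
    t = np.arange(num_frames)[:, None]             # shape (T, 1)
    p = np.asarray(points, dtype=float)[None, :]   # shape (1, K)
    return np.exp(-((t - p) ** 2) / (2.0 * sigma ** 2)).max(axis=1)


def joint_loss(logits, labels, pred_start, pred_end, starts, ends,
               sigma=5.0, lam=1.0):
    # logits: (T, C) per-frame class scores; labels: (T,) integer class ids.
    # pred_start, pred_end: (T,) predicted confidence curves.
    # starts, ends: annotated start and end frame indices.
    num_frames = logits.shape[0]
    probs = softmax(logits)

    # Classification term: mean per-frame cross-entropy.
    picked = probs[np.arange(num_frames), labels]
    cls_loss = -np.log(picked + 1e-12).mean()

    # Regression term: squared error against the assumed target curves.
    start_target = gaussian_targets(num_frames, starts, sigma)
    end_target = gaussian_targets(num_frames, ends, sigma)
    reg_loss = (np.mean((pred_start - start_target) ** 2)
                + np.mean((pred_end - end_target) ** 2))

    # `lam` is a hypothetical weight balancing the two subtasks.
    return cls_loss + lam * reg_loss, cls_loss, reg_loss
```

Because both terms are computed frame by frame from the same sequence, a single recurrent pass can in principle serve both subtasks, which is consistent with the point above about avoiding a sliding window.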

Demo

Download

Paper: arXiv

Supplementary Material: pdf

Dataset: [Direct] [Google Drive] (~53 GB). The Online Action Detection (OAD) dataset was captured with a Kinect V2 sensor, which records color images, depth images, and human skeleton joints synchronously. The dataset includes 59 long sequences and 10 actions.

Citation

@article{li2016online,
  title={Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks},
  author={Li, Yanghao and Lan, Cuiling and Xing, Junliang and Zeng, Wenjun and Yuan, Chunfeng and Liu, Jiaying},
  journal={European Conference on Computer Vision},
  year={2016}
}