Paper Translation: A Comprehensive Study of Deep Video Action Recognition


Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to variances in datasets and evaluation protocols. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition. We first introduce the 17 video action recognition datasets that influenced the design of models. Then we present video action recognition models in chronological order: starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models. In addition, we benchmark popular methods on several representative datasets and release code for reproducibility. In the end, we discuss open problems and shed light on opportunities for video action recognition to facilitate new research ideas.



One of the most important tasks in video understanding is to understand human actions. It has many real-world applications, including behavior analysis, video retrieval, human-robot interaction, gaming, and entertainment. Human action understanding involves recognizing, localizing, and predicting human behaviors. The task of recognizing human actions in a video is called video action recognition. In Figure 1, we visualize several video frames with the associated action labels, which are typical human daily activities such as shaking hands and riding a bike.


Over the last decade, there has been growing research interest in video action recognition with the emergence of high-quality large-scale action recognition datasets. We summarize the statistics of popular action recognition datasets in Figure 2. We see that both the number of videos and the number of classes increase rapidly, e.g., from 7K videos over 51 classes in HMDB51 [109] to 8M videos over 3,862 classes in YouTube8M [1]. Also, the rate at which new datasets are released is increasing: 3 datasets were released from 2011 to 2015 compared to 13 released from 2016 to 2020.


Thanks to both the availability of large-scale datasets and the rapid progress in deep learning, there has also been rapid growth in deep learning based models to recognize video actions. In Figure 3, we present a chronological overview of recent representative work. DeepVideo [99] is one of the earliest attempts to apply convolutional neural networks to videos. We observe three trends here. The first trend, started by the seminal paper on Two-Stream Networks [187], adds a second path that learns the temporal information in a video by training a convolutional neural network on the optical flow stream. Its great success inspired a large number of follow-up papers, such as TDD [214], LRCN [37], Fusion [50], TSN [218], etc. The second trend was the use of 3D convolutional kernels to model video temporal information, such as I3D [14], R3D [74], S3D [239], Non-local [219], SlowFast [45], etc. Finally, the third trend focused on computational efficiency to scale to even larger datasets so that these models could be adopted in real applications. Examples include Hidden TSN [278], TSM [128], X3D [44], TVN [161], etc.


Despite the large number of deep learning based models for video action recognition, there is no comprehensive survey dedicated to these models. Previous survey papers either put more effort into hand-crafted features [77, 173] or focus on broader topics such as video captioning [236], video prediction [104], video action detection [261] and zero-shot video action recognition [96]. In this paper:


  • We comprehensively review over 200 papers on deep learning for video action recognition. We walk the readers through the recent advancements chronologically and systematically, with popular papers explained in detail.
  • We benchmark widely adopted methods on the same set of datasets in terms of both accuracy and efficiency. We also release our implementations for full reproducibility.
  • We elaborate on challenges, open problems, and opportunities in this field to facilitate future research.

The rest of the survey is organized as follows. We first describe popular datasets used for benchmarking and existing challenges in section 2. Then we present recent advancements using deep learning for video action recognition in section 3, which is the major contribution of this survey. In section 4, we evaluate widely adopted approaches on standard benchmark datasets, and provide discussions and future research opportunities in section 5.


Datasets and Challenges


Deep learning methods usually improve in accuracy when the volume of the training data grows. In the case of video action recognition, this means we need large-scale annotated datasets to learn effective models.


For the task of video action recognition, datasets are often built by the following process: (1) Define an action list, by combining labels from previous action recognition datasets and adding new categories depending on the use case. (2) Obtain videos from various sources, such as YouTube and movies, by matching the video title/subtitle to the action list. (3) Provide temporal annotations manually to indicate the start and end position of the action, and (4) finally clean up the dataset by de-duplication and filtering out noisy classes/samples. Below we review the most popular large-scale video action recognition datasets in Table 1 and Figure 2.
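The four steps above can be sketched as a toy pipeline; the video records, action list, and substring-matching rule below are illustrative assumptions, not the procedure of any specific dataset.

```python
# Hypothetical sketch of the dataset-building steps above; the action list,
# video records, and the matching rule are illustrative, not from any real dataset.
def build_dataset(videos, action_list):
    """videos: list of dicts with 'id' and 'title'; returns (id, action) pairs."""
    samples = []
    for video in videos:
        title = video["title"].lower()
        # Step (2): match the video title against the action list.
        for action in action_list:
            if action in title:
                samples.append((video["id"], action))
                break
    # Step (4): clean up by de-duplicating on video id.
    seen, cleaned = set(), []
    for vid, action in samples:
        if vid not in seen:
            seen.add(vid)
            cleaned.append((vid, action))
    return cleaned

videos = [
    {"id": "a1", "title": "Riding a bike in the park"},
    {"id": "a1", "title": "Riding a bike in the park"},  # duplicate upload
    {"id": "b2", "title": "Cooking pasta tutorial"},
]
print(build_dataset(videos, ["riding a bike", "shaking hands"]))
# → [('a1', 'riding a bike')]
```

Real pipelines add the manual temporal annotation of step (3), which has no cheap programmatic analogue.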


Table 1. A list of popular datasets for video action recognition

Dataset Year # Samples Ave. Len # Actions
HMDB51 [109] 2011 7K ~5s 51
UCF101 [190] 2012 13.3K ~6s 101
Sports1M [99] 2014 1.1M ~5.5m 487
ActivityNet [40] 2015 28K [5,10]m 200
YouTube8M [1] 2016 8M 229.6s 3862
Charades [186] 2016 9.8K 30.1s 157
Kinetics400 [100] 2017 306K 10s 400
Kinetics600 [12] 2018 482K 10s 600
Kinetics700 [13] 2019 650K 10s 700
Sth-Sth V1 [69] 2017 108.5K [2,6]s 174
Sth-Sth V2 [69] 2017 220.8K [2,6]s 174
AVA [70] 2017 385K 15m 80
AVA-Kinetics [117] 2020 624K 15m,10s 80
MIT [142] 2018 1M 3s 339
HACS Clips [267] 2019 1.55M 2s 200
HVU [34] 2020 572K 10s 739
AViD [165] 2020 450K [3,15]s 887

HMDB51 [109] was introduced in 2011. It was collected mainly from movies, and a small proportion from public databases such as the Prelinger archive, YouTube and Google videos. The dataset contains 6,849 clips divided into 51 action categories, each containing a minimum of 101 clips. The dataset has three official splits. Most previous papers either report the top-1 classification accuracy on split 1 or the average accuracy over three splits.
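The three-split evaluation protocol can be made concrete with a small sketch; the predictions and labels below are made-up placeholders, not HMDB51 data.

```python
# Minimal sketch of the HMDB51/UCF101 protocol: top-1 accuracy on each
# official split, then the average over the three splits.
def top1_accuracy(preds, labels):
    correct = sum(p == l for p, l in zip(preds, labels))
    return correct / len(labels)

# One (predictions, ground-truth) pair per official split (placeholder data).
splits = [
    (["run", "jump", "sit"], ["run", "jump", "run"]),    # split 1: 2/3 correct
    (["run", "run", "sit"], ["run", "jump", "sit"]),     # split 2: 2/3 correct
    (["jump", "jump", "sit"], ["jump", "jump", "sit"]),  # split 3: 3/3 correct
]
accs = [top1_accuracy(p, l) for p, l in splits]
mean_acc = sum(accs) / len(accs)
print(round(mean_acc, 4))  # average accuracy over three splits → 0.7778
```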

UCF101 [190] was introduced in 2012 and is an extension of the previous UCF50 dataset. It contains 13,320 videos from YouTube spreading over 101 categories of human actions. The dataset has three official splits similar to HMDB51, and is also evaluated in the same manner.


Sports1M [99] was introduced in 2014 as the first large-scale video action dataset, consisting of more than 1 million YouTube videos annotated with 487 sports classes. The categories are fine-grained, which leads to low inter-class variations. It has an official 10-fold cross-validation split for evaluation.


ActivityNet [40] was originally introduced in 2015 and the ActivityNet family has several versions since its initial launch. The most recent ActivityNet 200 (V1.3) contains 200 human daily living actions. It has 10,024 training, 4,926 validation, and 5,044 testing videos. On average there are 137 untrimmed videos per class and 1.41 activity instances per video.

YouTube8M [1] was introduced in 2016 and is by far the largest-scale video dataset, containing 8 million YouTube videos (500K hours of video in total) annotated with 3,862 action classes. Each video is annotated with one or multiple labels by a YouTube video annotation system. This dataset is split into training, validation and test sets in the ratio 70:20:10. The validation set of this dataset has also been extended with human-verified segment annotations to provide temporal localization information.

Charades [186] was introduced in 2016 as a dataset for real-life concurrent action understanding. It contains 9,848 videos with an average length of 30 seconds. This dataset includes 157 multi-label daily indoor activities, performed by 267 different people. It has an official train-validation split that has 7,985 videos for training and the remaining 1,863 for validation.


Kinetics Family is now the most widely adopted benchmark. Kinetics400 [100] was introduced in 2017 and consists of approximately 240K training and 20K validation videos trimmed to 10 seconds from 400 human action categories. The Kinetics family continues to expand, with Kinetics600 [12] released in 2018 with 480K videos and Kinetics700 [13] in 2019 with 650K videos.

20BN-Something-Something [69] V1 was introduced in 2017 and V2 was introduced in 2018. This family is another popular benchmark that consists of 174 action classes that describe humans performing basic actions with everyday objects. There are 108,499 videos in V1 and 220,847 videos in V2. Note that the Something-Something dataset requires strong temporal modeling because most activities cannot be inferred based on spatial features alone (e.g. opening something, covering something with something).


AVA [70] was introduced in 2017 as the first large-scale spatio-temporal action detection dataset. It contains 430 15-minute video clips with 80 atomic action labels (only 60 labels were used for evaluation). The annotations were provided at each key frame, which leads to 214,622 training, 57,472 validation and 120,322 testing samples. The AVA dataset was recently expanded to AVA-Kinetics with 352,091 training, 89,882 validation and 182,457 testing samples [117].


Moments in Time [142] was introduced in 2018 and it is a large-scale dataset designed for event understanding. It contains one million 3 second video clips, annotated with a dictionary of 339 classes. Different from other datasets designed for human action understanding, Moments in Time dataset involves people, animals, objects and natural phenomena. The dataset was extended to Multi-Moments in Time (M-MiT) [143] in 2019 by increasing the number of videos to 1.02 million, pruning vague classes, and increasing the number of labels per video.

HACS [267] was introduced in 2019 as a new large-scale dataset for recognition and localization of human actions collected from Web videos. It consists of two kinds of manual annotations. HACS Clips contains 1.55M 2-second clip annotations on 504K videos, and HACS Segments has 140K complete action segments (from action start to end) on 50K videos. The videos are annotated with the same 200 human action classes used in ActivityNet (V1.3) [40].

HVU [34] dataset was released in 2020 for multi-label multi-task video understanding. This dataset has 572K videos and 3,142 labels. The official split has 481K, 31K and 65K videos for train, validation, and test respectively. This dataset has six task categories: scene, object, action, event, attribute, and concept. On average, there are about 2,112 samples for each label. The duration of the videos varies with a maximum length of 10 seconds.


AViD [165] was introduced in 2020 as a dataset for anonymized action recognition. It contains 410K videos for training and 40K videos for testing. Each video clip is between 3 and 15 seconds long, and in total there are 887 action classes. During data collection, the authors tried to collect data from various countries to deal with data bias. They also removed face identities to protect the privacy of video makers. Therefore, the AViD dataset might not be a proper choice for recognizing face-related actions.


Before we dive into the chronological review of methods, we present several visual examples from the above datasets in Figure 4 to show their different characteristics. In the top two rows, we pick action classes from the UCF101 [190] and Kinetics400 [100] datasets. Interestingly, we find that these actions can sometimes be determined by the context or scene alone. For example, the model can predict the action riding a bike as long as it recognizes a bike in the video frame. The model may also predict the action cricket bowling if it recognizes the cricket pitch. Hence for these classes, video action recognition may become an object/scene classification problem without the need to reason about motion/temporal information. In the middle two rows, we pick action classes from the Something-Something dataset [69]. This dataset focuses on human-object interaction, thus it is more fine-grained and requires strong temporal modeling. For example, if we only look at the first frame of dropping something and picking something up without looking at other video frames, it is impossible to tell these two actions apart. In the bottom row, we pick action classes from the Moments in Time dataset [142]. This dataset is different from most video action recognition datasets, and is designed to have large inter-class and intra-class variation representing dynamical events at different levels of abstraction. For example, the action climbing can have different actors (person or animal) in different environments (stairs or tree).


There are several major challenges in developing effective video action recognition algorithms.


In terms of datasets, first, defining the label space for training action recognition models is non-trivial, because human actions are usually composite concepts and the hierarchy of these concepts is not well-defined. Second, annotating videos for action recognition is laborious (e.g., annotators need to watch all the video frames) and ambiguous (e.g., it is hard to determine the exact start and end of an action). Third, some popular benchmark datasets (e.g., the Kinetics family) only release video links for users to download rather than the actual videos, which leads to methods being evaluated on different data. This makes it impossible to do fair comparisons between methods and gain insights.


In terms of modeling, first, videos capturing human actions have both strong intra- and inter-class variations. People can perform the same action at different speeds under various viewpoints. Besides, some actions share similar movement patterns that are hard to distinguish. Second, recognizing human actions requires simultaneous understanding of both short-term action-specific motion information and long-range temporal information. We might need a sophisticated model to handle different perspectives rather than using a single convolutional neural network. Third, the computational cost is high for both training and inference, hindering both the development and deployment of action recognition models. In the next section, we will demonstrate how video action recognition methods developed over the last decade to address the aforementioned challenges.


An Odyssey of Using Deep Learning for Video Action Recognition

In this section, we review deep learning based methods for video action recognition from 2014 to the present and introduce related earlier work in context.


From handcrafted features to CNNs

Despite there being some papers using Convolutional Neural Networks (CNNs) for video action recognition [200, 5, 91], hand-crafted features [209, 210, 158, 112], particularly Improved Dense Trajectories (IDT) [210], dominated the video understanding literature before 2015 due to their high accuracy and good robustness. However, hand-crafted features have a heavy computational cost [244], and are hard to scale and deploy.


With the rise of deep learning [107], researchers started to adapt CNNs for video problems. The seminal work DeepVideo [99] proposed to use a single 2D CNN model on each video frame independently and investigated several temporal connectivity patterns to learn spatio-temporal features for video action recognition, such as late fusion, early fusion and slow fusion. Though this model made early progress with ideas that would prove useful later, such as a multi-resolution network, its transfer learning performance on UCF101 [190] was 20% lower than that of hand-crafted IDT features (65.4% vs 87.9%). Furthermore, DeepVideo [99] found that a network fed individual video frames performs equally well when the input is changed to a stack of frames. This observation might indicate that the learnt spatio-temporal features did not capture the motion well. It also encouraged people to think about why CNN models did not outperform traditional hand-crafted features in the video domain, unlike in other computer vision tasks [107, 171].
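To make the fusion patterns concrete, here is a minimal sketch (with synthetic data and a stand-in classifier, not DeepVideo's actual architecture) of early fusion at the input versus late fusion at the output:

```python
import numpy as np

# Illustrative sketch of two temporal connectivity patterns studied in
# DeepVideo: "early fusion" merges frames at the input, "late fusion" merges
# per-frame predictions at the output. The tiny shapes are placeholders.
rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 3            # frames, height, width, channels
clip = rng.random((T, H, W, C))

# Early fusion: stack the T frames along the channel axis so a single 2D CNN
# sees all frames at once -> input of shape (H, W, T*C).
early_input = np.concatenate([clip[t] for t in range(T)], axis=-1)
assert early_input.shape == (H, W, T * C)

# Late fusion: run a (stand-in) per-frame classifier, then average the scores.
def frame_classifier(frame, num_classes=5):
    # Placeholder for a 2D CNN: a fixed linear map on the flattened frame.
    w = np.ones((frame.size, num_classes)) / frame.size
    return frame.reshape(-1) @ w

scores = np.stack([frame_classifier(clip[t]) for t in range(T)])
late_prediction = scores.mean(axis=0)  # video-level score, shape (5,)
```

Slow fusion sits between the two, merging small groups of frames progressively deeper in the network.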

Two-stream networks

Since video understanding intuitively needs motion information, finding an appropriate way to describe the temporal relationship between frames is essential to improving the performance of CNN-based video action recognition.


Optical flow [79] is an effective motion representation for describing object/scene movement. To be precise, it is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene. We show several visualizations of optical flow in Figure 5. As we can see, optical flow is able to describe the motion pattern of each action accurately. The advantage of using optical flow is that it provides information orthogonal to the RGB image. For example, the two images at the bottom of Figure 5 have cluttered backgrounds. Optical flow can effectively remove the non-moving background and result in a simpler learning problem compared to using the original RGB images as input. In addition, optical flow has been shown to work well on video problems. Traditional hand-crafted features such as IDT [210] also contain optical-flow-like features, such as the Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH).

Hence, Simonyan et al. [187] proposed two-stream networks, which include a spatial stream and a temporal stream, as shown in Figure 6. This method is related to the two-streams hypothesis [65], according to which the human visual cortex contains two pathways: the ventral stream (which performs object recognition) and the dorsal stream (which recognizes motion). The spatial stream takes raw video frame(s) as input to capture visual appearance information. The temporal stream takes a stack of optical flow images as input to capture motion information between video frames. To be specific, [187] linearly rescaled the horizontal and vertical components of the estimated flow (i.e., motion in the x-direction and y-direction) to a [0, 255] range and compressed them using JPEG. The output corresponds to the two optical flow images shown in Figure 6. The compressed optical flow images are then concatenated as the input to the temporal stream with a dimension of H x W x 2L, where H, W and L indicate the height, width and number of stacked flow frames. In the end, the final prediction is obtained by averaging the prediction scores from both streams.
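The construction of the temporal-stream input can be sketched as follows; the flow values and the [-20, 20] displacement range are synthetic assumptions, and a real pipeline would estimate flow from consecutive frame pairs:

```python
import numpy as np

# Sketch of assembling the temporal-stream input from L optical flow fields,
# following the description above. Flow values here are synthetic.
rng = np.random.default_rng(0)
H, W, L = 16, 16, 10
# One (H, W, 2) flow field per frame pair: x- and y-displacement.
flows = [rng.uniform(-20, 20, size=(H, W, 2)) for _ in range(L)]

def rescale_to_uint8(flow, lo=-20.0, hi=20.0):
    # Linearly map displacements in [lo, hi] to [0, 255], as done before
    # JPEG-compressing the flow images. The range is an assumed bound.
    scaled = (flow - lo) / (hi - lo) * 255.0
    return np.clip(scaled, 0, 255).astype(np.uint8)

# Concatenate the horizontal and vertical components of all L flow fields
# along the channel axis -> temporal-stream input of shape (H, W, 2L).
temporal_input = np.concatenate([rescale_to_uint8(f) for f in flows], axis=-1)
assert temporal_input.shape == (H, W, 2 * L)
```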

By adding the extra temporal stream, a CNN-based approach for the first time achieved performance similar to the previous best hand-crafted feature, IDT, on UCF101 (88.0% vs 87.9%) and on HMDB51 [109] (59.4% vs 61.1%). [187] makes two important observations. First, motion information is important for video action recognition. Second, it is still challenging for CNNs to learn temporal information directly from raw video frames. Pre-computing optical flow as the motion representation is an effective way for deep learning to reveal its power. Since [187] managed to close the gap between deep learning approaches and traditional hand-crafted features, many follow-up papers on two-stream networks emerged and greatly advanced the development of video action recognition. Here, we divide them into several categories and review them individually.

Using deeper network architectures

Two-stream networks [187] used a relatively shallow network architecture [107]. Thus a natural extension to the two-stream networks involves using deeper networks. However, Wang et al. [215] find that simply using deeper networks does not yield better results, possibly due to overfitting on the small-sized video datasets [190, 109]. Recall from section 2.1 that the UCF101 and HMDB51 datasets only have thousands of training videos. Hence, Wang et al. [217] introduced a series of good practices, including cross-modality initialization, synchronized batch normalization, corner cropping and multi-scale cropping data augmentation, a large dropout ratio, etc., to prevent deeper networks from overfitting. With these good practices, [217] was able to train a two-stream network with the VGG16 model [188] that outperforms [187] by a large margin on UCF101. These good practices have been widely adopted and are still being used. Later, Temporal Segment Networks (TSN) [218] performed a thorough investigation of network architectures, such as VGG16, ResNet [76] and Inception [198], and demonstrated that deeper networks usually achieve higher recognition accuracy for video action recognition. We will describe more details about TSN in section 3.2.4.

Two-stream fusion

Since there are two streams in a two-stream network, there will be a stage that needs to merge the results from both networks to obtain the final prediction. This stage is usually referred to as the spatial-temporal fusion step.


The easiest and most straightforward way is late fusion, which performs a weighted average of predictions from both streams. Despite late fusion being widely adopted [187, 217], many researchers claim that this may not be the optimal way to fuse the information between the spatial appearance stream and temporal motion stream. They believe that earlier interactions between the two networks could benefit both streams during model learning and this is termed as early fusion.
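A minimal sketch of late fusion, assuming softmax scores from each stream and an illustrative 1:1.5 weighting (the exact weights vary across papers; the temporal stream is often weighted higher):

```python
# Late fusion: a weighted average of the class scores from the spatial and
# temporal streams. Scores and the 1:1.5 weighting are placeholders.
def late_fusion(spatial_scores, temporal_scores, w_spatial=1.0, w_temporal=1.5):
    total = w_spatial + w_temporal
    return [
        (w_spatial * s + w_temporal * t) / total
        for s, t in zip(spatial_scores, temporal_scores)
    ]

spatial = [0.7, 0.2, 0.1]   # e.g. softmax scores over 3 actions
temporal = [0.3, 0.6, 0.1]
fused = late_fusion(spatial, temporal)
print(fused.index(max(fused)))  # predicted class after fusion → 0
```

Early fusion, by contrast, exchanges intermediate feature maps between the two networks rather than combining only their final scores.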

Fusion [50] is one of the first of several papers investigating the early fusion paradigm, including how to perform spatial fusion (e.g., using operators such as sum, max, bilinear, convolution and concatenation), where to fuse the network (e.g., the network layer where early interactions happen), and how to perform temporal fusion (e.g., using 2D or 3D convolutional fusion in later stages of the network). [50] shows that early fusion is beneficial for both streams to learn richer features and leads to improved performance over late fusion. Following this line of research, Feichtenhofer et al. [46] generalize ResNet [76] to the spatio-temporal domain by introducing residual connections between the two streams. Based on [46], Feichtenhofer et al. [47] further propose a multiplicative gating function for residual networks to learn better spatio-temporal features. Concurrently, [225] adopts a spatio-temporal pyramid to perform hierarchical early fusion between the two streams.

Recurrent neural networks

Since a video is essentially a temporal sequence, researchers have explored Recurrent Neural Networks (RNNs) for temporal modeling inside a video, particularly the usage of Long Short-Term Memory (LSTM) [78].


LRCN [37] and Beyond-Short-Snippets [253] are the first of several papers that use LSTM for video action recognition under the two-stream networks setting. They take the feature maps from CNNs as an input to a deep LSTM network, and aggregate frame-level CNN features into video-level predictions. Note that they use LSTM on the two streams separately, and the final results are still obtained by late fusion. However, there is no clear empirical improvement from LSTM models [253] over the two-stream baseline [187]. Following the CNN-LSTM framework, several variants have been proposed, such as bi-directional LSTM [205], CNN-LSTM fusion [56] and the hierarchical multi-granularity LSTM network [118]. [125] described VideoLSTM, which includes a correlation-based spatial attention mechanism and a lightweight motion-based attention mechanism. VideoLSTM not only shows improved results on action recognition, but also demonstrates how the learned attention can be used for action localization by relying on just the action class label. Lattice-LSTM [196] extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations, so that it can accurately model long-term and complex motions. ShuttleNet [183] is a concurrent work that considers both feedforward and feedback connections in an RNN to learn long-term dependencies. FASTER [272] designed a FAST-GRU to aggregate clip-level features from an expensive backbone and a cheap backbone. This strategy reduces the processing cost of redundant clips and hence accelerates the inference speed.
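The CNN-LSTM framework can be illustrated with a toy recurrent aggregator; a plain tanh RNN stands in for the LSTM here, and all shapes and weights are placeholders rather than any paper's configuration:

```python
import numpy as np

# Toy sketch of the CNN-LSTM idea: per-frame CNN features are fed to a
# recurrent unit, and the last hidden state gives the video-level prediction.
rng = np.random.default_rng(0)
T, D, Hid, K = 8, 32, 16, 5          # frames, feature dim, hidden dim, classes
frame_features = rng.random((T, D))  # stand-in for per-frame CNN features

W_xh = rng.normal(scale=0.1, size=(D, Hid))
W_hh = rng.normal(scale=0.1, size=(Hid, Hid))
W_hy = rng.normal(scale=0.1, size=(Hid, K))

h = np.zeros(Hid)
for x in frame_features:             # recurrent aggregation over time
    h = np.tanh(x @ W_xh + h @ W_hh)
video_logits = h @ W_hy              # video-level class scores
assert video_logits.shape == (K,)
```

An actual LSTM adds input, forget and output gates plus a cell state on top of this recurrence, which is what lets it retain information over longer spans.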


However, the work mentioned above [37, 253, 125, 196, 183] use different two-stream networks/backbones. The differences between various methods using RNNs are thus unclear. Ma et al. [135] build a strong baseline for fair comparison and thoroughly study the effect of learning spatiotemporal features by using RNNs. They find that it requires proper care to achieve improved performance, e.g., LSTMs require pre-segmented data to fully exploit the temporal information. RNNs are also intensively studied in video action localization [189] and video question answering [274], but these are beyond the scope of this survey.


Segment-based methods

Thanks to optical flow, two-stream networks are able to reason about short-term motion information between frames. However, they still cannot capture long-range temporal information. Motivated by this weakness of two-stream networks, Wang et al. [218] proposed the Temporal Segment Network (TSN) to perform video-level action recognition. Though initially proposed to be used with 2D CNNs, it is simple and generic. Thus recent work, using either 2D or 3D CNNs, is still built upon this framework.

To be specific, as shown in Figure 6, TSN first divides a whole video into several segments, where the segments distribute uniformly along the temporal dimension. Then TSN randomly selects a single video frame within each segment and forwards them through the network. Here, the network shares weights for input frames from all the segments. In the end, a segmental consensus is performed to aggregate information from the sampled video frames. The segmental consensus could be operators like average pooling, max pooling, bilinear encoding, etc. In this sense, TSN is capable of modeling long-range temporal structure because the model sees the content from the entire video. In addition, this sparse sampling strategy lowers the training cost over long video sequences but preserves relevant information.
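TSN's sparse sampling and segmental consensus can be sketched as follows; the frame counts and per-frame scores are placeholders:

```python
import random

# Sketch of TSN-style sparse sampling: split the frame indices into K uniform
# segments, draw one frame per segment, then average the per-frame scores
# (the "segmental consensus"). Frame scores here are placeholders.
def sample_segments(num_frames, num_segments, rng):
    length = num_frames // num_segments
    return [seg * length + rng.randrange(length) for seg in range(num_segments)]

def segmental_consensus(frame_scores):
    # Average consensus; max pooling or a learned encoding also fit here.
    num_classes = len(frame_scores[0])
    return [sum(s[c] for s in frame_scores) / len(frame_scores)
            for c in range(num_classes)]

rng = random.Random(0)
indices = sample_segments(num_frames=90, num_segments=3, rng=rng)
assert all(seg * 30 <= i < (seg + 1) * 30 for seg, i in enumerate(indices))

# Pretend each sampled frame produced these two-class scores:
scores = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
consensus = segmental_consensus(scores)
print(consensus)
```

Because only one frame per segment is forwarded through the (weight-shared) network, the cost per video stays constant regardless of video length.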


Given TSN’s good performance and simplicity, most two-stream methods afterwards became segment-based two-stream networks. Since the segmental consensus simply performs a max or average pooling operation, a feature encoding step might generate a better global video feature and lead to improved performance, as suggested in traditional approaches [179, 97, 157]. Deep Local Video Feature (DVOF) [114] proposed to treat the deep networks trained on local inputs as feature extractors and train another encoding function to map the global features into global labels. The Temporal Linear Encoding (TLE) network [36] appeared concurrently with DVOF, but its encoding layer was embedded in the network so that the whole pipeline could be trained end-to-end. VLAD3 and ActionVLAD [123, 63] also appeared concurrently. They extended the NetVLAD layer [4] to the video domain to perform video-level encoding, instead of using compact bilinear encoding as in [36]. To improve the temporal reasoning ability of TSN, the Temporal Relation Network (TRN) [269] was proposed to learn and reason about temporal dependencies between video frames at multiple time scales. The recent state-of-the-art efficient model TSM [128] is also segment-based. We will discuss it in more detail in section 3.4.2.

鉴于TSN的良好性能和简便性,此后大多数双流方法都变成了基于分段的双流网络。由于分段共识(segmental consensus)只是进行最大或平均池化操作,因此加入特征编码步骤来生成全局视频特征可能会带来性能改进,正如传统方法[179、97、157]所建议的那样。Deep Local Video Feature(DVOF)[114]提出将在局部输入上训练的深度网络视为特征提取器,并训练另一个编码函数将全局特征映射到全局标签。Temporal Linear Encoding(TLE)网络[36]与DVOF同时出现,但是其编码层嵌入于网络之中,因此可以对整个流程进行端到端的训练。VLAD3和ActionVLAD[123,63]也同时出现。他们将NetVLAD层[4]扩展到视频域以执行视频级编码,而不是像[36]中那样使用紧凑的双线性编码。为了提高TSN的时间推理能力,研究者提出了Temporal Relation Network(TRN)[269],以学习和推理多个时间尺度上视频帧之间的时间依赖性。最近的SOTA(state-of-the-art)高效模型TSM[128]也是基于分段的。我们将在3.4.2节中更详细地讨论它。

Multi-stream networks(多流网络)

Two-stream networks are successful because appearance and motion information are two of the most important properties of a video. However, other factors can help video action recognition as well, such as pose, objects, audio, and depth.


Pose information is closely related to human action. We can recognize most actions by just looking at a pose (skeleton) image without scene context. Although there is previous work on using pose for action recognition [150, 246], P-CNN [23] was one of the first deep learning methods that successfully used pose to improve video action recognition. P-CNN proposed to aggregate motion and appearance information along tracks of human body parts, in a similar spirit to trajectory pooling [214]. [282] extended this pipeline to a chained multi-stream framework that computed and integrated appearance, motion and pose. They introduced a Markov chain model that added these cues successively and obtained promising results on both action recognition and action localization. PoTion [25] was a follow-up work to P-CNN, but introduced a more powerful feature representation that encoded the movement of human semantic keypoints. They first ran a decent human pose estimator and extracted heatmaps for the human joints in each frame. They then obtained the PoTion representation by temporally aggregating these probability maps. PoTion is lightweight and outperforms previous pose representations [23, 282]. In addition, it was shown to be complementary to standard appearance and motion streams, e.g., combining PoTion with I3D [14] achieved state-of-the-art results on UCF101 (98.2%).

姿势信息与人类行为密切相关。我们可以通过仅查看不带场景上下文的姿势(骨骼)图像来识别大多数动作。尽管以前有使用姿势进行动作识别的工作[150,246],但P-CNN[23]是成功使用姿势改善视频动作识别的首批深度学习方法之一。P-CNN提出沿人体部位的轨迹聚合运动和外观信息,其思路类似于轨迹池化[214]。[282]将该流程扩展为一个链式多流框架,该框架计算并集成了外观、运动和姿势信息。他们引入了一个Markov链模型,该模型先后添加了这些线索,并在动作识别和动作定位方面都取得了可喜的成果。PoTion[25]是P-CNN的后续工作,但引入了更强大的特征表示,来编码人类语义关键点的运动。他们首先运行了一个性能良好的人体姿态估计器,并在每一帧中提取人体关节的heatmap。然后,他们通过在时间上汇总这些概率图来获得PoTion表示。PoTion表示轻巧,且性能优于以前的姿势表示[23,282]。另外,它被证明是对标准外观流和运动流的补充,例如,将PoTion与I3D[14]结合使用,在UCF101上达到了SOTA的成绩(98.2%)。

Object information is another important cue because most human actions involve human-object interaction. Wu [232] proposed to leverage both object features and scene features to help video action recognition. The object and scene features were extracted from state-of-the-art pretrained object and scene detectors. Wang et al. [252] took a step further to make the network end-to-end trainable. They introduced a two-stream semantic region based method, by replacing a standard spatial stream with a Faster RCNN network [171], to extract semantic information about the object, person and scene.

对象信息是另一个重要线索,因为大多数人类行为都涉及人与对象的交互。Wu[232]提出同时利用对象特征和场景特征来帮助视频动作识别。对象和场景特征是从具有SOTA成绩的预训练对象和场景检测器中提取的。Wang等人[252]更进一步,使网络可以端到端地训练。他们引入了一种基于语义区域的双流方法,通过用Faster RCNN[171]替换标准空间流,以提取有关对象、人物和场景的语义信息。

Audio signals usually come with video, and are complementary to the visual information. Wu et al. [233] introduced a multi-stream framework that integrates spatial, short-term motion, long-term temporal and audio cues in videos to digest complementary clues. Recently, Xiao et al. [237] introduced AudioSlowFast following [45], adding another audio pathway to model vision and sound in a unified representation.


In the RGB-D video action recognition field, using depth information is standard practice [59]. However, for vision-based video action recognition (e.g., given only monocular videos), we do not have access to ground truth depth information as in the RGB-D domain. An early attempt, Depth2Action [280], uses off-the-shelf depth estimators to extract depth information from videos and uses it for action recognition.

在RGB-D视频动作识别领域,使用深度信息是标准做法[59]。但是,对于基于视觉的视频动作识别(例如,仅给定单目视频),我们无法像RGB-D领域那样获取真实的深度信息。较早的尝试Depth2Action[280]使用现成的深度估计器从视频中提取深度信息,并将其用于动作识别。

Essentially, multi-stream networks are a form of multi-modality learning, using different cues as input signals to help video action recognition. We will discuss multi-modality learning further in section 5.12.

本质上,多流网络是一种多模态学习的方法,它使用不同的线索作为输入信号来帮助视频动作识别。我们将在5.12节中讨论有关多模态学习的更多信息。

The rise of 3D CNNs(3D卷积神经网络的崛起)

Pre-computing optical flow is computationally intensive and storage demanding, which is not friendly for large-scale training or real-time deployment. A conceptually easy way to understand a video is as a 3D tensor with two spatial dimensions and one temporal dimension. This leads to the usage of 3D CNNs as a processing unit to model the temporal information in a video.

预计算光流的计算量大且存储要求高,这对于大规模训练或实时部署而言并不友好。从概念上讲,理解视频的一种简单方式是将其视为具有两个空间维度和一个时间维度的3D张量。因此,可以使用3D CNN作为处理单元,对视频中的时序信息进行建模。

The seminal work on using 3D CNNs for action recognition is [91]. While inspiring, the network was not deep enough to show its potential. Tran et al. [202] extended [91] to a deeper 3D network, termed C3D. C3D follows the modular design of [188] and can be thought of as a 3D version of the VGG16 network. Its performance on standard benchmarks is not satisfactory, but it shows strong generalization capability and can be used as a generic feature extractor for various video tasks [250].

使用3D CNN进行动作识别的开创之作是[91]。尽管具有启发性,但该网络不够深,不足以展现其潜力。Tran等人[202]将[91]扩展到了更深的3D网络,称为C3D。C3D遵循[188]的模块化设计,可以将其视为VGG16网络的3D版本。它在标准基准测试上的表现并不令人满意,但是显示出强大的泛化能力,可以用作各种视频任务的通用特征提取器[250]。

However, 3D networks are hard to optimize. In order to train a 3D convolutional filter well, people need a large-scale dataset with diverse video content and action categories. Fortunately, there exists a dataset, Sports1M [99], which is large enough to support the training of a deep 3D network. However, the training of C3D takes weeks to converge. Despite the popularity of C3D, most users just adopt it as a feature extractor for different use cases instead of modifying/fine-tuning the network. This is partially the reason why two-stream networks based on 2D CNNs dominated the video action recognition domain from 2014 to 2017.

但是,3D网络很难优化。为了很好地训练3D卷积核,人们需要具有多样视频内容和动作类别的大规模数据集。幸运的是,存在一个数据集Sports1M[99],该数据集足够大,可以支持深度3D网络的训练。但是,对C3D的训练需要花费数周的时间才能收敛。尽管C3D流行,但大多数用户只是将其用作针对不同用例的特征提取器,而不是修改/微调网络。这也是2014年至2017年间,基于2D CNN的双流网络在视频动作识别领域占据主导地位的部分原因。

The situation changed when Carreira et al. [14] proposed I3D in 2017. As shown in Figure 6, I3D takes a video clip as input, and forwards it through stacked 3D convolutional layers. A video clip is a sequence of video frames, usually 16 or 32 frames. The major contributions of I3D are: 1) it adapts mature image classification architectures for 3D CNNs; 2) for model weights, it adopts a method developed for initializing optical flow networks in [217] to inflate the ImageNet pre-trained 2D model weights to their counterparts in the 3D model. Hence, I3D bypasses the dilemma that 3D CNNs have to be trained from scratch. With pre-training on a new large-scale dataset, Kinetics400 [100], I3D achieved 95.6% on UCF101 and 74.8% on HMDB51. I3D ended the era in which different methods reported numbers on small-sized datasets such as UCF101 and HMDB51. Publications following I3D needed to report their performance on Kinetics400, or other large-scale benchmark datasets, which pushed video action recognition to the next level. In the next few years, 3D CNNs advanced quickly and became top performers on almost every benchmark dataset. We will review the 3D CNN based literature in several categories below.

这个状况在Carreira等人[14]于2017年提出I3D后发生了改变。如图6所示,I3D将视频剪辑作为输入,并将其前向传播通过堆叠的3D卷积层。视频剪辑是一系列视频帧,通常采用16或32帧。I3D的主要贡献是:1)它将成熟的图像分类网络体系结构用于3D CNN;2)对于模型权重,它采用[217]中为初始化光流网络而开发的方法,将ImageNet预训练的2D模型权重扩张(inflate)为3D模型中的对应权重。因此,I3D绕开了必须从零开始训练3D CNN的难题。通过在新的大规模数据集Kinetics400[100]上进行预训练,I3D在UCF101上达到了95.6%的准确度,在HMDB51上达到了74.8%的准确度。I3D结束了以不同的方法在小型数据集(例如UCF101和HMDB51)上报告性能的时代。I3D之后的论文需要在Kinetics400或其他大规模基准数据集上报告其性能,这将视频动作识别领域推向了新的高度。在接下来的几年中,3D CNN迅速发展,并成为几乎所有基准数据集上的佼佼者。我们将在以下几个类别中回顾基于3D CNN的文献。
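The inflation trick can be illustrated in a few lines of NumPy (the shapes are illustrative, e.g. a 7x7 filter inflated to 7x7x7): repeating a pretrained 2D kernel along the new temporal axis and rescaling by the temporal extent preserves the 2D network's responses on a temporally constant video:

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D conv kernel of shape (C_out, C_in, kH, kW) to a 3D kernel
    of shape (C_out, C_in, t, kH, kW) by repeating it t times along the new
    temporal axis and dividing by t, so that t identical stacked frames
    produce the same activations as the original 2D filter on one frame."""
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t

# hypothetical stem filter: 64 output channels, 3 input channels, 7x7 spatial
w2d = np.random.default_rng(1).standard_normal((64, 3, 7, 7))
w3d = inflate_2d_to_3d(w2d, t=7)
```

Summing the inflated kernel over the temporal axis recovers the original 2D kernel, which is exactly why the ImageNet initialization carries over.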

We want to point out that 3D CNNs are not replacing two-stream networks, and the two are not mutually exclusive. They just use different ways to model the temporal relationships in a video. Furthermore, the two-stream approach is a generic framework for video understanding, rather than a specific method. As long as there are two networks, one for spatial appearance modeling using RGB frames and the other for temporal motion modeling using optical flow, the method may be categorized into the family of two-stream networks. In [14], they also built a temporal stream with the I3D architecture and achieved even higher performance, 98.0% on UCF101 and 80.9% on HMDB51. Hence, the final I3D model is a combination of 3D CNNs and two-stream networks. However, the contribution of I3D does not lie in the usage of optical flow.

我们要指出的是,3D CNN并不能替代双流网络,而且两者也不是互斥的。它们只是使用不同的方式来建模视频中的时序信息。此外,双流方法是用于视频理解的通用框架,而不是特定的方法。只要有两个网络,一个使用RGB帧对空间外观信息建模,另一个使用光流对时间运动信息建模,该方法就可以归类为双流网络。在[14]中,他们还使用I3D架构构建了一个时间流,并获得了更高的性能:UCF101上准确度为98.0%,HMDB51上准确度为80.9%。因此,最终的I3D模型是3D CNN和双流网络的结合。但是,I3D的贡献并不在于光流的使用。

Mapping from 2D to 3D CNNs(映射2D CNN至3D CNN)

2D CNNs enjoy the benefit of pre-training brought by large-scale image datasets such as ImageNet [30] and Places205 [270], which cannot be matched even by the largest video datasets available today. On these datasets, numerous efforts have been devoted to the search for 2D CNN architectures that are more accurate and generalize better. Below we describe the efforts to capitalize on these advances for 3D CNNs.

2D CNN可以享受大规模图像数据集(如ImageNet[30]和Places205[270])带来的预训练优势,即使当今最大的视频数据集也无法与之匹敌。在这些数据集上,人们进行了许多努力来寻找更准确、泛化性更好的2D CNN架构。下面,我们介绍为了让3D CNN利用这些进展所做的努力。

ResNet3D [74] directly took the 2D ResNet [76] and replaced all the 2D convolutional filters with 3D kernels. They believed that by using deep 3D CNNs together with large-scale datasets one can exploit the success of 2D CNNs on ImageNet. Motivated by ResNeXt [238], Chen et al. [20] presented a multi-fiber architecture that slices a complex neural network into an ensemble of lightweight networks (fibers), which facilitates information flow between the fibers while reducing the computational cost. Inspired by SENet [81], STCNet [33] proposes to integrate channel-wise information inside a 3D block to capture both spatial-channel and temporal-channel correlation information throughout the network.

ResNet3D[74]直接采用2D ResNet[76],并用3D卷积核替换了所有2D卷积核。他们认为,通过结合更深的3D CNN和大规模数据集,人们可以复现2D CNN在ImageNet上取得的成功。受ResNeXt[238]的启发,Chen等人[20]提出了一种多纤维(multi-fiber)架构,该架构将复杂的神经网络切分成数个轻量级网络(纤维)的集合,从而促进了纤维之间的信息流,同时降低了计算成本。受SENet[81]的启发,STCNet[33]提出在3D块内部整合逐通道(channel-wise)信息,以捕获整个网络中的空间通道和时间通道相关信息。

Unifying 2D and 3D CNNs(统一2D和3D CNN)

To reduce the complexity of 3D network training, P3D [169] and R2+1D [204] explore the idea of 3D factorization. To be specific, a 3D kernel (e.g., 3 x 3 x 3) can be factorized to two separate operations, a 2D spatial convolution (e.g., 1 x 3 x 3) and a 1D temporal convolution (e.g., 3 x 1 x 1). The differences between P3D and R2+1D are how they arrange the two factorized operations and how they formulate each residual block. Trajectory convolution [268] follows this idea but uses deformable convolution for the temporal component to better cope with motion.

为了降低3D网络训练的复杂性,P3D[169]和R2+1D[204]探索了3D分解的思想。具体而言,可以将3D卷积核(例如3 x 3 x 3)分解为两个单独的操作:2D空间卷积(例如1 x 3 x 3)和1D时间卷积(例如3 x 1 x 1)。P3D和R2+1D之间的区别在于它们如何安排这两个分解运算,以及如何构造每个残差块。轨迹卷积(trajectory convolution)[268]遵循了这一思想,但对时间分量使用了可变形卷积,以更好地应对运动信息。
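As a back-of-the-envelope check (with hypothetical channel counts), one can compare parameter budgets of the two designs. R(2+1)D additionally chooses the intermediate width M of the factorized block so that it roughly matches the parameter count of the full 3D convolution it replaces:

```python
def params_3d(cin, cout, t=3, d=3):
    # full 3D convolution: t x d x d kernel
    return cin * cout * t * d * d

def params_2plus1d(cin, cout, m, t=3, d=3):
    # 2D spatial conv (1 x d x d) into M channels, then 1D temporal conv (t x 1 x 1)
    return cin * m * d * d + m * cout * t

def matched_m(cin, cout, t=3, d=3):
    # intermediate width chosen so the factorized block matches the 3D budget
    return (t * d * d * cin * cout) // (d * d * cin + t * cout)

cin, cout = 64, 64                  # hypothetical channel configuration
m = matched_m(cin, cout)            # 144 for this setting
full = params_3d(cin, cout)         # 110592 parameters for the 3 x 3 x 3 kernel
fact = params_2plus1d(cin, cout, m) # same budget, but with an extra nonlinearity
```

With the matched M, the factorization keeps the parameter count while doubling the number of nonlinearities, one argument made for (2+1)D blocks.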

Another way of simplifying 3D CNNs is to mix 2D and 3D convolutions in a single network. MiCTNet [271] integrates 2D and 3D CNNs to generate deeper and more informative feature maps, while reducing training complexity in each round of spatio-temporal fusion. ARTNet [213] introduces an appearance-and-relation network built from a new building block, which consists of a spatial branch using 2D CNNs and a relation branch using 3D CNNs. S3D [239] combines the merits of the approaches mentioned above. It first replaces the 3D convolutions at the bottom of the network with 2D kernels, and finds that this kind of top-heavy network has higher recognition accuracy. S3D then factorizes the remaining 3D kernels as P3D and R2+1D do, to further reduce the model size and training complexity. A concurrent work named ECO [283] also adopts such a top-heavy network to achieve online video understanding.

简化3D CNN的另一种方法是在单个网络中混合2D和3D卷积。MiCTNet[271]集成了2D和3D CNN以生成更深、信息更丰富的特征图,同时降低了每一轮时空融合的训练复杂性。ARTNet[213]通过使用新的结构块引入了外观和关系网络(appearance-and-relation network)。结构块由使用2D CNN的空间分支和使用3D CNN的关系分支组成。S3D[239]结合了上述方法的优点。它首先用2D卷积核替换了网络底部的3D卷积,并发现这种头重脚轻的网络具有更高的识别准确度。然后,S3D像P3D和R2+1D一样分解其余的3D卷积核,以进一步减小模型的大小和训练复杂性。一项同期工作ECO[283]也采用了这样的头重脚轻网络来实现在线视频理解。

Long-range temporal modeling(长时序信息建模)

In 3D CNNs, long-range temporal connection may be achieved by stacking multiple short temporal convolutions, e.g., 3 x 3 x 3 filters. However, useful temporal information may be lost in the later stages of a deep network, especially for frames far apart.

在3D CNN中,可以通过堆叠多个短时序卷积(例如 3 x 3 x 3卷积核)来实现长时序连接。但是,有用的时间信息可能会在深度网络的后期阶段丢失,尤其是对于相距较远的帧而言。

In order to perform long-range temporal modeling, LTC [206] introduces and evaluates long-term temporal convolutions over a large number of video frames. However, limited by GPU memory, they have to sacrifice input resolution to use more frames. After that, T3D [32] adopted a densely connected structure [83] to keep the original temporal information as complete as possible for the final prediction. Later, Wang et al. [219] introduced a new building block, termed non-local. Non-local is a generic operation similar to self-attention [207], which can be used for many computer vision tasks in a plug-and-play manner. As shown in Figure 6, they used a spacetime non-local module after the later residual blocks to capture long-range dependencies in both the spatial and temporal domains, and achieved improved performance over baselines without bells and whistles. Wu et al. [229] proposed a feature bank representation, which embeds information of the entire video into a memory cell, to make context-aware predictions. Recently, V4D [264] proposed video-level 4D CNNs, to model the evolution of long-range spatio-temporal representations with 4D convolutions.

为了执行长时序信息建模,LTC[206]引入并评估了覆盖大量视频帧的长时序卷积。但是,受显存的限制,它们必须牺牲输入分辨率才能使用更多帧。此后,T3D[32]采用密集连接的结构[83]来使原始时序信息尽可能完整,以便进行最终预测。后来,Wang等人[219]引入了一个新的结构块,称为non-local。non-local是一种类似于自注意力机制[207]的通用操作,可以即插即用的方式用于许多计算机视觉任务。如图6所示,他们在靠后的残差块之后使用时空non-local模块,来同时捕获空间和时间范围内的长期依赖关系,并且在没有花里胡哨的情况下实现了优于基线的性能。Wu等人[229]提出了一个特征库表示,它将整个视频的信息嵌入到一个存储单元中,以进行上下文感知的预测。最近,V4D[264]提出了视频级4D CNN,用4D卷积对长时序时空表示的演化进行建模。
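A minimal NumPy sketch of the non-local operation (a dot-product attention variant over flattened space-time positions; the shapes, projection matrices, and residual placement follow the general recipe rather than any specific network) may help make the idea concrete:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    """Non-local operation on flattened space-time features.
    x: (N, C) where N = T*H*W positions; w_theta/w_phi/w_g: (C, C') projections,
    w_out: (C', C). Every position attends to every other position, so
    dependencies are captured regardless of spatial or temporal distance;
    the residual connection keeps the block plug-and-play."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    attn = softmax(theta @ phi.T, axis=-1)  # (N, N) pairwise affinities
    return x + (attn @ g) @ w_out           # residual connection

rng = np.random.default_rng(0)
n, c, ch = 4 * 7 * 7, 32, 16                # e.g. T=4, H=W=7, bottleneck C'=16
x = rng.standard_normal((n, c))
w = [rng.standard_normal(s) * 0.1 for s in [(c, ch), (c, ch), (c, ch), (ch, c)]]
y = nonlocal_block(x, *w)                   # same shape as x
```

Zero-initializing `w_out`, as is common for inserted residual blocks, makes the module an identity at the start of training.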

Enhancing 3D efficiency(提高3D卷积的效率)

In order to further improve the efficiency of 3D CNNs (i.e., in terms of GFLOPs, model parameters and latency), many variants of 3D CNNs began to emerge.

为了进一步提高3D CNN的效率(就GFLOP、模型参数和延迟而言),出现了许多3D CNN的变体。

Motivated by developments in efficient 2D networks, researchers started to adopt channel-wise separable convolution and extend it to video classification [111, 203]. CSN [203] reveals that it is a good practice to factorize 3D convolutions by separating channel interactions and spatio-temporal interactions, and is able to obtain state-of-the-art performance while being 2 to 3 times faster than the previous best approaches. These methods are also related to multi-fiber networks [20], as they are all inspired by group convolution.

受高效2D网络发展的推动,研究人员开始采用逐通道可分离卷积,并将其扩展到视频分类[111,203]。CSN[203]揭示了通过分离通道交互和时空交互来分解3D卷积是一种好的做法,并且能够达到SOTA的成绩,同时比以前的最佳方法快2至3倍。这些方法也都与多纤维网络[20]有关,因为它们都受到组卷积(group convolution)方法的启发。

Recently, Feichtenhofer et al. [45] proposed SlowFast, an efficient network with a slow pathway and a fast pathway. The network design is partially inspired by the biological Parvo- and Magnocellular cells in the primate visual systems. As shown in Figure 6, the slow pathway operates at low frame rates to capture detailed semantic information, while the fast pathway operates at high temporal resolution to capture rapidly changing motion. In order to incorporate motion information such as in two-stream networks, SlowFast adopts a lateral connection to fuse the representation learned by each pathway. Since the fast pathway can be made very lightweight by reducing its channel capacity, the overall efficiency of SlowFast is largely improved. Although SlowFast has two pathways, it is different from the two-stream networks [187], because the two pathways are designed to model different temporal speeds, not spatial and temporal modeling. There are several concurrent papers using multiple pathways to balance the accuracy and efficiency [43].
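The two sampling rates can be sketched as follows; the values tau=16 and alpha=8 match the defaults reported for SlowFast, but are used here purely for illustration:

```python
import numpy as np

def slowfast_sample(num_frames, tau=16, alpha=8):
    """Slow pathway samples with stride tau; fast pathway samples with stride
    tau/alpha, i.e. alpha times more frames at a higher temporal resolution.
    Returns the frame indices each pathway consumes."""
    slow = np.arange(0, num_frames, tau)
    fast = np.arange(0, num_frames, tau // alpha)
    return slow, fast

slow, fast = slowfast_sample(64)
# the slow pathway sees 4 frames, the fast pathway sees 32; the fast pathway
# stays cheap because its channel width is reduced (e.g. 1/8 of the slow
# pathway's), not because it sees fewer frames
```

The lateral connections then fuse the fast pathway's features into the slow pathway at matching temporal positions.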


Following this line, Feichtenhofer [44] introduced X3D, which progressively expands a 2D image classification architecture along multiple network axes, such as temporal duration, frame rate, spatial resolution, width, bottleneck width, and depth. X3D pushes the 3D model modification/factorization to an extreme, and is a family of efficient video networks that meet different requirements of target complexity. In a similar spirit, A3D [276] also leverages multiple network configurations. However, A3D trains these configurations jointly and deploys only one model during inference, which makes the final model more efficient. In the next section, we will continue to talk about efficient video modeling, but not based on 3D convolutions.

遵循这一思路,Feichtenhofer[44]引入了X3D。X3D沿多个网络轴(例如时序持续时间、帧速率、空间分辨率、宽度、瓶颈宽度(bottleneck width)和深度)逐步扩展2D图像分类体系结构。X3D将3D模型的修改/分解推到了极致,是一系列高效的视频网络,可以满足不同的目标复杂度要求。本着类似的思路,A3D[276]也利用了多种网络配置。但是,A3D同时训练这些网络配置,并且在推理期间仅部署一个模型。这样可以使最终的模型效率更高。在下一节中,我们将继续讨论高效的视频建模,但不再基于3D卷积。

Efficient Video Modeling(提高视频建模效率)

With the increase of dataset size and the need for deployment, efficiency becomes an important concern.


If we use methods based on two-stream networks, we need to pre-compute optical flow and store it on local disk. Taking the Kinetics400 dataset as an illustrative example, storing all the optical flow images requires 4.5TB of disk space. Such a huge amount of data makes I/O the tightest bottleneck during training, leading to wasted GPU resources and longer experiment cycles. In addition, pre-computing optical flow is not cheap, which means none of the two-stream network methods are real-time.


If we use methods based on 3D CNNs, we find that 3D CNNs are hard to train and challenging to deploy. In terms of training, a standard SlowFast network trained on the Kinetics400 dataset using a high-end 8-GPU machine takes 10 days to complete. Such a long experimental cycle and huge computing cost make video understanding research accessible only to big companies/labs with abundant computing resources. There are several recent attempts to speed up the training of deep video models [230], but these are still expensive compared to most image-based computer vision tasks. In terms of deployment, 3D convolution is not as well supported as 2D convolution across different platforms. Furthermore, 3D CNNs require more video frames as input, which adds additional I/O cost.

如果我们使用基于3D CNN的方法,人们仍然会发现3D CNN很难训练并且难以部署。在训练方面,使用高端的拥有8个GPU的机器在Kinetics400数据集上训练标准SlowFast的网络需要10天才能完成。如此长的训练周期和巨大的计算成本,使得视频理解研究只有拥有大量计算资源的大公司/实验室才能进行。最近有几种尝试来加快深度视频模型的训练速度[230],但是与大多数基于图像的计算机视觉任务相比,这些方法仍然昂贵。在部署方面,在不同平台下对3D卷积的支持程度仍不如2D卷积。此外,3D CNN需要更多的视频帧作为输入,这增加了额外的IO成本。

Hence, starting from year 2018, researchers started to investigate other alternatives to see how they could improve accuracy and efficiency at the same time for video action recognition. We will review recent efficient video modeling methods in several categories below.


Flow-mimic approaches(流模拟方法)

One of the major drawbacks of two-stream networks is their need for optical flow. Pre-computing optical flow is computationally expensive, storage demanding, and not end-to-end trainable for video action recognition. It is appealing if we can find a way to encode motion information without using optical flow, at least during inference time.


[146] and [35] are early attempts at learning to estimate optical flow inside a network for video action recognition. Although these two approaches do not need optical flow during inference, they require optical flow during training in order to train the flow estimation network. Hidden two-stream networks [278] proposed MotionNet to replace the traditional optical flow computation. MotionNet is a lightweight network that learns motion information in an unsupervised manner, and when concatenated with the temporal stream, it is end-to-end trainable. Thus, hidden two-stream CNNs [278] only take raw video frames as input and directly predict action classes without explicitly computing optical flow, in both the training and inference stages. PAN [257] mimics optical flow features by computing the difference between consecutive feature maps. Following this direction, [197, 42, 116, 164] continue to investigate end-to-end trainable CNNs that learn optical-flow-like features from data; they derive such features directly from the definition of optical flow [255]. MARS [26] and D3D [191] used knowledge distillation to combine two-stream networks into a single stream, e.g., by tuning the spatial stream to predict the outputs of the temporal stream. Recently, Kwon et al. [110] introduced the MotionSqueeze module to estimate motion features. The proposed module is end-to-end trainable and can be plugged into any network, similar to [278].

[146]和[35]是在网络内部学习估算光流以进行视频动作识别的早期尝试。尽管这两种方法在推理过程中不需要光流,但是它们在训练过程中仍需要光流以训练光流估计网络。Hidden two-stream networks[278]提出了MotionNet来代替传统的光流计算。MotionNet是一种轻量级的网络,以无监督的方式学习运动信息,并且与时间流连接时,是端到端可训练的。因此,hidden two-stream CNNs[278]仅采用原始视频帧作为输入即可直接预测动作类别,而无需显式计算光流,无论是在训练阶段还是推理阶段。PAN[257]通过计算连续特征图之间的差异来模拟光流特征。沿着这个方向,[197、42、116、164]继续研究端到端可训练的CNN,以从数据中学习类似光流的特征。他们直接从光流的定义[255]中推导这些特征。MARS[26]和D3D[191]使用知识蒸馏(knowledge distillation)将双流网络合并为单个流,例如,通过调整空间流以预测时间流的输出。最近,Kwon等人[110]引入了MotionSqueeze模块来估计运动特征。该模块是端到端可训练的,并且可以插入到任何网络中,类似于[278]。
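The feature-difference idea behind approaches like PAN can be sketched in one line of NumPy (the shapes below are hypothetical, and this is only the simplest stand-in for the learned motion modules described above):

```python
import numpy as np

def motion_like_features(feature_maps):
    """Approximate motion cues as differences between consecutive feature maps,
    in the spirit of flow-mimic methods that avoid explicit optical flow.
    feature_maps: (T, C, H, W) -> (T-1, C, H, W)."""
    return np.diff(feature_maps, axis=0)

# hypothetical per-frame feature maps: 8 frames, 16 channels, 14x14 spatial
feats = np.random.default_rng(0).standard_normal((8, 16, 14, 14))
motion = motion_like_features(feats)
```

Unlike pre-computed optical flow, such differences are produced inside the network, so the whole pipeline stays end-to-end trainable.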

Temporal modeling without 3D convolution(不使用3D卷积建立时序模型)

A simple and natural choice to model temporal relationship between frames is using 3D convolution. However, there are many alternatives to achieve this goal. Here, we will review some recent work that performs temporal modeling without 3D convolution.


Lin et al. [128] introduce a new method, termed temporal shift module (TSM). TSM extends the shift operation [228] to video understanding. It shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. In order to keep the spatial feature learning capacity, they put the temporal shift module inside the residual branch of a residual block, so all the information in the original activation is still accessible after the temporal shift through the identity mapping. The biggest advantage of TSM is that it can be inserted into a 2D CNN to achieve temporal modeling at zero computation and zero parameters. Similar to TSM, TIN [182] introduces a temporal interlacing module to model the temporal convolution.

Lin等人[128]引入了一种新的方法,称为temporal shift module(TSM)。TSM将移位操作[228]扩展到视频理解领域。它沿时间维度移动部分通道,从而促进相邻帧之间的信息交换。为了保持空间特征的学习能力,他们将时间偏移模块放在残差块的残差分支内。因此,时间偏移之后,通过恒等映射仍可访问原始激活中的所有信息。TSM的最大优点是可以将其插入2D CNN中,以零计算量和零参数实现时间建模。类似于TSM,TIN[182]引入了一个时间交织模块来对时间卷积进行建模。
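A minimal NumPy sketch of the shift operation follows; the 1/8-per-direction shift proportion matches the common TSM setting, and the bidirectional zero-padded variant shown here is one of the choices discussed for the offline model:

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels forward/backward along time.
    x: (T, C, H, W). The first C/shift_div channels shift to t+1, the next
    C/shift_div shift to t-1, the rest stay put; vacated slots are
    zero-padded. No multiplications are introduced, so the module adds
    zero parameters and essentially zero computation."""
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                  # shift forward in time
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]  # shift backward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]             # untouched channels
    return out

# tiny example: 4 frames, 8 channels, 1x1 spatial, so fold = 1
x = np.arange(4 * 8 * 1 * 1, dtype=float).reshape(4, 8, 1, 1)
y = temporal_shift(x)
```

After the shift, a plain 2D convolution at frame t mixes features from frames t-1, t and t+1, which is how temporal modeling comes for free.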

There are several recent 2D CNN approaches that use attention to perform long-term temporal modeling [92, 122, 132, 133]. STM [92] proposes a channel-wise spatio-temporal module to represent the spatio-temporal features and a channel-wise motion module to efficiently encode motion features. TEA [122] is similar to STM, but inspired by SENet [81], TEA uses motion features to recalibrate the spatio-temporal features to enhance the motion pattern. Specifically, TEA has two components: motion excitation and multiple temporal aggregation; the first handles short-range motion modeling, while the second efficiently enlarges the temporal receptive field for long-range temporal modeling. They are complementary and both lightweight, so TEA is able to achieve results competitive with the previous best approaches while keeping FLOPs as low as those of many 2D CNNs. Recently, TEINet [132] also adopts attention to enhance temporal modeling. Note that the above attention-based methods are different from non-local [219], because they use channel attention while non-local uses spatial attention.

最近有几种使用注意力进行长时序建模的2D CNN方法[92,122,132,133]。STM[92]提出用一个逐通道时空模块来表示时空特征,并用一个逐通道运动模块来有效地编码运动特征。TEA[122]与STM类似,但受SENet[81]的启发,TEA使用运动特征重新校准时空特征以增强运动模式。具体来说,TEA具有两个部分:运动激励(motion excitation)和多时间聚合(multiple temporal aggregation),第一个部分处理短距离运动建模,第二个部分有效地扩大时间感受野,用于长时序建模。它们互补且都是轻量级的,因此TEA能够在取得与以前最佳方法相当的结果的同时,将FLOPs保持在与许多2D CNN相当的低水平。最近,TEINet[132]也采用注意力机制来增强时间建模。请注意,上述基于注意力的方法与non-local方法[219]有所不同,因为它们使用通道注意力,而non-local方法使用空间注意力。


In this section, we are going to show several other directions that are popular for video action recognition in the last decade.


Trajectory-based methods(基于轨迹的方法)

While CNN-based approaches have demonstrated their superiority and gradually replaced the traditional hand-crafted methods, the traditional local feature pipeline still has merits that should not be ignored, such as the usage of trajectories.


Inspired by the good performance of trajectory-based methods [210], Wang et al. [214] proposed to conduct trajectory-constrained pooling to aggregate deep convolutional features into effective descriptors, which they term TDD. Here, a trajectory is defined as a path tracking down pixels in the temporal dimension. This new video representation shares the merits of both hand-crafted and deep-learned features, and became one of the top performers on both the UCF101 and HMDB51 datasets in 2015. Concurrently, Lan et al. [113] incorporated both Independent Subspace Analysis (ISA) and dense trajectories into the standard two-stream networks, and showed the complementarity between data-independent and data-driven approaches. Instead of treating CNNs as a fixed feature extractor, Zhao et al. [268] proposed trajectory convolution to learn features along the temporal dimension with the help of trajectories.

受到基于轨迹的方法[210]的良好性能的启发,Wang等人[214]提出进行轨迹约束池化,以将深度卷积特征聚合为有效的描述子,他们将其称为TDD。在此,轨迹被定义为在时间维度上追踪像素的路径。这种新的视频表示方法同时具有手工特征和深度学习特征的优点,并在2015年成为UCF101和HMDB51数据集上表现最好的方法之一。同时,Lan等人[113]将Independent Subspace Analysis(ISA)和密集轨迹(dense trajectories)合并到标准的双流网络中,并显示出了数据独立方法和数据驱动方法之间的互补性。Zhao等人[268]没有将CNN视为固定的特征提取器,而是提出了轨迹卷积,借助轨迹学习时间维度上的特征。

Rank pooling(等级池化)

There is another way to model temporal information inside a video, termed rank pooling (a.k.a. learning-to-rank). The seminal work in this line starts from VideoDarwin [53], which uses a ranking machine to learn the evolution of the appearance over time and returns a ranking function. The ranking function should be able to order the frames of a video temporally, so they use the parameters of this ranking function as a new video representation. VideoDarwin [53] is not a deep learning based method, but achieves comparable performance and efficiency.

还有另一种在视频中建模时序信息的方法,称为rank pooling(也称为learning-to-rank)。该系列中的开创性工作始于VideoDarwin[53],它使用排序机(ranking machine)来学习外观随时间的演变,并返回一个排序函数。排序函数应该能够在时间上对视频帧进行排序,因此他们使用该排序函数的参数作为新的视频表示。VideoDarwin[53]并不是基于深度学习的方法,但是可以实现可比的性能和效率。

To adapt rank pooling to deep learning, Fernando [54] introduces a differentiable rank pooling layer to achieve end-to-end feature learning. Following this direction, Bilen et al. [9] apply rank pooling on the raw image pixels of a video, producing a single RGB image per video, termed a dynamic image. Another concurrent work by Fernando [51] extends rank pooling to hierarchical rank pooling by stacking multiple levels of temporal encoding. Finally, [22] proposes a generalization of the original ranking formulation [53] using subspace representations and shows that it leads to a significantly better representation of the dynamic evolution of actions, while being computationally cheap.

为了使rank pooling适应深度学习,Fernando[54]引入了可微分的rank pooling层来实现端到端的特征学习。沿着这个方向,Bilen等人[9]在视频的原始图像像素上应用rank pooling,为每个视频生成单个RGB图像,称为动态图像(dynamic image)。Fernando[51]的另一项并行工作通过堆叠多个级别的时间编码,将rank pooling扩展为分层rank pooling。最后,[22]提出了使用子空间表示对原始排序公式[53]的推广,并表明它可以显著更好地表示动作的动态演变,同时计算成本低。
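A simplified rank pooling can be sketched as follows; a least-squares fit to the frame indices stands in for the pairwise ranking objective of the original method, and the time-averaged ("smoothed") features follow the VideoDarwin recipe, with all shapes hypothetical:

```python
import numpy as np

def rank_pool(frame_feats):
    """Simplified rank pooling: fit a linear function whose scores increase
    with time over smoothed frame features, and use its weight vector as the
    video representation. frame_feats: (T, D) -> (D,)."""
    t, _ = frame_feats.shape
    # smoothed features: running mean of the features up to time t
    v = np.cumsum(frame_feats, axis=0) / np.arange(1, t + 1)[:, None]
    targets = np.arange(1, t + 1, dtype=float)  # desired temporal order
    w, *_ = np.linalg.lstsq(v, targets, rcond=None)
    return w

feats = np.random.default_rng(0).standard_normal((30, 12))
video_repr = rank_pool(feats)  # a (12,)-dim vector encoding temporal evolution
```

The learned weight vector scores later frames higher than earlier ones, so it captures how the appearance evolves rather than what any single frame contains.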

Compressed video action recognition(压缩视频动作识别)

Most video action recognition approaches use raw videos (or decoded video frames) as input. However, there are several drawbacks of using raw videos, such as the huge amount of data and high temporal redundancy. Video compression methods usually store one frame by reusing contents from another frame (i.e., I-frame) and only store the difference (i.e., P-frames and B-frames) due to the fact that adjacent frames are similar. Here, the I-frame is the original RGB video frame, and P-frames and B-frames include the motion vector and residual, which are used to store the difference. Motivated by the developments in the video compression domain, researchers started to adopt compressed video representations as input to train effective video models.


Since the motion vector has a coarse structure and may contain inaccurate movements, Zhang et al. [256] adopted knowledge distillation to help the motion-vector-based temporal stream mimic the optical-flow-based temporal stream. However, their approach required extracting and processing each frame. They obtained recognition accuracy comparable to standard two-stream networks, but were 27 times faster. Wu et al. [231] used a heavyweight CNN for the I-frames and lightweight CNNs for the P-frames. This required that the motion vectors and residuals for each P-frame be referred back to the I-frame by accumulation. DMC-Net [185] is a follow-up work to [231] using an adversarial loss. It adopts a lightweight generator network to help the motion vectors capture fine motion details, instead of knowledge distillation as in [256]. A recent paper, SCSampler [106], also adopts compressed video representations for sampling salient clips; we will discuss it in section 3.5.4. As yet, none of the compressed approaches can deal with B-frames due to the added complexity.

由于运动矢量具有粗糙的结构并且可能包含不准确的运动,因此Zhang等人[256]采用知识蒸馏来帮助基于运动矢量的时间流模仿基于光流的时间流。但是,他们的方法需要提取和处理每一帧。他们获得了与标准双流网络相当的识别精度,但速度提高了27倍。Wu等人[231]对I帧使用了重量级的CNN,对P帧使用了轻量级的CNN。这就要求每个P帧的运动矢量和残差通过累加回溯到I帧。DMC-Net[185]是[231]的后续工作,使用了对抗损失(adversarial loss)。它采用轻量级的生成网络来帮助运动矢量捕获精细的运动细节,而不是像[256]中那样进行知识蒸馏。最近的论文SCSampler[106]也采用压缩视频表示来采样显著片段,我们将在3.5.4节中讨论它。迄今为止,由于复杂性的增加,没有一种压缩方法可以处理B帧。

Frame/Clip sampling(帧/片采样)

Most of the aforementioned deep learning methods treat every video frame/clip equally for the final prediction. However, discriminative actions only happen in a few moments, and most of the other video content is irrelevant or weakly related to the labeled action category. There are several drawbacks of this paradigm. First, training with a large proportion of irrelevant video frames may hurt performance. Second, such uniform sampling is not efficient during inference.


Partially inspired by how humans understand a video using just a few glimpses over the entire video [251], many methods were proposed to sample the most informative video frames/clips, both to improve performance and to make the model more efficient during inference.


An early work first proposed an end-to-end framework to simultaneously identify key volumes and do action classification. Later, [98] introduced AdaScan, which predicts the importance score of each video frame in an online fashion, which they term adaptive temporal pooling. Both of these methods achieve improved performance, but they still adopt the standard evaluation scheme, which does not show efficiency gains during inference. Recent approaches focus more on efficiency [41, 234, 8, 106]. AdaFrame [234] follows [251, 98] but uses a reinforcement learning based approach to search for more informative video clips. Concurrently, [8] uses a teacher-student framework, i.e., a see-it-all teacher can be used to train a compute-efficient see-very-little student. They demonstrate that the efficient student network can reduce the inference time by 30% and the number of FLOPs by approximately 90% with a negligible performance drop. Recently, SCSampler [106] trains a lightweight network to sample the most salient video clips based on compressed video representations, and achieves state-of-the-art performance on both the Kinetics400 and Sports1M datasets. They also empirically show that such saliency-based sampling is not only efficient, but also enjoys higher accuracy than using all the video frames.


Visual tempo(视觉速度)

Visual tempo describes how fast an action proceeds. Many action classes have different visual tempos. In most cases, the key to distinguishing them is their visual tempo, as they might share high similarities in visual appearance, such as walking, jogging and running [248]. Several papers explore different temporal rates (tempos) for improved temporal modeling [273, 147, 82, 281, 45, 248]. Initial attempts usually capture the video tempo by sampling raw videos at multiple rates and constructing an input-level frame pyramid [273, 147, 281]. Recently, SlowFast [45], as discussed in section 3.3.4, utilizes the characteristics of visual tempo to design a two-pathway network for a better accuracy and efficiency trade-off. CIDC [121] proposes directional temporal modeling along with a local backbone for video temporal modeling. TPN [248] extends tempo modeling to the feature level and shows consistent improvement over previous approaches.

视觉速度(visual tempo)是描述动作快慢的概念。许多动作类具有不同的视觉速度。在大多数情况下,区分它们的关键是视觉速度,因为它们在视觉外观上可能具有高度相似性,例如步行、慢跑和跑步[248]。有几篇论文探讨了不同的时间速率(速度)以改进时间建模[273,147,82,281,45,248]。最初的尝试通常是通过以多种速率采样原始视频并构建输入级帧金字塔[273、147、281]来捕获视频速度的。最近,如我们在3.3.4节中讨论的,SlowFast[45]利用视觉速度的特性来设计双路径网络,以实现更好的精度和效率的权衡。CIDC[121]提出了定向时间建模以及用于视频时间建模的局部骨干网络(local backbone)。TPN[248]将速度建模扩展到了特征级别,并显示出相比以前方法的持续改进。
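The input-level frame pyramid mentioned above can be sketched as strided frame sampling at several rates; the strides and clip lengths below are illustrative, not the exact SlowFast configuration:

```python
def sample_indices(num_frames, stride, clip_len):
    """Evenly strided frame indices for one tempo; wraps around via modulo
    if the video is shorter than stride * clip_len."""
    return [(i * stride) % num_frames for i in range(clip_len)]

# A 64-frame video sampled at two tempos, loosely mimicking a two-pathway
# input: a coarse (slow) pathway and a fine (fast) pathway.
slow = sample_indices(64, stride=16, clip_len=4)   # coarse tempo, few frames
fast = sample_indices(64, stride=2,  clip_len=32)  # fine tempo, many frames
```

Each index list selects the frames fed to the corresponding pathway, so the two pathways observe the same video at different temporal resolutions.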

We would like to point out that visual tempo is also widely used in self-supervised video representation learning [6, 247, 16] since it can naturally provide supervision signals to train a deep network. We will discuss more details on self-supervised video representation learning in section 5.13.


Evaluation and Benchmarking(评估和基准测试)

In this section, we compare popular approaches on benchmark datasets. To be specific, we first introduce standard evaluation schemes in section 4.1. Then we divide common benchmarks into three categories, scene-focused (UCF101, HMDB51 and Kinetics400 in section 4.2), motion-focused (Sth-Sth V1 and V2 in section 4.3) and multi-label (Charades in section 4.4). In the end, we present a fair comparison among popular methods in terms of both recognition accuracy and efficiency in section 4.5.

在本节中,我们将在一些基准数据集上比较测试一些流行方法。具体来说,我们首先在4.1节中介绍标准评估模式。然后,我们将常见基准分为三类:以场景为中心的(第4.2节中的UCF101,HMDB51和Kinetics400),以运动为中心的(第4.3节中的Sth-Sth V1和V2)和多标签的(第4.4节中的Charades)。最后,我们将在第4.5节中就识别准确性和效率两方面对流行方法进行公平比较。

Evaluation scheme(评估模式)

During model training, we usually randomly pick a video frame/clip to form mini-batch samples. However, for evaluation, we need a standardized pipeline in order to perform fair comparisons.


For 2D CNNs, a widely adopted evaluation scheme is to evenly sample 25 frames from each video, following [187, 217]. For each frame, we perform ten-crop data augmentation: we crop the 4 corners and the center, flip them horizontally, and average the prediction scores (before the softmax operation) over all crops; this means we use 250 frames per video for inference.

对于2D CNN,广泛采用的评估方案是遵循[187,217],从每个视频中均匀采样25帧。对于每一帧,我们执行十裁剪(ten-crop)数据增广:裁剪4个角和1个中心,将它们水平翻转,并平均所有裁剪的预测得分(在softmax操作之前),即每个视频使用250帧进行推断。
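The 250-view protocol above can be sketched as follows; the 256 × 340 frame size and 224 × 224 crop size are common choices rather than a fixed specification:

```python
import numpy as np

def evenly_sample(num_frames, n=25):
    """Indices of n frames evenly spaced over the video."""
    return [int(i * num_frames / n) for i in range(n)]

def ten_crop_views(frame):
    """4 corners + center crops, plus their horizontal flips (10 views).
    `frame` is an H x W x C array."""
    H, W, _ = frame.shape
    s = 224
    tops  = [0, 0, H - s, H - s, (H - s) // 2]
    lefts = [0, W - s, 0, W - s, (W - s) // 2]
    crops = [frame[t:t + s, l:l + s] for t, l in zip(tops, lefts)]
    crops += [c[:, ::-1] for c in crops]   # horizontal flips
    return crops

# A 300-frame video: 25 sampled frames x 10 crops = 250 views per video.
frames = [np.zeros((256, 340, 3)) for _ in evenly_sample(300)]
views = [v for f in frames for v in ten_crop_views(f)]
assert len(views) == 250
```

Each view is passed through the 2D CNN and the pre-softmax scores are averaged to give the video-level prediction.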

For 3D CNNs, a widely adopted evaluation scheme, termed the 30-view strategy, is to evenly sample 10 clips from each video, following [219]. For each video clip, we perform three-crop data augmentation: we scale the shorter spatial side to 256 pixels, take three 256 x 256 crops to cover the spatial dimensions, and average the prediction scores.

对于3D CNN,一种广泛采用的评估方案称为30视图(30-view)策略,即遵循[219],从每个视频中均匀采样10个剪辑。对于每个视频剪辑,我们执行三裁剪(three-crop)数据增广。具体来说,我们将较短的空间边缩放到256像素,并采用三张256 x 256的裁剪覆盖空间维度,然后平均预测得分。
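A minimal sketch of the 30-view score aggregation, assuming the per-(clip, crop) logits have already been computed by the 3D CNN:

```python
import numpy as np

def thirty_view_logits(clip_crop_logits):
    """Average logits over 10 uniformly sampled clips x 3 spatial crops.

    clip_crop_logits: array of shape (10, 3, num_classes), one logit vector
    per (clip, crop) forward pass."""
    assert clip_crop_logits.shape[:2] == (10, 3)
    return clip_crop_logits.mean(axis=(0, 1))

# Toy example with random logits for a 400-class problem (e.g. Kinetics400).
rng = np.random.default_rng(1)
logits = rng.normal(size=(10, 3, 400))
video_pred = int(np.argmax(thirty_view_logits(logits)))
```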

However, the evaluation schemes are not fixed. They keep evolving and adapting to new network architectures and different datasets. For example, TSM [128] uses only two clips per video for small-sized datasets [190, 109], and performs three-crop data augmentation for each clip despite being a 2D CNN. We will mention any deviations from the standard evaluation pipeline.

但是,评估方案不是固定的。它们在不断发展,以适应新的网络体系结构和不同的数据集。例如,对于小型数据集[190、109],TSM[128]每个视频仅使用两个剪辑,并且尽管其是2D CNN,仍对每个剪辑执行三裁剪数据增广。我们将标出与标准评估方法的任何差异。

In terms of evaluation metric, we report accuracy for single-label action recognition, and mAP (mean average precision) for multi-label action recognition.
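For concreteness, here are plain NumPy versions of the two metrics; official benchmarks such as Charades ship their own evaluation scripts, which should be preferred for reporting:

```python
import numpy as np

def top1_accuracy(scores, labels):
    """Single-label: fraction of videos whose argmax matches the label.
    scores: (N, C); labels: (N,)."""
    return float((scores.argmax(axis=1) == labels).mean())

def mean_average_precision(scores, targets):
    """Multi-label mAP: mean over classes of average precision.
    scores: (N, C) prediction scores; targets: (N, C) binary labels."""
    aps = []
    for c in range(scores.shape[1]):
        order = np.argsort(-scores[:, c])        # rank videos by score
        rel = targets[order, c]
        if rel.sum() == 0:
            continue                              # class absent from this split
        precision = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((precision * rel).sum() / rel.sum())
    return float(np.mean(aps))
```

A perfect ranking gives both metrics a value of 1.0; mAP degrades gracefully as relevant videos slip down the per-class ranking.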


Scene-focused datasets(以场景为中心的数据集)

Here, we compare recent state-of-the-art approaches on scene-focused datasets: UCF101, HMDB51 and Kinetics400. We call them scene-focused because most action videos in these datasets are short and can be recognized from static scene appearance alone, as shown in Figure 4.


Table 2. Results of widely adopted methods on three scene-focused datasets. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet, S: Sports1M and K: Kinetics400. NL represents non local.
表2:在三个以场景为中心的数据集上被广泛采用的方法的结果。Pre-train指对模型进行预训练的数据集。I:ImageNet,S:Sports1M,K:Kinetics400。NL代表non local。

Method Pre-train Flow Backbone Venue UCF101 HMDB51 Kinetics400
DeepVideo[99] I - AlexNet CVPR 2014 65.4 - -
Two-stream [187] I X CNN-M NeurIPS 2014 88.0 59.4 -
LRCN[37] I X CNN-M CVPR 2015 82.3 - -
TDD[214] I X CNN-M CVPR 2015 90.3 63.2 -
Fusion[50] I X VGG16 CVPR 2016 92.5 65.4 -
TSN[218] I X BN-Inception ECCV 2016 94.0 68.5 73.9
TLE[36] I X BN-Inception CVPR 2017 95.6 71.1 -
___ ___ ___ ___ ___ ___ ___ ___
C3D[202] S - VGG16-like ICCV 2015 82.3 56.8 59.5
I3D[14] I,K - BN-Inception-like CVPR 2017 95.6 74.8 71.1
P3D[169] S - ResNet50-like ICCV 2017 88.6 - 71.6
ResNet3D[74] K - ResNeXt101-like CVPR 2018 94.5 70.2 65.1
R2+1D[204] K - ResNet34-like CVPR 2018 96.8 74.5 72.0
NL I3D[219] I - ResNet101-like CVPR 2018 - - 77.7
S3D[239] I,K - BN-Inception-like ECCV 2018 96.8 75.9 74.7
SlowFast[45] - - ResNet101-NL-like ICCV 2019 - - 79.8
X3D-XXL[44] - - ResNet-like CVPR 2020 - - 80.4
TPN[248] - - ResNet101-like CVPR 2020 - - 78.9
CIDC[121] - - ResNet50-like ECCV 2020 97.9 75.2 75.5
___ ___ ___ ___ ___ ___ ___ ___
Hidden TSN[278] I - BN-Inception ACCV 2018 93.2 66.8 72.8
OFF[197] I - BN-Inception CVPR 2018 96.0 74.2 -
TSM[128] I - ResNet50 ICCV 2019 95.9 73.5 74.1
STM[92] I,K - ResNet50-like ICCV 2019 96.2 72.2 73.7
TEINet[132] I,K - ResNet50-like AAAI 2020 96.7 72.1 76.2
TEA[122] I,K - ResNet50-like CVPR 2020 96.9 73.3 76.1
MSNet[110] I,K - ResNet50-like ECCV 2020 - 77.4 76.4

Following the chronology, we first present results for early attempts at using deep learning and for the two-stream networks at the top of Table 2. We make several observations. First, without motion/temporal modeling, the performance of DeepVideo [99] is inferior to all other approaches. Second, it is helpful to transfer knowledge from traditional (non-CNN-based) methods to deep learning. For example, TDD [214] uses trajectory pooling to extract motion-aware CNN features. TLE [36] embeds global feature encoding, an important step in the traditional video action recognition pipeline, into a deep network.


We then compare 3D CNN based approaches in the middle of Table 2. Despite training on a large corpus of videos, C3D [202] performs worse than concurrent work [187, 214, 217], possibly due to the difficulty of optimizing 3D kernels. Motivated by this, several papers, including I3D [14], P3D [169], R2+1D [204] and S3D [239], factorize 3D convolution filters into 2D spatial kernels and 1D temporal kernels to ease the training. In addition, I3D introduces an inflation strategy to avoid training from scratch by bootstrapping the 3D model weights from well-trained 2D networks. With these techniques, they achieve performance comparable to the best two-stream network methods [36] without the need for optical flow. Furthermore, recent 3D models obtain even higher accuracy by using more training samples [203], additional pathways [45], or architecture search [44].

然后,我们在表2的中间比较了基于3D CNN的方法。尽管对大量视频进行了训练,但C3D[202]的性能不及一些同时的工作[187、214、217],这可能是由于3D卷积核难以优化。因此,几篇论文-I3D[14],P3D[169],R2+1D[204]和S3D[239]将3D卷积核分解为2D空间核和1D时间核,以简化训练。此外,I3D引入了一种膨胀策略,通过将来自训练有素的2D网络的权重导入3D模型来避免从头开始进行训练。通过使用这些技术,它们不需要光流就可以达到与最佳双流网络方法相当的性能[36]。此外,最近的3D模型通过使用更多的训练样本[203],更多途径[45]或体系结构搜索[44]获得了更高的准确性。
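The parameter-count argument behind this factorization can be checked directly: with an appropriate intermediate width, a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution matches the parameter count of a full t x d x d kernel (the `mid` value below is chosen for this toy setting, following the matching rule used by R(2+1)D):

```python
def conv3d_params(c_in, c_out, t, d):
    """Weights in a full t x d x d 3D convolution (bias ignored)."""
    return c_in * c_out * t * d * d

def r2plus1d_params(c_in, c_out, t, d, mid):
    """1 x d x d spatial conv into `mid` channels, then a t x 1 x 1 temporal
    conv to c_out channels. R(2+1)D picks `mid` so the total roughly matches
    the full 3D kernel, but with an extra nonlinearity in between."""
    return c_in * mid * d * d + mid * c_out * t

# Example: 64 -> 64 channels with a 3x3x3 kernel; mid=144 matches exactly.
full = conv3d_params(64, 64, 3, 3)
fact = r2plus1d_params(64, 64, 3, 3, mid=144)
```

The factorized block therefore doubles the number of nonlinearities at equal capacity, which is one reason it is easier to optimize.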

Finally, we show recent efficient models at the bottom of Table 2. These methods achieve higher recognition accuracy than the two-stream networks (top), and performance comparable to the 3D CNNs (middle). Since they are 2D CNNs and do not use optical flow, these methods are efficient in terms of both training and inference. Most of them run in real time, and some can perform online video action recognition [128]. We believe 2D CNNs plus temporal modeling is a promising direction given the need for efficiency. Here, the temporal modeling could be attention based, flow based or 3D kernel based.

最后,我们在表2的底部显示了最近有效的模型。我们可以看到,这些方法能够实现比两流网络更高的识别精度(顶部),并且具有与3D CNN相当的性能(中间)。由于它们是2D CNN,并且不使用光流,因此这些方法在训练和推理方面都是有效的。其中大多数是实时方法,有些可以进行在线视频动作识别[128]。由于效率的需要,我们认为2D CNN和时序建模是一个有前途的方向。在这里,时序建模可以是基于注意力,基于流或基于3D卷积的。

Motion-focused datasets(以运动为中心的数据集)

In this section, we compare the recent state-of-the-art approaches on the 20BN-Something-Something (Sth-Sth) dataset. We report top-1 accuracy on both V1 and V2. The Sth-Sth datasets focus on humans performing basic actions with daily objects. Different from scene-focused datasets, the background scene in Sth-Sth contributes little to the final action class prediction. In addition, there are classes such as “Pushing something from left to right” and “Pushing something from right to left”, which require strong motion reasoning.


Table 3. Results of widely adopted methods on the Something-Something V1 and V2 datasets. We only report numbers obtained without optical flow. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet and K: Kinetics400. Views means the number of temporal clips multiplied by the number of spatial crops, e.g., 30 means 10 temporal clips with 3 spatial crops per clip.
表3:在Something-Something V1和V2数据集中采用众多方法的结果。我们仅报告不使用光流的结果。Pre-train指对模型进行预训练的数据集。I:ImageNet,K:Kinetics400。View表示时间片段的数量乘以空间裁剪,例如30表示10个时间片段,每个片段具有3个空间裁剪。

Method Pre-train Backbone Frames x Views Venue V1 Top1 V2 Top1
TSN[218] I BN-Inception 8 x 1 ECCV 2016 19.7 -
I3D[14] I,K ResNet50-like 32 x 6 CVPR 2017 41.6 -
NL I3D[219] I,K ResNet50-like 32 x 6 CVPR 2018 44.4 -
NL I3D + GCN[220] I,K ResNet50-like 32 x 6 ECCV 2018 46.1 -
ECO[283] K BNIncep+ResNet18 16 x 1 ECCV 2018 41.4 -
TRN[269] I BN-Inception 8 x 1 ECCV 2018 42.0 48.8
STM[92] I ResNet50-like 8 x 30 ICCV 2019 49.2 -
STM[92] I ResNet50-like 16 x 30 ICCV 2019 50.7 -
TSM[128] K ResNet50 8 x 1 ICCV 2019 45.6 59.1
TSM[128] K ResNet50 16 x 1 ICCV 2019 47.2 63.4
bLVNet-TAM[43] I BLNet-like 8 x 2 NeurIPS 2019 46.4 59.1
bLVNet-TAM[43] I BLNet-like 16 x 2 NeurIPS 2019 48.4 61.7
TEA[122] I ResNet50-like 8 x 1 CVPR 2020 48.9 -
TEA[122] I ResNet50-like 16 x 1 CVPR 2020 51.9 -
TSM + TPN[248] K ResNet50-like 8 x 1 CVPR 2020 49.0 62.0
MSNet[110] I ResNet50-like 8 x 1 ECCV 2020 50.9 63.0
MSNet[110] I ResNet50-like 16 x 1 ECCV 2020 52.1 64.7
TIN[182] K ResNet50-like 16 x 1 AAAI 2020 47.0 60.1
TEINet[132] I ResNet50-like 8 x 1 AAAI 2020 47.4 61.3
TEINet[132] I ResNet50-like 16 x 1 AAAI 2020 49.9 62.1

By comparing the previous work in Table 3, we observe that using longer input (e.g., 16 frames) is generally better. Moreover, methods that focus on temporal modeling [128, 122, 92] work better than stacked 3D kernels [14]. For example, TSM [128], TEA [122] and MSNet [110] insert an explicit temporal reasoning module into 2D ResNet backbones and achieve state-of-the-art results. This implies that the Sth-Sth datasets need strong temporal motion reasoning as well as spatial semantic information.

通过比较表3中的先前工作,我们观察到使用更长的输入(例如16帧)通常更好。此外,专注于时序建模的方法[128、122、92]比堆叠的3D卷积[14]可以更好地工作。例如,TSM[128],TEA[122]和MSNet[110]将显式的时序推理模块插入2D ResNet后端模型中,并获得最新的结果。这意味着Sth-Sth数据集需要强大的时间运动推理以及空间语义信息。

Multi-label datasets(多标签数据集)

In this section, we first compare the recent state-of-the-art approaches on the Charades dataset [186], and then list some recent work that assembles models or uses additional object information on Charades. Comparing the previous work in Table 4, we make the following observations. First, 3D models [229, 45] generally perform better than 2D models [186, 231] and 2D models with optical flow input. This indicates that spatiotemporal reasoning is critical for understanding long-term complex concurrent actions. Second, longer input helps recognition [229], probably because some actions require long-term features to be recognized. Third, models with strong backbones pre-trained on larger datasets generally perform better [45]. This is because Charades is a medium-scale dataset that does not contain enough diversity to train a deep model.

在本节中,我们首先比较Charades数据集[186]上的SOTA成果,然后列出一些在Charades上使用assemble model或additional object information的最新工作。比较表4中的先前工作,我们得出以下观察结果。首先,三维模型[229,45]通常比二维模型[186,231],和有光流输入的2D模型有更好的表现。这表明时空推理对于长期复杂的并发动作理解至关重要。其次,较长的输入有助于识别[229],这可能是因为某些动作需要长期特征来识别。第三,有较强的经大规模数据集预训练的后端模型通常有更好的表现[45]。这是因为Charades是一个中等规模的数据集,且没有足够的多样性,以训练深度网络。

Table 4. Charades evaluation using mAP, calculated using the officially provided script. NL: non-local network. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet, K400: Kinetics400 and K600: Kinetics600.

Method Extra-information Backbone Pre-train Venue mAP
2D CNN[186] - AlexNet I ECCV 2016 11.2
Two-stream[186] flow VGG16 I ECCV 2016 22.4
ActionVLAD[63] - VGG16 I CVPR 2017 21.0
CoViAR[231] - ResNet50-like - CVPR 2018 21.9
MultiScale TRN[269] - BN-Inception-like I ECCV 2018 25.2
___ ___ ___ ___ ___ ___
I3D[14] - BN-Inception-like K400 CVPR 2017 32.9
STRG[220] - ResNet101-NL-like K400 ECCV 2018 39.7
LFB[229] - ResNet101-NL-like K400 CVPR 2019 42.5
TC[84] - ResNet101-NL-like K400 ICCV 2019 41.1
HAF[212] IDT + flow BN-Inception-like K400 ICCV 2019 43.1
SlowFast[45] - ResNet-like K400 ICCV 2019 42.5
SlowFast[45] - ResNet-like K600 ICCV 2019 45.2
___ ___ ___ ___ ___ ___
Action-Genome[90] person + object ResNet-like - CVPR 2020 60.1
AssembleNet++[177] flow + object ResNet-like - ECCV 2020 59.9

Recently, researchers have explored an alternative direction for complex concurrent action recognition by assembling models [177] or providing additional human-object interaction information [90]. These papers significantly outperform previous literature that only fine-tunes a single model on Charades. This demonstrates that exploring spatio-temporal human-object interactions and finding ways to avoid overfitting are the keys to concurrent action understanding.

最近,研究人员通过assembling model[177]或提供其他人对物体的交互信息[90]探索了复杂的并发动作识别的替代方向。这些论文大大优于以前的文献,后者仅对Charades上的单个模型进行了微调。它表明,探索时空人与物体之间的相互作用并找到避免过度拟合的方法是同时进行动作理解的关键。

Speed comparison(速度比较)

To deploy a model in real-life applications, we usually need to know whether it meets the speed requirement before we can proceed. In this section, we evaluate the approaches mentioned above to perform a thorough comparison in terms of (1) number of parameters, (2) FLOPS, (3) latency and (4) frames per second.


We present the results in Table 5. Here, we use the models in the GluonCV video action recognition model zoo, since all these models are trained with the same data, the same data augmentation strategy and the same 30-view evaluation scheme, which ensures a fair comparison. All timings are done on a single Tesla V100 GPU with 105 repeated runs, with the first 5 runs discarded as warm-up. We always use a batch size of 1. In terms of model input, we use the settings suggested in the original papers.

Table 5. Comparison of both efficiency and accuracy. Top: 2D models; bottom: 3D models. FLOPS means floating point operations per second. FPS indicates how many video frames the model can process per second. Latency is the actual running time to complete one forward pass given the input. Acc is the top-1 accuracy on the Kinetics400 dataset. The TSN, I3D and I3D-Slow families are pre-trained on ImageNet. The R2+1D, SlowFast and TPN families are trained from scratch.

Model Input FLOPS(G) # of params(M) FPS Latency(s) Acc(%)
TSN-ResNet18[218] 3x224x224 3.671 21.49 151.96 0.0066 69.85
TSN-ResNet34[218] 3x224x224 1.819 11.382 264.01 0.0038 66.73
TSN-ResNet50[218] 3x224x224 4.110 24.328 114.05 0.0088 70.88
TSN-ResNet101[218] 3x224x224 7.833 43.320 59.56 0.0167 72.25
TSN-ResNet152[218] 3x224x224 11.558 58.963 36.93 0.0271 72.45
___ ___ ___ ___ ___ ___ ___
I3D-ResNet50[14] 3x32x224x224 33.275 28.863 1719.50 0.0372 74.87
I3D-ResNet101[14] 3x32x224x224 51.864 52.574 1137.74 0.0563 75.10
I3D-ResNet50-NL[219] 3x32x224x224 47.737 38.069 1403.16 0.0456 75.17
I3D-ResNet101-NL[219] 3x32x224x224 66.326 61.780 999.94 0.0640 75.81
R2+1D-ResNet18[204] 3x16x112x112 40.645 31.505 804.31 0.0398 71.72
R2+1D-ResNet34[204] 3x16x112x112 75.400 61.832 503.17 0.0636 72.63
R2+1D-ResNet50[204] 3x16x112x112 65.543 53.950 667.06 0.0480 74.92
R2+1D-ResNet152*[204] 3x32x112x112 252.900 118.227 546.19 0.1172 81.34
CSN-ResNet152*[203] 3x32x224x224 74.758 29.704 435.77 0.1469 83.18
I3D-Slow-ResNet50[45] 3x8x224x224 41.919 32.454 1702.60 0.0376 74.41
I3D-Slow-ResNet50[45] 3x16x224x224 83.838 32.454 1406.00 0.0455 76.36
I3D-Slow-ResNet50[45] 3x32x224x224 167.675 32.454 860.74 0.0744 77.89
I3D-Slow-ResNet101[45] 3x8x224x224 85.675 60.359 1114.22 0.0574 76.15
I3D-Slow-ResNet101[45] 3x16x224x224 171.348 60.359 876.20 0.0730 77.11
I3D-Slow-ResNet101[45] 3x32x224x224 342.696 60.359 541.16 0.1183 78.57
SlowFast-ResNet50-4x16[45] 3x32x224x224 27.820 34.480 1396.45 0.0458 75.25
SlowFast-ResNet50-8x8[45] 3x32x224x224 50.583 34.566 1297.24 0.0493 76.66
SlowFast-ResNet101-8x8[45] 3x32x224x224 96.794 62.827 889.62 0.0719 76.95
TPN-ResNet50[248] 3x8x224x224 50.457 71.800 1350.39 0.0474 77.04
TPN-ResNet50[248] 3x16x224x224 99.929 71.800 1128.39 0.0567 77.33
TPN-ResNet50[248] 3x32x224x224 198.874 71.800 716.89 0.0893 78.90
TPN-ResNet101[248] 3x8x224x224 94.366 99.705 942.61 0.0679 78.10
TPN-ResNet101[248] 3x16x224x224 187.594 99.705 754.00 0.0849 79.39
TPN-ResNet101[248] 3x32x224x224 374.048 99.705 479.77 0.1334 79.70

我们将结果表示在表5中。在这里,我们使用GluonCV视频动作识别模型库中的模型,因为所有这些模型都是使用相同的数据,相同的数据增广策略和相同的30视图评估方案训练的,因此比较合理。所有时间都是在单个Tesla V100 GPU上进行的,且迭代105次,而前5次运行会因预热而被忽略。我们始终使用1的批次大小。在模型输入方面,我们使用原始论文中的建议设置。
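Our timing protocol can be sketched as follows; the dummy CPU workload stands in for a network forward pass, and on a GPU one would additionally synchronize the device before reading the clock:

```python
import time

def benchmark(forward, repeats=105, warmup=5):
    """Timing protocol from above: repeated runs with the first few discarded
    as warm-up, batch size 1. `forward` is any zero-argument callable that
    performs one forward pass. Note: this CPU-only sketch omits the GPU
    synchronization that accurate device timing requires."""
    times = []
    for i in range(repeats):
        start = time.perf_counter()
        forward()
        elapsed = time.perf_counter() - start
        if i >= warmup:
            times.append(elapsed)  # keep only post-warm-up runs
    return sum(times) / len(times)

def fps(frames_per_clip, latency):
    """Frames processed per second given the clip length and measured latency."""
    return frames_per_clip / latency

# Dummy workload instead of an actual network forward pass.
latency = benchmark(lambda: sum(range(10000)), repeats=25, warmup=5)
```

The FPS column in Table 5 follows this relation: a clip-based 3D model can have high FPS despite higher latency, because each forward pass consumes many frames at once.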

As we can see in Table 5, comparing latency, 2D models are much faster than all the 3D variants. This is probably why most real-world video applications still adopt frame-wise methods. Second, as mentioned in [170, 259], FLOPS is not strongly correlated with the actual inference time (i.e., latency). Third, comparing accuracy, most 3D models reach similar accuracy around 75%, but pre-training on a larger dataset can significantly boost performance. This indicates the importance of training data and partially suggests that self-supervised pre-training might be a promising way to further improve existing methods.


Discussion and Future Work(讨论与未来工作)

We have surveyed more than 200 deep learning based methods for video action recognition since 2014. Despite performance on benchmark datasets plateauing, there are many active and promising directions in this task worth exploring.


Analysis and insights(分析与见解)

More and more methods have been developed to improve video action recognition; at the same time, several papers summarize these methods and provide analysis and insights. Huang et al. [82] perform an explicit analysis of the effect of temporal information for video understanding, trying to answer the question “how important is the motion in the video for recognizing the action”. Feichtenhofer et al. [48, 49] provide an insightful visualization of what two-stream models have learned, in order to understand how these deep representations work and what they are capturing. Li et al. [124] introduce the concept of the representation bias of a dataset and find that current datasets are biased towards static representations. Experiments on such biased datasets may lead to erroneous conclusions, which is indeed a big problem limiting the development of video action recognition. Recently, Piergiovanni et al. introduced the AViD [165] dataset to cope with data bias by collecting data from diverse groups of people. These papers provide great insights that help fellow researchers understand the challenges, open problems and where the next breakthrough might reside.


Data augmentation(数据增广)

Numerous data augmentation methods have been proposed in the image recognition domain, such as mixup [258], cutout [31], CutMix [254], AutoAugment [27], Fast AutoAugment [126], etc. However, video action recognition still adopts the basic data augmentation techniques introduced before 2015 [217, 188], including random resizing, random cropping and random horizontal flipping. Recently, SimCLR [17] and other papers have shown that color jittering and random rotation greatly help representation learning. Hence, an investigation of different data augmentation techniques for video action recognition would be particularly useful, and may change the data pre-processing pipeline for all existing methods.
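As an example of porting an image-domain augmentation to video, mixup [258] extends naturally to clips by blending two clips and their labels; this is a straightforward sketch, not an established video recipe:

```python
import numpy as np

def mixup_clips(clip_a, clip_b, label_a, label_b, alpha=0.2, rng=None):
    """Mixup applied to video clips: blend two clips (T x H x W x C arrays)
    and their one-hot labels with a Beta-distributed coefficient."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)              # blending coefficient in [0, 1]
    clip = lam * clip_a + (1 - lam) * clip_b
    label = lam * label_a + (1 - lam) * label_b
    return clip, label, lam
```

Training then uses the blended clip with the soft label, exactly as in the image case; whether the same `alpha` values transfer to video is an open question.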


Video domain adaptation(视频域适配)

Domain adaptation (DA) has been studied extensively in recent years to address the domain shift problem. While accuracy on standard datasets keeps climbing, the generalization capability of current video models across datasets or domains is less explored. There is early work on video domain adaptation [193, 241, 89, 159]. However, this literature focuses on small-scale video DA with only a few overlapping categories, which may not reflect the actual domain discrepancy and may lead to biased conclusions. Chen et al. [15] introduce two larger-scale datasets to investigate video DA and find that aligning temporal dynamics is particularly useful. Pan et al. [152] adopt co-attention to solve the temporal misalignment problem. Very recently, Munro et al. [145] explore a multi-modal self-supervision method for fine-grained video action recognition and show the effectiveness of multi-modality learning in video DA. Shuffle and Attend [95] argues that aligning the features of all sampled clips leads to a sub-optimal solution, because not all clips contain relevant semantics; the authors therefore propose an attention mechanism that focuses on informative clips and discards non-informative ones. In conclusion, video DA is a promising direction, especially for researchers with limited computing resources.

近年来,对域适配(Domain adaptation, DA)进行了广泛的研究,以解决域偏移问题。尽管标准数据集上的准确性越来越高,但目前很少研究当前视频模型跨数据集或跨域的泛化能力。已有一些关于视频域自适应的早期工作[193,241,89,159]。但是,这些文献集中在只有几个重叠类别的小规模视频DA上,这可能无法反映实际的域差异,并可能导致结论有偏差。Chen等人[15]引入了两个较大规模的数据集来研究视频DA,并发现对齐时间动态性特别有用。Pan等人[152]采用共同注意解决时间错位问题。最近,Munro等人[145]探索了一种用于细粒度视频动作识别的多模态自监督方法,并展示了多模态学习在视频DA中的有效性。Shuffle and Attend[95]认为,由于并非所有剪辑均包含相关语义,因此将所有采样剪辑的特征对齐会导致次优解。因此,他们建议使用一种注意力机制,将更多的注意力集中在信息剪辑上,而丢弃非信息剪辑。总之,视频DA是一个有前途的方向,特别是对于计算资源较少的研究人员而言。

Neural architecture search(神经结构搜索)

Neural architecture search (NAS) has attracted great interest in recent years and is a promising research direction. However, given its greedy need for computing resources, only a few papers have been published in this area [156, 163, 161, 178]. The TVN family [161], which jointly optimizes parameters and runtime, achieves accuracy competitive with contemporary human-designed models, and runs much faster (within 37 to 100 ms on a CPU and 10 ms on a GPU per 1-second video clip). AssembleNet [178] and AssembleNet++ [177] provide a generic approach to learn the connectivity among feature representations across input modalities, and show surprisingly good performance on Charades and other benchmarks. AttentionNAS [222] proposes a solution for spatio-temporal attention cell search; the discovered cell can be plugged into any network to improve the spatio-temporal features. These papers demonstrate the potential of NAS for video understanding. Recently, efficient ways of searching architectures have been proposed in the image recognition domain, such as DARTS [130], ProxylessNAS [11], ENAS [160], one-shot NAS [7], etc. It would be interesting to combine efficient 2D CNNs with efficient search algorithms to perform video NAS at a reasonable cost.

神经结构搜索(Neural architecture search, NAS)近年来引起了极大的兴趣,并且是一个有前途的研究方向。但是,由于对计算资源的贪婪需求,在该领域仅发表了几篇论文[156,163,161,178]。TVN系列[161]共同优化参数和运行时间,可以达到与人工设计的现代模型相竞争的准确性,并且运行速度更快(每1秒视频片段在CPU上运行37至100ms,在GPU上运行10ms)。AssembleNet[178]和AssembleNet++[177]提供了一种通用方法来学习跨输入模态的特征表示之间的连通性,并在Charades和其他基准测试中表现出令人惊讶的良好性能。AttentionNAS[222]提出了一种用于时空注意力单元搜索的解决方案,找到的单元可以插入任何网络以改善时空特征。这些论文的确显示了NAS对视频理解的潜力。最近,在图像识别领域已经提出了一些高效的结构搜索方法,例如DARTS[130],ProxylessNAS[11],ENAS[160],one-shot NAS[7]等。将高效的2D CNN与高效的搜索算法相结合,以合理的成本执行视频NAS,将是一个有趣的方向。

Efficient model development(高效的模型开发)

Despite their accuracy, it is difficult to deploy deep learning based methods for video understanding in real-world applications. There are several major challenges: (1) most methods are developed in an offline setting, which means the input is a short video clip rather than a video stream as in an online setting; (2) most methods do not meet the real-time requirement; (3) 3D convolutions and other non-standard operators are often incompatible with non-GPU devices (e.g., edge devices). Hence, the development of efficient network architectures based on 2D convolutions is a promising direction. Approaches proposed in the image classification domain can be easily adapted to video action recognition, e.g., model compression, model quantization, model pruning, distributed training [68, 127], mobile networks [80, 265], mixed precision training, etc. However, more effort is needed for the online setting, since the input to most action recognition applications is a video stream, such as surveillance footage. We may need a new and more comprehensive dataset for benchmarking online video action recognition methods. Lastly, using compressed videos is desirable because most videos are already stored in compressed form, which gives free access to motion information.


New datasets(新数据集)

Data is at least as important as model development for machine learning. For video action recognition, most datasets are biased towards spatial representations [124], i.e., most actions can be recognized from a single frame of the video without considering temporal movement. Hence, a new dataset oriented towards long-term temporal modeling is required to advance video understanding. Furthermore, most current datasets are collected from YouTube. Due to copyright/privacy issues, the dataset organizers often only release the YouTube ids or video links for users to download, not the actual videos. The first problem is that downloading large-scale datasets may be slow in some regions; in particular, YouTube recently started to block massive downloading from a single IP. Thus, many researchers may not even be able to obtain the dataset to start working in this field. The second problem is that, due to region limitations and privacy issues, some videos are no longer accessible. For example, the original Kinetics400 dataset has over 300K videos, but at this moment we can only crawl about 280K videos. On average, 5% of the videos are lost every year. It is impossible to perform fair comparisons between methods when they are trained and evaluated on different data.

数据与机器学习的模型开发一样重要或至少同样重要。对于视频动作识别,大多数数据集偏向于空间表示[124],即,大多数动作可以通过视频内的单个帧来识别,而无需考虑时间移动。因此,就长期时间建模而言,需要新的数据集来推进视频理解。此外,大多数最新的数据集都是从YouTube收集的。由于版权/隐私问题,数据集组织者通常仅发布YouTube ID或视频链接供用户下载,而不发布实际视频。第一个问题是在某些地区下载大规模数据集可能会很慢。特别是,YouTube最近开始阻止从单个IP进行大量下载。因此,许多研究人员甚至可能无法获得该数据集以开始在该领域中工作。第二个问题是,由于地区限制和隐私问题,一些视频不再可用。例如,原始的Kinetics400数据集包含超过30万个视频,但是目前,我们只能抓取约28万个视频。平均而言,我们每年损失5%的视频。当对不同的数据进行训练和评估时,不可能在方法之间进行公平的比较。

Video adversarial attack(视频对抗攻击)

Adversarial examples have been well studied on image models. [199] first showed that an adversarial sample, computed by adding a small amount of noise to the original image, may lead to a wrong prediction. However, limited work has been done on attacking video models. This task usually considers two settings: a white-box attack [86, 119, 66, 21], where the adversary has full access to the model, including the exact gradients of a given input; or a black-box one [93, 245, 226], in which the structure and parameters of the model are hidden, so that the attacker can only access (input, output) pairs through queries. The recent ME-Sampler [260] leverages motion information directly in generating adversarial videos, and is shown to successfully attack a number of video classification models using far fewer queries. In summary, this direction is useful since many companies provide APIs for services such as video classification, anomaly detection, shot detection, face detection, etc. In addition, this topic is also related to detecting DeepFake videos. Hence, investigating both attack and defense methods is crucial to keeping these video services safe.

在图像模型上已经很好地研究了恶意攻击例子。[199]首先表明,通过在原始图像上插入少量噪声而计算出的对抗样本可能会导致错误的预测。但是,在攻击视频模型方面所做的工作有限。此任务通常考虑两种设置,白盒攻击[86,119,66,21],在这种攻击中,对手始终可以完全访问模型,包括给定输入的精确梯度,或者黑盒[93,245,226],其中模型的结构和参数被阻止,以使攻击者只能通过查询访问(输入,输出)对。ME-Sampler [260]的最新工作直接在生成对抗视频时利用了运动信息,并被证明可以使用更少的查询来成功地攻击多种视频分类模型。总而言之,该方向很有用,因为许多公司为诸如视频分类,异常检测,镜头检测,面部检测等服务提供API。此外,该主题还与检测DeepFake视频有关。因此,研究攻击和防御方法对于确保这些视频服务的安全至关重要。
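A minimal white-box example in the FGSM style on a linear classifier illustrates the setting; real video attacks perturb the frame tensor of a deep model the same way, and [260] additionally exploits motion information, which this sketch does not:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm(x, y_onehot, W, eps):
    """White-box FGSM-style perturbation on a linear classifier (logits = W @ x):
    one signed gradient step of the cross-entropy loss w.r.t. the input.
    For softmax cross-entropy, d(loss)/d(logits) = p - y, so
    d(loss)/dx = W.T @ (p - y)."""
    grad_x = W.T @ (softmax(W @ x) - y_onehot)
    return x + eps * np.sign(grad_x)

# Toy 2-class example: a large eps makes the prediction flip clearly.
W = np.eye(2)
x = np.array([1.0, 0.0])
y = np.array([1.0, 0.0])          # true class 0
x_adv = fgsm(x, y, W, eps=2.0)
```

The perturbation is imperceptible in realistic settings (small eps per pixel), yet it moves the input across the decision boundary.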

Zero-shot action recognition(零镜头动作识别)

Zero-shot learning (ZSL) has been trending in the image understanding domain, and has recently been adapted to video action recognition. Its goal is to transfer the learned knowledge to classify previously unseen categories. Because (1) data sourcing and annotation are expensive and (2) the set of possible human actions is huge, zero-shot action recognition is a very useful task for real-world applications. There are many early attempts [242, 88, 243, 137, 168, 57] in this direction. Most of them follow a standard framework: first extract visual features from videos using a pre-trained network, and then train a joint model that maps the visual embedding to a semantic embedding space. In this manner, the model can be applied to new classes by finding the test class whose embedding is the nearest neighbor of the model's output. A recent work, URL [279], proposes to learn a universal representation that generalizes across datasets. Following URL [279], [10] present the first end-to-end ZSL action recognition model. They also establish a new ZSL training and evaluation protocol, and provide an in-depth analysis to further advance this field. Inspired by the success of pre-training followed by zero-shot inference in the NLP domain, we believe ZSL action recognition is a promising research topic.

零镜头学习(Zero-shot learning, ZSL)在图像理解领域已成为趋势,并且最近已被应用于视频动作识别。它的目标是转移学到的知识,对以前未见过的类别进行分类。由于(1)数据获取和标注成本高昂,以及(2)人类可能进行的动作种类的集合很大,因此零镜头动作识别对于现实应用程序是非常有用的任务。在这个方向上有许多早期尝试[242、88、243、137、168、57]。它们中的大多数遵循标准框架:首先使用预先训练的网络从视频中提取视觉特征,然后训练将视觉嵌入映射到语义嵌入空间的联合模型。通过这种方式,可以通过找到其嵌入与模型输出最接近的测试类,将模型应用于新类。最近的工作URL[279]提议学习一种可以跨数据集泛化的通用表示形式。在URL[279]之后,[10]提出了第一个端到端ZSL动作识别模型。他们还建立了新的ZSL训练和评估协议,并提供了深入的分析以进一步推进该领域。受NLP领域中先预训练再零样本(zero-shot)推断的成功的鼓舞,我们认为ZSL动作识别是一个有前途的研究主题。
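At test time, the standard framework described above reduces to a nearest-neighbor search in the semantic space; in the sketch below, the identity projection stands in for the learned visual-to-semantic mapping, and the class embeddings are toy vectors:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zsl_classify(visual_emb, project, class_embs):
    """Map a visual embedding into the semantic space and return the index of
    the unseen class whose embedding is its nearest neighbor (by cosine
    similarity). `project` is the learned visual-to-semantic mapping."""
    z = project(visual_emb)
    sims = [cosine(z, c) for c in class_embs]
    return int(np.argmax(sims))

# Toy example: three unseen classes with orthogonal "word embeddings".
class_embs = np.eye(3)
pred = zsl_classify(np.array([0.1, 0.9, 0.2]), lambda v: v, class_embs)
```

In practice the class embeddings come from word vectors or attribute annotations of the unseen action names, so no video of those classes is needed at training time.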

Weakly-supervised video action recognition(弱监督的视频动作识别)

Building a high-quality video action recognition dataset [190, 100] usually requires multiple laborious steps: 1) sourcing a large number of raw videos, typically from the internet; 2) removing videos irrelevant to the categories in the dataset; 3) manually trimming the video segments that contain the actions of interest; 4) refining the categorical labels. Weakly-supervised action recognition explores how to reduce the cost of curating training data.


The first direction of research [19, 60, 58] aims to reduce the cost of sourcing videos and accurate categorical labeling. These works design training methods that use training data, such as action-related images or partially annotated videos, gathered from publicly available sources such as the Internet. Thus this paradigm is also referred to as webly-supervised learning [19, 58]. Recent work on omni-supervised learning [60, 64, 38] also follows this paradigm but features bootstrapping on unlabelled videos by distilling the models' own inference results.


The second direction aims at removing trimming, the most time-consuming part of annotation. UntrimmedNet [216] proposed a method to learn action recognition models on untrimmed videos with only categorical labels [149, 172]. This task is also related to weakly-supervised temporal action localization, which aims to automatically generate the temporal span of the actions. Several papers propose to simultaneously [155] or iteratively [184] learn models for these two tasks.
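A minimal sketch of learning from untrimmed videos with only video-level labels: per-clip class scores are aggregated with an attention weight over clips, in the spirit of UntrimmedNet [216]. The numbers and the `video_level_scores` helper are illustrative, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def video_level_scores(clip_logits, clip_attn_logits):
    """Aggregate per-clip class logits into one video-level prediction
    using a (normally learned) attention weight per clip."""
    attn = softmax(clip_attn_logits, axis=0)                   # (T,) weights over clips
    return softmax((attn[:, None] * clip_logits).sum(axis=0))  # (C,) class probabilities

# Toy example: 4 clips, 3 classes; only clip 2 contains the action.
clip_logits = np.array([[0.1, 0.0, 0.0],
                        [0.2, 0.1, 0.0],
                        [0.0, 3.0, 0.0],
                        [0.1, 0.0, 0.2]])
attn_logits = np.array([-2.0, -2.0, 4.0, -2.0])  # attention focuses on clip 2
probs = video_level_scores(clip_logits, attn_logits)
print(int(np.argmax(probs)))  # → 1 (the action class in clip 2)
```

Because the video-level prediction is supervised, the attention weights are learned as a by-product, which is what links this task to weakly-supervised temporal localization.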


Fine-grained video action recognition

Popular action recognition datasets, such as UCF101 [190] or Kinetics400 [100], mostly comprise actions happening in various scenes. However, models learned on these datasets can overfit to contextual information irrelevant to the actions [224, 227, 24]. Several datasets have been proposed to study the problem of fine-grained action recognition, which examines a model's capacity to capture action-specific information. These datasets comprise fine-grained actions in human activities such as cooking [28, 108, 174], working [103] and sports [181, 124]. For example, FineGym [181] is a recent large dataset annotated with different moves and sub-actions in gymnastics videos.


Egocentric action recognition

Recently, large-scale egocentric action recognition [29, 28] has attracted increasing interest with the emergence of wearable camera devices. Egocentric action recognition requires a fine understanding of hand motion and of the objects being interacted with in a complex environment. A few papers leverage object detection features to provide fine-grained object context and improve egocentric video recognition [136, 223, 229, 180]. Others incorporate spatio-temporal attention [192] or gaze annotations [131] to localize the interacting object and facilitate action recognition. As in third-person action recognition, multi-modal inputs (e.g., optical flow and audio) have been demonstrated to be effective for egocentric action recognition [101].



Multi-modal video understanding

Multi-modal video understanding has attracted increasing attention in recent years [55, 3, 129, 167, 154, 2, 105]. There are two main categories of multi-modal video understanding. The first group of approaches uses multiple modalities, such as scene, object, motion, and audio, to enrich the video representations. In the second group, the goal is to design a model that utilizes modality information as a supervision signal for pre-training [195, 138, 249, 62, 2].


Multi-modality for comprehensive video understanding

Learning a robust and comprehensive video representation is extremely challenging due to the complexity of semantics in videos. Video data often varies along many dimensions, including appearance, motion, audio, text, and scene [55, 129, 166]. Therefore, utilizing these multi-modal representations is a critical step towards understanding video content more effectively. The multi-modal representation of a video can be approximated by gathering representations of various modalities such as scene, object, audio, motion, appearance, and text. Ngiam et al. [148] made an early attempt at using multiple modalities to obtain better features, utilizing videos of lips and their corresponding speech for multi-modal representation learning. Miech et al. [139] proposed a mixture-of-embedding-experts model to combine multiple modalities, including motion, appearance, audio, and face features, and to learn a shared embedding space between these modalities and text. Roig et al. [175] combine multiple modalities such as action, scene, object, and acoustic event features in a pyramidal structure for action recognition, showing that adding each modality improves the final accuracy. Both CE [129] and MMT [55] follow a research line similar to [139], where the goal is to combine multiple modalities to obtain a comprehensive video representation for joint video-text representation learning. Piergiovanni et al. [166] utilized textual data together with video data to learn a joint embedding space; with this learned joint embedding space, the method is capable of zero-shot action recognition. This line of research is promising due to the availability of strong semantic extraction models and the success of transformers on both vision and language tasks.
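The shared-embedding idea behind the mixture-of-embedding-experts model [139] can be sketched as scoring each modality embedding against the text embedding and mixing the scores with (normally learned, text-conditioned) weights. All vectors and weights below are toy values, not outputs of real experts:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def video_text_similarity(modality_embs, text_emb, weights):
    """Mixture-of-experts style sketch: score each modality embedding
    against the text embedding, then mix the per-modality similarities
    with normalized weights."""
    sims = np.array([cosine(e, text_emb) for e in modality_embs])
    w = weights / weights.sum()
    return float(w @ sims)

# Hypothetical appearance / motion / audio embeddings for one video.
mods = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
text = np.array([1.0, 0.2])
score = video_text_similarity(mods, text, np.array([0.5, 0.2, 0.3]))
print(round(score, 3))
```

At retrieval time, candidate videos are simply ranked by this mixed similarity to the query text.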


Multi-modality for self-supervised video representation learning

Most videos contain multiple modalities such as audio or text/captions. These modalities are a great source of supervision for learning video representations [3, 144, 154, 2, 162]. Korbar et al. [105] incorporated the natural synchronization between audio and video as a supervision signal in their contrastive learning objective for self-supervised representation learning. In multi-modal self-supervised representation learning, the dataset plays an important role. VideoBERT [195] collected 310K cooking videos from YouTube; however, this dataset is not publicly available. Similar to BERT, VideoBERT used a "masked language model" training objective and quantized the visual representations into "visual words". Miech et al. [140] introduced the HowTo100M dataset in 2019. This dataset includes 136M clips from 1.22M videos with their corresponding text. They collected the dataset from YouTube with the aim of obtaining instructional videos (how to perform an activity); in total, it covers 23.6K instructional tasks. MIL-NCE [138] used this dataset for self-supervised cross-modal representation learning, tackling the problem of visually misaligned narrations by considering multiple positive pairs in the contrastive learning objective. ActBERT [275] utilized the HowTo100M dataset for self-supervised pre-training, incorporating visual, action, text, and object features for cross-modal representation learning. Recently, AVLnet [176] and MMV [2] considered three modalities (visual, audio, and language) for self-supervised representation learning. This research direction is attracting increasing attention due to the success of contrastive learning on many vision and language tasks, and due to the abundance of unlabeled multi-modal video data on platforms such as YouTube, Instagram, or Flickr.
The top section of Table 6 compares multi-modal self-supervised representation learning methods. We will discuss more work on video-only representation learning in the next section.
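MIL-NCE's handling of misaligned narrations can be sketched as a contrastive objective whose numerator sums over several candidate positive narrations instead of a single one. This is an illustrative simplification with toy embeddings, not the exact loss of [138]:

```python
import numpy as np

def mil_nce_loss(video_emb, text_embs, pos_idx, temperature=0.1):
    """MIL-NCE-style sketch: contrast one video clip against candidate
    narrations, treating several temporally close narrations as
    positives (summed in the numerator) to tolerate misalignment."""
    sims = text_embs @ video_emb / temperature
    log_num = np.log(np.exp(sims[pos_idx]).sum())   # mass on the positive set
    log_denom = np.log(np.exp(sims).sum())          # mass on all candidates
    return float(log_denom - log_num)               # non-negative

# Toy example: 4 candidate narrations, the first two treated as positives.
rng = np.random.default_rng(0)
texts = rng.normal(size=(4, 8))
texts /= np.linalg.norm(texts, axis=1, keepdims=True)
video = texts[0] + 0.1 * rng.normal(size=8)         # video close to narration 0
video /= np.linalg.norm(video)
loss = mil_nce_loss(video, texts, pos_idx=[0, 1])
print(loss >= 0.0)  # → True
```

Enlarging the positive set can only lower the loss, which is exactly why the objective is robust to a narration that is slightly out of sync with the visuals.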


Table 6. Comparison of self-supervised video representation learning methods. The top section shows multi-modal video representation learning approaches and the bottom section shows video-only representation learning methods. From left to right, we show the self-supervised training setting, e.g., dataset, modalities, resolution, and architecture. The last four columns show linear-evaluation and fine-tuning accuracies on UCF101 and HMDB51, which measure the quality of the self-supervised pre-trained model. HTM: HowTo100M, YT8M: YouTube8M, AS: AudioSet, IG-K: IG-Kinetics, K400: Kinetics400, K600: Kinetics600.

| Method | Dataset | Video | Audio | Text | Size | Backbone | Venue | UCF101 Linear | UCF101 FT | HMDB51 Linear | HMDB51 FT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AVTS [105] | K400 | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2018 | - | 86.2 | - | 52.3 |
| AVTS [105] | AS | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2018 | - | 89.1 | - | 58.1 |
| CBT [194] | K600+ | X | - | X | 112 | S3D | arXiv 2019 | 54.0 | 79.5 | 29.5 | 44.6 |
| MIL-NCE [138] | HTM | X | - | X | 224 | S3D | CVPR 2020 | 82.7 | 91.3 | 53.1 | 61.0 |
| ELO [162] | YT8M | X | X | - | 224 | R(2+1)D-50 | CVPR 2020 | - | 93.8 | 64.5 | 67.4 |
| XDC [3] | K400 | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2020 | - | 86.8 | - | 52.6 |
| XDC [3] | AS | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2020 | - | 93.0 | - | 63.7 |
| XDC [3] | IG65M | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2020 | - | 94.6 | - | 66.5 |
| XDC [3] | IG-K | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2020 | - | 95.5 | - | 68.9 |
| AVID [144] | AS | X | X | - | 224 | R(2+1)D-50 | arXiv 2020 | - | 91.5 | - | 64.7 |
| GDT [154] | K400 | X | X | - | 112 | R(2+1)D-18 | arXiv 2020 | - | 89.3 | - | 60.0 |
| GDT [154] | AS | X | X | - | 112 | R(2+1)D-18 | arXiv 2020 | - | 92.5 | - | 66.1 |
| GDT [154] | IG65M | X | X | - | 112 | R(2+1)D-18 | arXiv 2020 | - | 95.2 | - | 72.8 |
| MMV [2] | AS+HTM | X | X | X | 200 | S3D | NeurIPS 2020 | 89.6 | 92.5 | 62.6 | 69.6 |
| MMV [2] | AS+HTM | X | X | X | 200 | TSM-50x2 | NeurIPS 2020 | 91.8 | 95.2 | 67.1 | 75.0 |

| Method | Dataset | Video | Audio | Text | Size | Backbone | Venue | UCF101 Linear | UCF101 FT | HMDB51 Linear | HMDB51 FT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OPN [115] | UCF101 | X | - | - | 227 | VGG | ICCV 2017 | - | 59.6 | - | 23.8 |
| 3D-RotNet [94] | K400 | X | - | - | 112 | R3D | arXiv 2018 | - | 62.9 | - | 33.7 |
| ST-Puzzle [102] | K400 | X | - | - | 224 | R3D | AAAI 2019 | - | 63.9 | - | 33.7 |
| VCOP [240] | UCF101 | X | - | - | 112 | R(2+1)D | CVPR 2019 | - | 72.4 | - | 30.9 |
| DPC [71] | K400 | X | - | - | 128 | R-2D3D | ICCVW 2019 | - | 75.7 | - | 35.7 |
| SpeedNet [6] | K400 | X | - | - | 224 | S3D-G | CVPR 2020 | - | 81.1 | - | 48.8 |
| MemDPC [72] | K400 | X | - | - | 224 | R-2D3D | ECCV 2020 | 54.1 | 86.1 | 30.5 | 54.5 |
| CoCLR [73] | K400 | X | - | - | 128 | S3D | NeurIPS 2020 | 74.5 | 87.9 | 46.1 | 54.6 |
| CVRL [167] | K400 | X | - | - | 224 | R3D-50 | arXiv 2020 | - | 92.2 | - | 66.7 |
| CVRL [167] | K600 | X | - | - | 224 | R3D-50 | arXiv 2020 | - | 93.4 | - | 68.0 |

Self-supervised video representation learning

Self-supervised learning has attracted more attention recently because it can leverage large amounts of unlabeled data: a pretext task is designed to obtain free supervisory signals from the data itself. It first emerged in image representation learning. On images, a first stream of papers designed pretext tasks around completing missing information, such as image colorization [262] and image reordering [153, 61, 263]. A second stream of papers uses instance discrimination [235] as the pretext task and contrastive losses [235, 151] for supervision; these methods learn visual representations by modeling the visual similarity of object instances without class labels [235, 75, 201, 18, 17].
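A minimal numpy sketch of instance discrimination with a contrastive (InfoNCE-style) loss: two augmented views of the same instance form the positive pair, and all other instances in the batch serve as negatives. The "augmentation" here is just additive noise for illustration:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE sketch for instance discrimination. z1 and z2 are
    L2-normalized embeddings of two views of the same batch; positives
    lie on the diagonal of the similarity matrix."""
    sims = z1 @ z2.T / temperature            # (N, N) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)   # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())  # cross-entropy on the diagonal

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
z2 = z + 0.05 * rng.normal(size=(8, 16))      # a second, slightly perturbed "view"
z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
print(info_nce(z, z2) < info_nce(z, z2[::-1]))  # → True: aligned views score lower loss
```

Reversing the second batch (`z2[::-1]`) destroys the pairing, so the loss rises, which is exactly the signal the pretext task trains on.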


Self-supervised learning is also viable for videos. Compared with images, videos have an additional axis, the temporal dimension, which we can use to craft pretext tasks. Information completion tasks for this purpose include predicting the correct order of shuffled frames [141, 52] and of video clips [240]. Jing et al. [94] focus on the spatial dimension only, predicting the rotation angles of rotated video clips. Combining temporal and spatial information, several tasks have been introduced to solve a space-time cubic puzzle, anticipate future frames [208], forecast long-term motions [134], and predict motion and appearance statistics [211]. RSPNet [16] and visual tempo [247] exploit the relative speed between video clips as a supervision signal.
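A clip-order pretext task of the kind used by [240] can be sketched as follows: cut a video into clips, apply a random permutation, and use the permutation id as a free label for the network to predict. `clip_order_sample` is a hypothetical helper, not the paper's implementation:

```python
import itertools
import numpy as np

def clip_order_sample(video, n_clips=3, rng=None):
    """Clip-order pretext sketch: split a video along time into n clips,
    shuffle them, and return (shuffled clips, permutation class id)."""
    if rng is None:
        rng = np.random.default_rng()
    perms = list(itertools.permutations(range(n_clips)))  # 3! = 6 classes
    clips = np.array_split(video, n_clips)                # split along the time axis
    label = int(rng.integers(len(perms)))
    shuffled = [clips[i] for i in perms[label]]
    return shuffled, label

video = np.arange(24).reshape(12, 2)   # 12 frames of toy 2-D features
clips, label = clip_order_sample(video, rng=np.random.default_rng(3))
print(label, [int(c[0, 0]) for c in clips])
```

The supervision is free: the label is generated by the sampler itself, and the recognition backbone must reason about temporal order to predict it.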


The added temporal axis also provides flexibility in designing instance discrimination pretexts [67, 167]. Inspired by the decoupling of 3D convolution into spatially and temporally separable convolutions [239], Zhang et al. [266] proposed to decouple video representation learning into two sub-tasks: spatial contrast and temporal contrast. Recently, Han et al. [72] proposed memory-augmented dense predictive coding for self-supervised video representation learning: each video is split into several blocks, and the embedding of a future block is predicted by a combination of condensed representations in memory.


The temporal continuity in videos inspires researchers to design other pretext tasks around correspondence. Wang et al. [221] proposed to learn representations by performing cycle-consistency tracking: they track the same object backward and then forward through consecutive video frames, and use the inconsistency between the start and end points as the loss function. TCC [39] is a concurrent paper; instead of tracking local objects, it uses cycle-consistency to perform frame-wise temporal alignment as a supervision signal. [120] is a follow-up to [221] that utilizes both object-level and pixel-level correspondence across video frames. Recently, long-range temporal correspondence was modeled as a random walk on a graph to help learn video representations [87].
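The cycle-consistency idea of [221] can be sketched with nearest-neighbor patch tracking: track a patch backward through the frames, then forward again, and measure whether the cycle returns to where it started. Features here are toy 2-D vectors; the real method tracks with learned deep features and backpropagates through the matching:

```python
import numpy as np

def track_step(query, frame_feats):
    """Match a query feature to its nearest neighbor among the patch
    features of an adjacent frame; returns the matched patch index."""
    return int(np.argmin(np.linalg.norm(frame_feats - query, axis=1)))

def cycle_inconsistency(frames, start_idx):
    """Track patch start_idx of the last frame backward to the first
    frame, then forward again; the gap between start and end indices
    is the self-supervised error signal (zero when the cycle closes)."""
    idx = start_idx
    for t in range(len(frames) - 1, 0, -1):   # backward in time
        idx = track_step(frames[t][idx], frames[t - 1])
    for t in range(1, len(frames)):           # then forward again
        idx = track_step(frames[t - 1][idx], frames[t])
    return abs(idx - start_idx)

# Toy features: two patches per frame, drifting smoothly over 4 frames.
frames = [np.array([[t * 0.1, 0.0], [1.0, t * 0.1]]) for t in range(4)]
print(cycle_inconsistency(frames, start_idx=1))  # → 0, the cycle closes
```

In the learned setting this quantity (a differentiable soft version of it) is minimized, so good features are ones under which tracking is self-consistent.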


We compare video-only self-supervised representation learning methods in the bottom section of Table 6. A clear trend can be observed: recent papers achieve much better linear-evaluation accuracy, and fine-tuning accuracy comparable to supervised pre-training. This shows that self-supervised learning could be a promising direction towards learning better video representations.



In this survey, we present a comprehensive review of more than 200 recent deep-learning-based approaches to video action recognition. Although this is not an exhaustive list, we hope the survey serves as an easy-to-follow tutorial for those seeking to enter the field, and as an inspiring discussion for those seeking to find new research directions.



We would like to thank Peter Gehler, Linchao Zhu and Thomas Brady for constructive feedback and fruitful discussions.
