论文翻译: A Comprehensive Study of Deep Video Action Recognition

A Comprehensive Study of Deep Video Action Recognition
一个全面的深度视频动作识别研究综述

Abstract(摘要)

Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to datasets and evaluation protocol variances. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition. We first introduce the 17 video action recognition datasets that influenced the design of models. Then we present video action recognition models in chronological order: starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models. In addition, we benchmark popular methods on several representative datasets and release code for reproducibility. In the end, we discuss open problems and shed light on opportunities for video action recognition to facilitate new research ideas.

视频动作识别是视频理解的代表性任务之一。在过去的十年中,得益于深度学习的出现,我们见证了视频动作识别领域的巨大发展。但我们同样遇到了新的挑战,包括视频中长跨度时序信息的建模、昂贵的计算成本,以及由于数据集和评估协议的差异导致的结果不可比。在本论文中,我们对现有的超过200篇基于深度学习的视频动作识别论文做了一个详细的调研。我们首先介绍了17个影响模型设计的视频动作识别数据集。然后,我们按时间顺序介绍视频动作识别模型:从早期尝试采用深度学习开始,然后是双流网络模型,接着是3D卷积核的采用,最后是近期的高效计算模型。此外,我们使用目前流行的方法在一些有代表性的数据集上进行了基准测试,并发布了可复现的代码。最后,我们讨论了一些开放性问题,同时阐明了视频动作识别领域新研究思路的机遇。

Introduction(引言)

One of the most important tasks in video understanding is to understand human actions. It has many real-world applications, including behavior analysis, video retrieval, human-robot interaction, gaming, and entertainment. Human action understanding involves recognizing, localizing, and predicting human behaviors. The task to recognize human actions in a video is called video action recognition. In Figure 1, we visualize several video frames with the associated action labels, which are typical human daily activities such as shaking hands and riding a bike.

视频理解中最重要的任务之一就是理解人类的行为。其可以应用到许多实际的场景中去,包括行为分析、视频检索、人机交互、游戏和娱乐。人类行为理解涉及识别、定位和预测人类行为。在视频中识别人类行为的任务就称之为视频动作识别。如图1所示,我们将一些视频帧及其关联的动作标签进行了可视化,这些都是典型的人类日常活动,例如握手和骑自行车。

Over the last decade, there has been growing research interest in video action recognition with the emergence of high-quality large-scale action recognition datasets. We summarize the statistics of popular action recognition datasets in Figure 2. We see that both the number of videos and classes increase rapidly, e.g, from 7K videos over 51 classes in HMDB51 [109] to 8M videos over 3,862 classes in YouTube8M [1]. Also, the rate at which new datasets are released is increasing: 3 datasets were released from 2011 to 2015 compared to 13 released from 2016 to 2020.

在过去的十年中,随着高质量大规模的视频动作识别数据集的出现,人们对视频动作识别领域的研究兴趣日益增长。我们统计汇总了流行的动作识别数据集,如图2所示。我们发现视频和类别的数量都在迅速增加,例如,从HMDB51[109]数据集的51个类别7000个视频,到YouTube8M[1]数据集的3862个类别800万个视频。此外,新数据集的发布速度正在上升:2011年至2015年间发布了3个数据集,而2016年至2020年间发布了13个数据集。

Thanks to both the availability of large-scale datasets and the rapid progress in deep learning, there is also a rapid growth in deep learning based models to recognize video actions. In Figure 3, we present a chronological overview of recent representative work. DeepVideo [99] is one of the earliest attempts to apply convolutional neural networks to videos. We observed three trends here. The first trend, started by the seminal paper on Two-Stream Networks [187], adds a second path to learn the temporal information in a video by training a convolutional neural network on the optical flow stream. Its great success inspired a large number of follow-up papers, such as TDD [214], LRCN [37], Fusion [50], TSN [218], etc. The second trend was the use of 3D convolutional kernels to model video temporal information, such as I3D [14], R3D [74], S3D [239], Non-local [219], SlowFast [45], etc. Finally, the third trend focused on computational efficiency to scale to even larger datasets so that they could be adopted in real applications. Examples include Hidden TSN [278], TSM [128], X3D [44], TVN [161], etc.

由于大规模数据集的可用性和深度学习的迅速发展,基于深度学习的视频动作识别模型也得到了快速发展。在图3中,我们按时间顺序概述了视频动作识别领域的代表性模型。DeepVideo[99]是最早尝试将卷积神经网络应用于视频的模型之一。我们在这里观察到了三种路线。第一个路线由开创性的论文——Two-Stream网络[187]提出,它增加了第二条路径,即在光流上训练卷积神经网络来学习视频中的时域信息。它的巨大成功启发了许多跟进此方法的论文,例如TDD[214]、LRCN[37]、Fusion[50]、TSN[218]等。第二个路线是使用3D卷积核来建立视频时序信息模型,例如I3D[14]、R3D[74]、S3D[239]、Non-local[219]、SlowFast[45]等。最后一种路线则侧重于提升计算效率以扩展到更大的数据集,使其可以在实际应用中被采用。例如Hidden-TSN[278]、TSM[128],X3D[44],TVN[161]等。

Despite the large number of deep learning based models for video action recognition, there is no comprehensive survey dedicated to these models. Previous survey papers either put more efforts into hand-crafted features [77, 173] or focus on broader topics such as video captioning [236], video prediction [104], video action detection [261] and zero-shot video action recognition [96]. In this paper:

尽管有大量的基于深度学习的视频动作识别模型,但还没有专门针对这些模型的全面调研。先前的综述论文要么将更多的精力投入到手工提取特征的方法[77,173]中,要么专注于更广泛的主题,例如视频字幕[236]、视频预测[104]、视频动作检测[261]和零样本视频动作识别[96]。在本文中:

  • We comprehensively review over 200 papers on deep learning for video action recognition. We walk the readers through the recent advancements chronologically and systematically, with popular papers explained in detail.
  • We benchmark widely adopted methods on the same set of datasets in terms of both accuracy and efficiency. We also release our implementations for full reproducibility.
  • We elaborate on challenges, open problems, and opportunities in this field to facilitate future research.
  • 我们全面调研了200多篇基于深度学习的视频动作识别论文。我们按时间顺序系统地引导读者浏览这个领域的最新进展,并详细解释了其中流行的论文。
  • 我们在同一组数据集上对这些广泛采用的方法的准确性和效率进行了基准测试。我们还发布了可完全复现实验结论的实现代码。
  • 我们详细介绍了该领域中的挑战,已知问题和未来发展机遇,以促进未来的研究。

The rest of the survey is organized as follows. We first describe popular datasets used for benchmarking and existing challenges in section 2. Then we present recent advancements using deep learning for video action recognition in section 3, which is the major contribution of this survey. In section 4, we evaluate widely adopted approaches on standard benchmark datasets, and provide discussions and future research opportunities in section 5.

其余的内容安排如下。我们首先在第2节中介绍用于基准测试的流行数据集以及该领域的现有挑战。然后在第3节中介绍使用深度学习进行视频动作识别任务的最新进展,这是本综述的主要贡献。在第4节中,我们在标准基准数据集上评估了广泛采用的方法。在第5节中,我们进行了讨论并提出未来的研究机遇。

Datasets and Challenges(数据集和挑战)

Datasets(数据集)

Deep learning methods usually improve in accuracy when the volume of the training data grows. In the case of video action recognition, this means we need large-scale annotated datasets to learn effective models.

当训练数据量增加时,深度学习方法通常可以提高准确性。在视频动作识别的情况下,这意味着我们需要大规模的带标注的数据集才能训练有效的模型。

For the task of video action recognition, datasets are often built by the following process: (1) Define an action list, by combining labels from previous action recognition datasets and adding new categories depending on the use case. (2) Obtain videos from various sources, such as YouTube and movies, by matching the video title/subtitle to the action list. (3) Provide temporal annotations manually to indicate the start and end position of the action, and (4) finally clean up the dataset by de-duplication and filtering out noisy classes/samples. Below we review the most popular large-scale video action recognition datasets in Table 1 and Figure 2.

对于视频动作识别任务,数据集通常通过以下过程来创建:(1)通过组合现有动作识别数据集的标签并根据使用场景添加新的类别,来定义动作列表;(2)通过将视频标题/字幕与动作列表匹配,从YouTube或电影等各种来源获取视频;(3)手动提供时间标注,以指示动作的开始和结束位置;(4)最后清理数据集,去除重复数据并过滤掉有噪声的类别和样本。下面我们回顾最受欢迎的大规模视频动作识别数据集,如表1及图2所示。

Table 1. A list of popular datasets for video action recognition
表1:流行的视频动作识别数据集列表

Dataset Year # Samples Ave. Len # Actions
HMDB51 [109] 2011 7K ~5s 51
UCF101 [190] 2012 13.3K ~6s 101
Sports1M [99] 2014 1.1M ~5.5m 487
ActivityNet [40] 2015 28K [5,10]m 200
YouTube8M [1] 2016 8M 229.6s 3862
Charades [186] 2016 9.8K 30.1s 157
Kinetics400 [100] 2017 306K 10s 400
Kinetics600 [12] 2018 482K 10s 600
Kinetics700 [13] 2019 650K 10s 700
Sth-Sth V1 [69] 2017 108.5K [2,6]s 174
Sth-Sth V2 [69] 2017 220.8K [2,6]s 174
AVA [70] 2017 385K 15m 80
AVA-kinetics [117] 2020 624K 15m,10s 80
MIT [142] 2018 1M 3s 339
HACS Clips [267] 2019 1.55M 2s 200
HVU [34] 2020 572K 10s 739
AViD [165] 2020 450K [3,15]s 887

HMDB51 [109] was introduced in 2011. It was collected mainly from movies, and a small proportion from public databases such as the Prelinger archive, YouTube and Google videos. The dataset contains 6,849 clips divided into 51 action categories, each containing a minimum of 101 clips. The dataset has three official splits. Most previous papers either report the top-1 classification accuracy on split 1 or the average accuracy over three splits.

HMDB51[109]于2011年推出。它主要从电影中收集,另外一小部分来自公共数据库(如Prelinger archive、YouTube和Google videos)。数据集包含6849个视频剪辑,分为51个动作类别,每个类别至少包含101个剪辑。数据集具有三个官方的划分分段。以前的大多数论文要么报告在分段1上的top-1分类准确率,要么报告三个分段的平均准确率。

UCF101 [190] was introduced in 2012 and is an extension of the previous UCF50 dataset. It contains 13,320 videos from YouTube spreading over 101 categories of human actions. The dataset has three official splits similar to HMDB51, and is also evaluated in the same manner.

UCF101[190]于2012年推出,是对先前的UCF50数据集的扩展。它包含来自YouTube的13320部视频,内容涉及101种动作。该数据集具有类似于HMDB51的三个官方划分分段,并且也以相同的方式进行评估。

Sports1M [99] was introduced in 2014 as the first large-scale video action dataset which consisted of more than 1 million YouTube videos annotated with 487 sports classes. The categories are fine-grained which leads to low inter-class variations. It has an official 10-fold cross-validation split for evaluation.

Sports1M[99]于2014年推出,是第一个大规模的视频动作数据集,包含超过100万个YouTube视频,并带有487个体育运动类别标注。这些类别是细粒度的,导致类间差异较小。它具有官方的10折交叉验证划分用于评估。

ActivityNet [40] was originally introduced in 2015 and the ActivityNet family has several versions since its initial launch. The most recent ActivityNet 200 (V1.3) contains 200 human daily living actions. It has 10,024 training, 4,926 validation, and 5,044 testing videos. On average there are 137 untrimmed videos per class and 1.41 activity instances per video.

ActivityNet[40]最初于2015年推出,自其最初发布以来,ActivityNet系列有多个版本。最新的ActivityNet 200(V1.3)包含200种人类日常活动。它具有10024个训练视频,4926个验证视频和5044个测试视频。每个动作分类平均含有137个未修剪的视频,每个视频平均含有1.41种动作分类。

YouTube8M [1] was introduced in 2016 and is by far the largest-scale video dataset that contains 8 million YouTube videos (500K hours of video in total) and annotated with 3,862 action classes. Each video is annotated with one or multiple labels by a YouTube video annotation system. This dataset is split into training, validation and test in the ratio 70:20:10. The validation set of this dataset is also extended with human-verified segment annotations to provide temporal localization information.

YouTube8M[1]于2016年推出,是迄今为止规模最大的视频数据集,包含800万个YouTube视频(总计50万小时),并标注有3862种动作分类。每个视频都由YouTube视频标注系统标注一个或多个标签。该数据集按70:20:10的比例分为训练集、验证集和测试集。该数据集的验证集还扩展了经人工验证的片段标注,以提供时间定位信息。

Charades [186] was introduced in 2016 as a dataset for real-life concurrent action understanding. It contains 9,848 videos with an average length of 30 seconds. This dataset includes 157 multi-label daily indoor activities, performed by 267 different people. It has an official train-validation split that has 7,985 videos for training and the remaining 1,863 for validation.

Charades[186]于2016年推出,用于现实生活中的并发动作理解。它包含9848个视频,每个视频平均时长为30秒。该数据集包括由267个不同的人执行的157种多标签日常室内活动。它有一个官方的训练/验证划分,其中7985个视频用于训练,其余1863个用于验证。

Kinetics Family is now the most widely adopted benchmark. Kinetics400 [100] was introduced in 2017 and it consists of approximately 240k training and 20k validation videos trimmed to 10 seconds from 400 human action categories. The Kinetics family continues to expand, with Kinetics-600 [12] released in 2018 with 480K videos and Kinetics700[13] in 2019 with 650K videos.

Kinetics系列是现在最广泛采用的基准测试数据集。Kinetics400[100]于2017年推出,它包含大约24万个训练视频和2万个验证视频,这些视频分为400种人类动作类别,并统一剪辑到10秒。Kinetics系列不断扩大:2018年发布的Kinetics-600[12]包含48万个视频,2019年发布的Kinetics700[13]包含65万个视频。

20BN-Something-Something [69] V1 was introduced in 2017 and V2 was introduced in 2018. This family is another popular benchmark that consists of 174 action classes that describe humans performing basic actions with everyday objects. There are 108,499 videos in V1 and 220,847 videos in V2. Note that the Something-Something dataset requires strong temporal modeling because most activities cannot be inferred based on spatial features alone (e.g. opening something, covering something with something).

20BN-Something-Something[69]V1版本于2017年推出,V2版本于2018年推出。该系列是另一个受欢迎的基准测试数据集,由174个动作类别组成,描述人类使用日常物品执行基本动作。V1版本中有108499个视频,V2版本中有220847个视频。请注意,Something-Something数据集需要强大的时序建模能力,因为大多数动作不能仅基于空间特征信息来推断(例如:打开某物,用某物覆盖某物)。

AVA [70] was introduced in 2017 as the first large-scale spatio-temporal action detection dataset. It contains 430 15-minute video clips with 80 atomic action labels (only 60 labels were used for evaluation). The annotations were provided at each key-frame, which leads to 214,622 training, 57,472 validation and 120,322 testing samples. The AVA dataset was recently expanded to AVA-Kinetics with 352,091 training, 89,882 validation and 182,457 testing samples [117].

AVA[70]于2017年推出,是第一个大规模的时空动作检测数据集。它包含430个15分钟的视频剪辑,带有80个原子动作标签(仅60个标签用于评估)。标注在每个关键帧上提供,由此产生214622个训练样本、57472个验证样本和120322个测试样本。AVA数据集最近已扩展为AVA-Kinetics数据集,包含352091个训练样本、89882个验证样本和182457个测试样本[117]。

Moments in Time [142] was introduced in 2018 and it is a large-scale dataset designed for event understanding. It contains one million 3 second video clips, annotated with a dictionary of 339 classes. Different from other datasets designed for human action understanding, Moments in Time dataset involves people, animals, objects and natural phenomena. The dataset was extended to Multi-Moments in Time (M-MiT) [143] in 2019 by increasing the number of videos to 1.02 million, pruning vague classes, and increasing the number of labels per video.

Moments in Time[142]于2018年推出,它是一个专为事件理解设计的大规模数据集。它包含一百万个3秒的视频剪辑,并标注有339种类别。与其他为人类动作理解设计的数据集不同,Moments in Time数据集涉及人、动物、物体和自然现象。该数据集于2019年扩展为Multi-Moments in Time (M-MiT)[143],视频数量增加到102万,并去除了一些模糊的类别,同时增加了每个视频的标签数量。

HACS [267] was introduced in 2019 as a new large-scale dataset for recognition and localization of human actions collected from Web videos. It consists of two kinds of manual annotations. HACS Clips contains 1.55M 2-second clip annotations on 504K videos, and HACS Segments has 140K complete action segments (from action start to end) on 50K videos. The videos are annotated with the same 200 human action classes used in ActivityNet (V1.3) [40].

HACS[267]于2019年推出,是一个新的大规模数据集,用于识别和定位从Web视频中收集的人类动作。它包含两种手动标注:HACS Clips在50.4万个视频上包含155万条2秒剪辑标注,HACS Segments在5万个视频上包含14万个完整的动作片段(从动作开始到结束)。这些视频使用与ActivityNet(V1.3)[40]相同的200种人类动作类别进行标注。

HVU [34] dataset was released in 2020 for multi-label multi-task video understanding. This dataset has 572K videos and 3,142 labels. The official split has 481K, 31K and 65K videos for train, validation, and test respectively. This dataset has six task categories: scene, object, action, event, attribute, and concept. On average, there are about 2,112 samples for each label. The duration of the videos varies with a maximum length of 10 seconds.

HVU[34]数据集于2020年发布,用于多标签多任务下的视频理解。该数据集包含57.2万个视频和3142种分类标签。官方分组提供了48.1万个视频用于训练,3.1万个视频用于验证以及6.5万个视频用于测试。该数据集具有六个任务类别:场景,对象,动作,事件,属性和概念。每个标签平均大约有2112个视频。视频的时长最长不超过10秒。

AViD [165] was introduced in 2020 as a dataset for anonymized action recognition. It contains 410K videos for training and 40K videos for testing. Each video clip duration is between 3-15 seconds and in total it has 887 action classes. During data collection, the authors tried to collect data from various countries to deal with data bias. They also remove face identities to protect privacy of video makers. Therefore, AViD dataset might not be a proper choice for recognizing face-related actions.

AViD[165]于2020年作为匿名动作识别的数据集被引入。它包含41万个视频用于训练,4万个视频用于测试。每个视频剪辑的持续时间在3到15秒之间,共有887种动作类别。在数据收集过程中,作者试图从各个国家收集数据以应对数据偏差。他们还去除了人脸身份信息,以保护视频制作者的隐私。因此,AViD数据集可能不适合用于识别与面部相关的动作。

Before we dive into the chronological review of methods, we present several visual examples from the above datasets in Figure 4 to show their different characteristics. In the top two rows, we pick action classes from UCF101 [190] and Kinetics400 [100] datasets. Interestingly, we find that these actions can sometimes be determined by the context or scene alone. For example, the model can predict the action riding a bike as long as it recognizes a bike in the video frame. The model may also predict the action cricket bowling if it recognizes the cricket pitch. Hence for these classes, video action recognition may become an object/scene classification problem without the need of reasoning motion/temporal information. In the middle two rows, we pick action classes from Something-Something dataset [69]. This dataset focuses on human-object interaction, thus it is more fine-grained and requires strong temporal modeling. For example, if we only look at the first frame of dropping something and picking something up without looking at other video frames, it is impossible to tell these two actions apart. In the bottom row, we pick action classes from Moments in Time dataset [142]. This dataset is different from most video action recognition datasets, and is designed to have large inter-class and intra-class variation that represent dynamical events at different levels of abstraction. For example, the action climbing can have different actors (person or animal) in different environments (stairs or tree).

在我们按时间顺序深入探究方法之前,我们在图4中展示了上述数据集中的几个直观示例,以说明它们的不同特征。在前两行中,我们从UCF101[190]和Kinetics400[100]数据集中选择动作类别。有趣的是,我们发现这些动作有时可以仅由上下文或场景来确定。例如,只要模型能够识别出视频帧中的自行车,该模型就可以预测该动作是骑自行车;如果模型能够识别出板球场地,它也可以预测动作为板球投球(cricket bowling)。因此,对于这些类别,视频动作识别可能成为对象/场景分类问题,而无需推理运动/时间信息。在中间的两行中,我们从Something-Something数据集[69]中选择动作类别。该数据集专注于人与物体的交互,因此它的粒度更细,并且需要强大的时序建模能力。例如,如果我们只看"放下某物"和"拿起某物"的第一帧而不看其他视频帧,就不可能将这两个动作区分开。在最下面一行中,我们从Moments in Time数据集[142]中选择动作类别。此数据集与大多数视频动作识别数据集不同,它被设计成具有较大的类内和类间差异,以表示不同抽象级别的动态事件。例如,动作"攀爬"可以由不同的执行者(人或动物)在不同的环境(楼梯或树)中完成。

Challenges(挑战)

There are several major challenges in developing effective video action recognition algorithms.

开发有效的视频动作识别算法存在几个主要挑战。

In terms of dataset, first, defining the label space for training action recognition models is non-trivial. It's because human actions are usually composite concepts and the hierarchy of these concepts is not well-defined. Second, annotating videos for action recognition is laborious (e.g., need to watch all the video frames) and ambiguous (e.g., hard to determine the exact start and end of an action). Third, some popular benchmark datasets (e.g., Kinetics family) only release the video links for users to download and not the actual video, which leads to a situation that methods are evaluated on different data. It is impossible to do fair comparisons between methods and gain insights.

在数据集方面,首先,为训练动作识别模型定义标签空间并非易事。这是因为人类的行为动作通常是复合概念,而这些概念的层次结构还没有很好地定义。其次,为动作识别视频添加标注既费力(例如,需要观看所有视频帧),又模棱两可(例如,难以确定动作的确切开始和结束)。第三,某些流行的基准数据集(例如Kinetics系列)仅发布供用户下载的视频链接,而不发布实际的视频,从而导致不同方法在不同的数据上进行评估。这使得在方法之间进行公平的比较并获得见解成为不可能。

In terms of modeling, first, videos capturing human actions have both strong intra- and inter-class variations. People can perform the same action in different speeds under various viewpoints. Besides, some actions share similar movement patterns that are hard to distinguish. Second, recognizing human actions requires simultaneous understanding of both short-term action-specific motion information and long-range temporal information. We might need a sophisticated model to handle different perspectives rather than using a single convolutional neural network. Third, the computational cost is high for both training and inference, hindering both the development and deployment of action recognition models. In the next section, we will demonstrate how video action recognition methods developed over the last decade to address the aforementioned challenges.

在建模方面,首先,记录人类行为的视频在类内和类间都有很大的差异。人们可以在不同的视角下以不同的速度执行相同的动作。此外,某些动作具有相似的运动模式,难以区分。其次,识别人类动作需要同时理解短期的特定动作的运动信息和长跨度的时间信息。我们可能需要一个复杂的模型来处理不同的方面,而不是使用单一的卷积神经网络。第三,训练和推理的计算成本都很高,从而阻碍了动作识别模型的开发和部署。在下一节中,我们将展示在过去十年中视频动作识别方法的发展过程,及其如何应对上述挑战。

An Odyssey of Using Deep Learning for Video Action Recognition(使用深度学习进行视频动作识别的冒险之旅)

In this section, we review deep learning based methods for video action recognition from 2014 to present and introduce the related earlier work in context.

在本节中,我们回顾了从2014年开始至今的基于深度学习的视频动作识别方法,并介绍了其相关的早期工作。

From handcrafted features to CNNs(从手工提取特征到卷积神经网络)

Despite there being some papers using Convolutional Neural Networks (CNNs) for video action recognition [200, 5, 91], hand-crafted features [209, 210, 158, 112], particularly Improved Dense Trajectories (IDT) [210], dominated the video understanding literature before 2015, due to their high accuracy and good robustness. However, hand-crafted features have heavy computational cost [244], and are hard to scale and deploy.

尽管已经有论文开始使用卷积神经网络(CNN)进行视频动作识别[200,5,91],但是基于手工提取特征的方法[209,210,158,112],特别是IDT方法[210],由于其高准确性和良好的健壮性,在2015年之前在视频理解领域占据了主导地位。但是,手工提取特征的方法计算成本庞大[244],并且难以扩展和部署。

With the rise of deep learning [107], researchers started to adapt CNNs for video problems. The seminal work DeepVideo [99] proposed to use a single 2D CNN model on each video frame independently and investigated several temporal connectivity patterns to learn spatio-temporal features for video action recognition, such as late fusion, early fusion and slow fusion. Though this model made early progress with ideas that would prove to be useful later such as a multi-resolution network, its transfer learning performance on UCF101 [190] was 20% less than hand-crafted IDT features (65.4% vs 87.9%). Furthermore, DeepVideo [99] found that a network fed by individual video frames, performs equally well when the input is changed to a stack of frames. This observation might indicate that the learnt spatio-temporal features did not capture the motion well. It also encouraged people to think about why CNN models did not outperform traditional hand-crafted features in the video domain unlike in other computer vision tasks [107, 171].

随着深度学习的兴起[107],研究人员开始尝试使用CNN解决视频问题。开创性的工作DeepVideo[99]提出在每个视频帧上独立使用单个2D CNN模型,并研究了几种时间连接模式以学习用于视频动作识别的时空特征,例如后期融合(late fusion)、早期融合(early fusion)和缓慢融合(slow fusion)。尽管该模型取得了早期进展,其中一些想法(例如多分辨率网络)后来被证明是有用的,但它在UCF101[190]上的迁移学习性能比手工提取特征的IDT方法低20%(65.4%对87.9%)。此外,DeepVideo[99]发现,以单个视频帧作为输入的网络,在把输入改为一叠视频帧时表现同样好。该观察结果可能表明,学到的时空特征没有很好地捕捉运动信息。这也促使人们思考,为什么CNN模型在视频领域不像在其他计算机视觉任务中那样优于传统的手工特征[107,171]。

Two-stream networks(双流网络)

Since video understanding intuitively needs motion information, finding an appropriate way to describe the temporal relationship between frames is essential to improving the performance of CNN-based video action recognition.

由于视频理解在直觉上需要运动信息,因此找到一种合适的方式来描述帧之间的时序关系,对于提高基于CNN的视频动作识别方法的性能至关重要。

Optical flow [79] is an effective motion representation to describe object/scene movement. To be precise, it is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene. We show several visualizations of optical flow in Figure 5. As we can see, optical flow is able to describe the motion pattern of each action accurately. The advantage of using optical flow is it provides orthogonal information compared to the RGB image. For example, the two images on the bottom of Figure 5 have cluttered backgrounds. Optical flow can effectively remove the nonmoving background and result in a simpler learning problem compared to using the original RGB images as input. In addition, optical flow has been shown to work well on video problems. Traditional hand-crafted features such as IDT [210] also contain optical-flow-like features, such as Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH).

光流[79]是描述物体/场景运动的一种有效运动表示。准确地说,它是视觉场景中物体、表面和边缘的表观运动模式,该运动由观察者和场景之间的相对运动引起。我们在图5中展示了光流的几种可视化结果。我们可以看到,光流能够准确地描述每个动作的运动模式。使用光流的优点是,与RGB图像相比,它提供了正交的信息。例如,图5下方的两个图像背景杂乱无章。与使用原始RGB图像作为输入相比,光流可以有效去除静止的背景,使学习问题更简单。此外,光流已被证明可以很好地解决视频问题。传统的手工特征(例如IDT[210])也包含类似光流的特征,例如光流直方图(Histogram of Optical Flow, HOF)和运动边界直方图(Motion Boundary Histogram, MBH)。
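下面给出一个极简的示意代码,演示如何估计两帧之间的稠密光流并把分量线性缩放到[0,255](与上文所述的预处理思路一致)。这里仅作为示例使用OpenCV的Farneback算法;原论文采用的光流算法与具体参数可能不同,截断边界bound等均为示意性假设。

```python
import cv2
import numpy as np

def dense_flow(prev_frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    """估计两帧之间的稠密光流,返回 H x W x 2 的位移场。"""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Farneback稠密光流,参数为OpenCV常用取值(仅作示例)
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

def rescale_to_uint8(flow: np.ndarray, bound: float = 20.0) -> np.ndarray:
    """把光流的水平/垂直分量线性缩放到[0, 255],便于以JPEG形式保存。"""
    flow = np.clip(flow, -bound, bound)
    return np.uint8(255.0 * (flow + bound) / (2 * bound))
```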

Hence, Simonyan et al. [187] proposed two-stream networks, which included a spatial stream and a temporal stream as shown in Figure 6. This method is related to the two-streams hypothesis [65], according to which the human visual cortex contains two pathways: the ventral stream (which performs object recognition) and the dorsal stream (which recognizes motion). The spatial stream takes raw video frame(s) as input to capture visual appearance information. The temporal stream takes a stack of optical flow images as input to capture motion information between video frames. To be specific, [187] linearly rescaled the horizontal and vertical components of the estimated flow (i.e., motion in the x-direction and y-direction) to a [0, 255] range and compressed using JPEG. The output corresponds to the two optical flow images shown in Figure 6. The compressed optical flow images will then be concatenated as the input to the temporal stream with a dimension of H x W x 2L, where H, W and L indicate the height, width and the length of the video frames. In the end, the final prediction is obtained by averaging the prediction scores from both streams.

因此,Simonyan等人[187]提出了双流网络,其中包括一个空间流和一个时间流,如图6所示。该方法与双流假说[65]有关,该假说认为,人类视觉皮层包含两个途径:腹侧流(识别物体)和背侧流(识别运动)。空间流将原始视频帧作为输入来捕获视觉外观信息。时间流将一堆光流图像作为输入,以捕获视频帧之间的运动信息。具体而言,[187]将估计流的水平和垂直分量(即,沿x方向和y方向的运动)线性地重新缩放到[0,255]范围,并使用JPEG压缩。输出的两个光流图像如图6所示。压缩后的光流图像将被连接作为时间流的输入,尺寸为H x W x 2L,其中H,W和L表示视频帧的高度、宽度和长度。最后,通过取两个流的平均预测得分作为最终预测结果。
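下面是一个极简的PyTorch示意代码,展示上文描述的双流结构:空间流以RGB帧为输入,时间流以堆叠的2L个光流通道为输入,最后对两个流的预测得分取平均(后期融合)。其中的骨干网络(resnet18)和各项尺寸仅为示意性假设,并非[187]中使用的原始架构。

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamNet(nn.Module):
    """空间流处理单张RGB帧,时间流处理堆叠的光流图(2L个通道)。"""
    def __init__(self, num_classes: int, flow_length: int = 10):
        super().__init__()
        self.spatial = models.resnet18(num_classes=num_classes)   # 输入: 3 x H x W
        self.temporal = models.resnet18(num_classes=num_classes)  # 输入: 2L x H x W
        # 修改时间流的第一层卷积,使其接受2L个光流通道
        self.temporal.conv1 = nn.Conv2d(2 * flow_length, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)

    def forward(self, rgb, flow_stack):
        # 后期融合:对两个流的类别得分取平均
        return (self.spatial(rgb) + self.temporal(flow_stack)) / 2

# 用法示例: rgb为(N, 3, 224, 224), flow_stack为(N, 2L, 224, 224)
model = TwoStreamNet(num_classes=101, flow_length=10)
scores = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
```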

By adding the extra temporal stream, for the first time, a CNN-based approach achieved performance similar to the previous best hand-crafted feature IDT on UCF101 (88.0% vs 87.9%) and on HMDB51 [109] (59.4% vs 61.1%). [187] makes two important observations. First, motion information is important for video action recognition. Second, it is still challenging for CNNs to learn temporal information directly from raw video frames. Pre-computing optical flow as the motion representation is an effective way for deep learning to reveal its power. Since [187] managed to close the gap between deep learning approaches and traditional hand-crafted features, many follow-up papers on two-stream networks emerged and greatly advanced the development of video action recognition. Here, we divide them into several categories and review them individually.

通过添加额外的时间流,基于CNN的方法首次在UCF101(88.0%对87.9%)和HMDB51[109](59.4%对61.1%)上取得了与此前最佳的手工特征方法IDT相近的性能。[187]给出了两个重要的观察。首先,运动信息对于视频动作识别很重要。其次,对于CNN而言,直接从原始视频帧中学习时间信息仍然具有挑战性。预计算光流作为运动表示,是让深度学习在视频动作识别领域展现其力量的有效方法。由于[187]设法缩小了深度学习方法与传统手工特征之间的差距,因此出现了许多关于双流网络的后续论文,并极大地推动了视频动作识别领域的发展。在这里,我们将它们分为几类并分别进行回顾。

Using deeper network architectures(使用更深的网络结构)

Two-stream networks [187] used a relatively shallow network architecture [107]. Thus a natural extension to the two-stream networks involves using deeper networks. However, Wang et al. [215] find that simply using deeper networks does not yield better results, possibly due to overfitting on the small-sized video datasets [190, 109]. Recall from section 2.1, UCF101 and HMDB51 datasets only have thousands of training videos. Hence, Wang et al. [217] introduce a series of good practices, including cross-modality initialization, synchronized batch normalization, corner cropping and multi-scale cropping data augmentation, large dropout ratio, etc. to prevent deeper networks from overfitting. With these good practices, [217] was able to train a two-stream network with the VGG16 model [188] that outperforms [187] by a large margin on UCF101. These good practices have been widely adopted and are still being used. Later, Temporal Segment Networks (TSN) [218] performed a thorough investigation of network architectures, such as VGG16, ResNet [76], Inception [198], and demonstrated that deeper networks usually achieve higher recognition accuracy for video action recognition. We will describe more details about TSN in section 3.2.4.

双流网络[187]使用了相对较浅的网络体系结构[107]。因此,对双流网络的一个自然扩展就是使用更深的网络。然而,Wang等人[215]发现,仅使用更深的网络并不能产生更好的结果,这可能是由于在小规模视频数据集[190,109]上的过拟合。回顾第2.1节,UCF101和HMDB51数据集只有数千个训练视频。因此,Wang等人[217]引入了一系列良好实践,包括跨模态初始化、同步批归一化、边角裁剪和多尺度裁剪数据增广、较大的dropout比率等,以防止更深的网络过拟合。凭借这些良好实践,[217]使用VGG16模型[188]训练了双流网络,在UCF101上的表现大大优于[187]。这些良好实践已被广泛采用,并且仍在使用中。后来,Temporal Segment Networks(TSN)[218]对网络体系结构(例如VGG16、ResNet[76]、Inception[198])进行了全面研究,并证明了更深的网络通常可以实现更高的视频动作识别准确度。我们将在3.2.4节中描述有关TSN的更多细节。

Two-stream fusion(双流混合)

Since there are two streams in a two-stream network, there will be a stage that needs to merge the results from both networks to obtain the final prediction. This stage is usually referred to as the spatial-temporal fusion step.

由于两流网络中有两个流,因此会有一个阶段需要合并两个网络的结果以取得最终预测结果。此阶段通常称为时空融合阶段。

The easiest and most straightforward way is late fusion, which performs a weighted average of predictions from both streams. Despite late fusion being widely adopted [187, 217], many researchers claim that this may not be the optimal way to fuse the information between the spatial appearance stream and temporal motion stream. They believe that earlier interactions between the two networks could benefit both streams during model learning and this is termed as early fusion.

最简单直接的方法是后期融合(late fusion),它的预测结果直接由两个流的加权平均值得到。尽管后期融合被广泛采用[187,217],但许多研究人员称这可能不是在空间表现流和时间运动流之间融合信息的最佳方法。他们认为,在两个网络之间进行早期交互可以在模型学习期间使两个流都受益,这被称为早期融合(early fusion)。

Fusion [50] is one of the first of several papers investigating the early fusion paradigm, including how to perform spatial fusion (e.g., using operators such as sum, max, bilinear, convolution and concatenation), where to fuse the network (e.g., the network layer where early interactions happen), and how to perform temporal fusion (e.g., using 2D or 3D convolutional fusion in later stages of the network). [50] shows that early fusion is beneficial for both streams to learn richer features and leads to improved performance over late fusion. Following this line of research, Feichtenhofer et al. [46] generalizes ResNet [76] to the spatiotemporal domain by introducing residual connections between the two streams. Based on [46], Feichtenhofer et al. [47] further propose a multiplicative gating function for residual networks to learn better spatio-temporal features. Concurrently, [225] adopts a spatio-temporal pyramid to perform hierarchical early fusion between the two streams.

Fusion[50]是最早研究早期融合范式的几篇论文之一,其研究内容包括如何执行空间融合(例如,使用求和、最大值、双线性、卷积和级联等运算),在何处融合网络(例如,发生早期交互的网络层),以及如何执行时间融合(例如,在网络的后期使用2D或3D卷积融合)。[50]表明,早期融合有利于两个流学习更丰富的特征,并且与后期融合相比能够提高性能。沿着这一研究路线,Feichtenhofer等人[46]通过在两个流之间引入残差连接(residual connections),将ResNet[76]推广到时空域。在[46]的基础上,Feichtenhofer等人[47]进一步提出了用于残差网络的乘法门控函数(multiplicative gating function),以学习更好的时空特征。同时,[225]采用时空金字塔在两个流之间执行分层的早期融合。
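作为对上文空间融合算子的一个补充说明,下面给出一个极简的示意代码:对来自空间流和时间流的同尺寸特征图,先在通道维拼接,再用1x1卷积进行融合(即所谓的conv fusion)。其中通道数等均为示意性假设,并非[50]中的具体配置。

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """把空间流与时间流的同尺寸特征图按通道拼接,再用1x1卷积融合。"""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, spatial_feat, temporal_feat):
        return self.fuse(torch.cat([spatial_feat, temporal_feat], dim=1))

# 用法示例:融合两个来自中间层的512通道特征图
fused = ConvFusion(512)(torch.randn(2, 512, 14, 14), torch.randn(2, 512, 14, 14))
```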

Recurrent neural networks(循环神经网络)

Since a video is essentially a temporal sequence, researchers have explored Recurrent Neural Networks (RNNs) for temporal modeling inside a video, particularly the usage of Long Short-Term Memory (LSTM) [78].

由于视频本质上是一个时间序列,因此研究人员已经探索了使用循环神经网络(RNN)用于视频内部的时间建模,特别是使用长短期记忆(LSTM)[78]。

LRCN [37] and Beyond-Short-Snippets [253] are the first of several papers that use LSTM for video action recognition under the two-stream networks setting. They take the feature maps from CNNs as an input to a deep LSTM network, and aggregate frame-level CNN features into video-level predictions. Note that they use LSTM on two streams separately, and the final results are still obtained by late fusion. However, there is no clear empirical improvement from LSTM models [253] over the two-stream baseline [187]. Following the CNN-LSTM framework, several variants are proposed, such as bi-directional LSTM [205], CNN-LSTM fusion [56] and hierarchical multi-granularity LSTM network [118]. [125] described VideoLSTM which includes a correlation-based spatial attention mechanism and a lightweight motion-based attention mechanism. VideoLSTM not only shows improved results on action recognition, but also demonstrates how the learned attention can be used for action localization by relying on just the action class label. Lattice-LSTM [196] extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations, so that it can accurately model long-term and complex motions. ShuttleNet [183] is a concurrent work that considers both feedforward and feedback connections in a RNN to learn long-term dependencies. FASTER [272] designed a FAST-GRU to aggregate clip-level features from an expensive backbone and a cheap backbone. This strategy reduces the processing cost of redundant clips and hence accelerates the inference speed.

LRCN[37]和Beyond-Short-Snippets[253]是在双流网络设定下使用LSTM进行视频动作识别的最早的几篇论文。它们将来自CNN的特征图作为深层LSTM网络的输入,并将帧级别的CNN特征聚合为视频级别的预测。请注意,它们分别在两个流上使用LSTM,最终结果仍通过后期融合获得。但是,与双流基线[187]相比,LSTM模型[253]并没有明显的实验性能提升。遵循CNN-LSTM框架,研究者提出了几种变体,例如双向LSTM[205]、CNN-LSTM融合[56]和分层多粒度LSTM网络[118]。[125]描述了VideoLSTM,其中包括基于相关性的空间注意力机制和基于运动的轻量级注意力机制。VideoLSTM不仅展示了动作识别方面的改进结果,而且还演示了如何仅依靠动作分类标签将所学的注意力用于动作定位。Lattice-LSTM[196]通过为各个空间位置学习记忆单元的独立隐藏状态转移来扩展LSTM,因此它可以准确地建模长期和复杂的运动。ShuttleNet[183]是一项同期工作,它同时考虑了RNN中的前馈和反馈连接,以学习长期依赖关系。FASTER[272]设计了一种FAST-GRU,用于聚合来自高开销主干网络和低开销主干网络的片段级特征。这种策略降低了处理冗余片段的成本,从而加快了推理速度。
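下面给出一个极简的CNN-LSTM示意代码,展示上文所述的做法:逐帧提取CNN特征,再用LSTM聚合为视频级预测。骨干网络、隐藏层维度等均为示意性假设,并非LRCN或Beyond-Short-Snippets的原始实现。

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTM(nn.Module):
    """逐帧提取CNN特征,用LSTM聚合为视频级预测。"""
    def __init__(self, num_classes: int, hidden: int = 512):
        super().__init__()
        backbone = models.resnet18()
        backbone.fc = nn.Identity()               # 保留512维的帧级特征
        self.backbone = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):                      # clip: (N, T, 3, H, W)
        n, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1)) # (N*T, 512)
        out, _ = self.lstm(feats.view(n, t, -1))
        return self.fc(out[:, -1])                # 用最后一个时间步做预测

logits = CNNLSTM(num_classes=51)(torch.randn(2, 8, 3, 224, 224))
```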

However, the work mentioned above [37, 253, 125, 196, 183] use different two-stream networks/backbones. The differences between various methods using RNNs are thus unclear. Ma et al. [135] build a strong baseline for fair comparison and thoroughly study the effect of learning spatiotemporal features by using RNNs. They find that it requires proper care to achieve improved performance, e.g., LSTMs require pre-segmented data to fully exploit the temporal information. RNNs are also intensively studied in video action localization [189] and video question answering [274], but these are beyond the scope of this survey.

但是,上面提到的论文[37,253,125,196,183]使用了不同的双流网络/主干网络,因此各种使用RNN的方法之间的差异并不清晰。Ma等人[135]建立了一个强有力的基线以进行公平比较,并彻底研究了使用RNN学习时空特征的效果。他们发现需要细致的处理才能获得性能提升,例如LSTM需要预先分段的数据才能充分利用时间信息。RNN在视频动作定位[189]和视频问答[274]中也得到了深入研究,但这些超出了本综述的范围。

Segment-based methods(基于分段的方法)

Thanks to optical flow, two-stream networks are able to reason about short-term motion information between frames. However, they still cannot capture long-range temporal information. Motivated by this weakness of two-stream networks, Wang et al. [218] proposed a Temporal Segment Network (TSN) to perform video-level action recognition. Though initially proposed to be used with 2D CNNs, it is simple and generic. Thus recent work using either 2D or 3D CNNs is still built upon this framework.

多亏了光流,双流网络才能够推理帧之间的短期运动信息。但是,它们仍然无法捕获长跨度的时间信息。针对双流网络的这一弱点,Wang等人[218]提出了时间段网络(TSN)来执行视频级别的动作识别。尽管最初是为与2D CNN一起使用而提出的,但它简单且通用。因此,使用2D或3D CNN的最新工作仍然建立在此框架之上。

To be specific, as shown in Figure 6, TSN first divides a whole video into several segments, where the segments distribute uniformly along the temporal dimension. Then TSN randomly selects a single video frame within each segment and forwards them through the network. Here, the network shares weights for input frames from all the segments. In the end, a segmental consensus is performed to aggregate information from the sampled video frames. The segmental consensus could be operators like average pooling, max pooling, bilinear encoding, etc. In this sense, TSN is capable of modeling long-range temporal structure because the model sees the content from the entire video. In addition, this sparse sampling strategy lowers the training cost over long video sequences but preserves relevant information.

具体来说,如图6所示,TSN首先将整个视频划分为几个片段,这些片段沿时间维度均匀分布。然后,TSN在每个片段中随机选择一个视频帧,并将其输入网络。在此,网络对来自所有片段的输入帧共享权重。最后,执行分段共识(segmental consensus)来聚合所采样视频帧的信息。分段共识可以是平均池化、最大池化、双线性编码等运算。从这个意义上讲,TSN能够对长跨度时间结构进行建模,因为该模型可以看到整个视频的内容。另外,这种稀疏采样策略在降低长视频序列训练成本的同时保留了相关信息。
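下面是一个极简的示意代码,展示TSN式的稀疏采样与分段共识:将视频均匀划分为K个片段,在每个片段内随机采样一帧,最后对各片段的类别得分做平均池化共识。这只是对核心思想的草图,并非官方TSN实现。

```python
import torch

def sample_segment_indices(num_frames: int, num_segments: int) -> torch.Tensor:
    """TSN式稀疏采样:把视频均匀切成K段,在每段内随机取一帧的索引。"""
    seg_len = num_frames // num_segments
    offsets = torch.arange(num_segments) * seg_len
    return offsets + torch.randint(0, seg_len, (num_segments,))

def segmental_consensus(segment_scores: torch.Tensor) -> torch.Tensor:
    """平均池化形式的分段共识,输入为(N, K, num_classes)的各段得分。"""
    return segment_scores.mean(dim=1)

# 用法示例:从90帧的视频中采样3个片段,并对3段的得分做共识
indices = sample_segment_indices(num_frames=90, num_segments=3)
video_scores = segmental_consensus(torch.randn(2, 3, 101))
```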

Given TSN’s good performance and simplicity, most two-stream methods afterwards become segment-based two-stream networks. Since the segmental consensus is simply doing a max or average pooling operation, a feature encoding step might generate a global video feature and lead to improved performance as suggested in traditional approaches [179, 97, 157]. Deep Local Video Feature (DVOF) [114] proposed to treat the deep networks that trained on local inputs as feature extractors and train another encoding function to map the global features into global labels. Temporal Linear Encoding (TLE) network [36] appeared concurrently with DVOF, but the encoding layer was embedded in the network so that the whole pipeline could be trained end-to-end. VLAD3 and ActionVLAD [123, 63] also appeared concurrently. They extended the NetVLAD layer [4] to the video domain to perform video-level encoding, instead of using compact bilinear encoding as in [36]. To improve the temporal reasoning ability of TSN, Temporal Relation Network (TRN) [269] was proposed to learn and reason about temporal dependencies between video frames at multiple time scales. The recent state-of-the-art efficient model TSM [128] is also segment-based. We will discuss it in more detail in section 3.4.2.

鉴于TSN的良好性能和简便性,此后大多数双流方法都变成了基于分段的双流网络。由于分段共识只是简单的最大或平均池化操作,因此像传统方法[179,97,157]所建议的那样,加入一个特征编码步骤来生成全局视频特征,可能会带来性能提升。Deep Local Video Feature(DVOF)[114]提出将在局部输入上训练的深度网络视为特征提取器,并训练另一个编码函数将全局特征映射到全局标签。Temporal Linear Encoding(TLE)网络[36]与DVOF同时出现,但是其编码层嵌入在网络之中,因此整个流程可以端到端训练。VLAD3和ActionVLAD[123,63]也同时出现。它们将NetVLAD层[4]扩展到视频领域以执行视频级编码,而不是像[36]中那样使用紧凑的双线性编码。为了提高TSN的时间推理能力,Temporal Relation Network(TRN)[269]被提出,以学习和推理多个时间尺度上视频帧之间的时间依赖性。最近的高效SOTA(state-of-the-art)模型TSM[128]也是基于分段的。我们将在3.4.2节中更详细地讨论它。

Multi-stream networks(多流网络)

Two-stream networks are successful because appearance and motion information are two of the most important properties of a video. However, there are other factors that can help video action recognition as well, such as pose, object, audio and depth, etc.

双流网络之所以成功,是因为外观和运动信息是视频的两个最重要的属性。但是,还有其他因素也可以帮助视频动作识别,例如姿势、物体、音频和深度等。

Pose information is closely related to human action. We can recognize most actions by just looking at a pose (skeleton) image without scene context. Although there is previous work on using pose for action recognition [150, 246], P-CNN [23] was one of the first deep learning methods that successfully used pose to improve video action recognition. P-CNN proposed to aggregate motion and appearance information along tracks of human body parts, in a similar spirit to trajectory pooling [214]. [282] extended this pipeline to a chained multi-stream framework that computed and integrated appearance, motion and pose. They introduced a Markov chain model that added these cues successively and obtained promising results on both action recognition and action localization. PoTion [25] was a follow-up work to P-CNN, but introduced a more powerful feature representation that encoded the movement of human semantic keypoints. They first ran a decent human pose estimator and extracted heatmaps for the human joints in each frame. They then obtained the PoTion representation by temporally aggregating these probability maps. PoTion is lightweight and outperforms previous pose representations [23, 282]. In addition, it was shown to be complementary to standard appearance and motion streams, e.g. combining PoTion with I3D [14] achieved state-of-the-art result on UCF101 (98.2%).

姿势信息与人类行为密切相关。我们仅通过查看不带场景上下文的姿势(骨骼)图像,就可以识别大多数动作。尽管以前已有使用姿势进行动作识别的工作[150,246],但P-CNN[23]是成功使用姿势改善视频动作识别的首批深度学习方法之一。P-CNN提出沿人体部位的轨迹聚合运动和外观信息,其思路类似于轨迹池化[214]。[282]将该流程扩展为链式多流框架,该框架计算并整合了外观、运动和姿势信息。他们引入了一个马尔可夫链模型,依次添加这些线索,并在动作识别和动作定位方面都取得了可喜的成果。PoTion[25]是P-CNN的后续工作,但引入了更强大的特征表示,用于编码人体语义关键点的运动。他们首先运行一个性能不错的人体姿态估计器,并在每一帧中提取人体关节的热力图(heatmap)。然后,他们通过在时间上聚合这些概率图来获得PoTion表示。PoTion轻量,且优于以前的姿势表示[23,282]。另外,它被证明与标准的外观流和运动流互补,例如将PoTion与I3D[14]结合使用,在UCF101上达到了SOTA的成绩(98.2%)。

Object information is another important cue because most human actions involve human-object interaction. Wu [232] proposed to leverage both object features and scene features to help video action recognition. The object and scene features were extracted from state-of-the-art pretrained object and scene detectors. Wang et al. [252] took a step further to make the network end-to-end trainable. They introduced a two-stream semantic region based method, by replacing a standard spatial stream with a Faster RCNN network [171], to extract semantic information about the object, person and scene.

对象信息是另一个重要线索,因为大多数人类行为都涉及人与物体的交互。Wu[232]提出同时利用对象特征和场景特征来帮助视频动作识别。对象和场景特征是从具有SOTA性能的预训练对象和场景检测器中提取的。Wang等人[252]更进一步,使网络可以端到端训练。他们引入了一种基于语义区域的双流方法,用Faster RCNN网络[171]替换标准空间流,以提取有关物体、人物和场景的语义信息。

Audio signals usually come with video, and are complementary to the visual information. Wu et al. [233] introduced a multi-stream framework that integrates spatial, short-term motion, long-term temporal and audio in videos to digest complementary clues. Recently, Xiao et al. [237] introduced AudioSlowFast following [45], by adding another audio pathway to model vision and sound in an unified representation.

音频信号通常与视频一同出现,并且是对视觉信息的补充。Wu等人[233]提出了一个多流框架,将视频中的空间、短期运动、长期时间和音频信息整合在一起,以利用互补的线索。最近,Xiao等人[237]仿照[45]提出了AudioSlowFast,通过添加额外的音频路径,以统一的表示方式对视觉和声音进行建模。

In RGB-D video action recognition field, using depth information is standard practice [59]. However, for vision-based video action recognition (e.g., only given monocular videos), we do not have access to ground truth depth information as in the RGB-D domain. An early attempt Depth2Action [280] uses off-the-shelf depth estimators to extract depth information from videos and use it for action recognition.

在RGB-D视频动作识别领域,使用深度信息是标准做法[59]。但是,对于基于视觉的视频动作识别(例如,仅给定单目视频),我们无法像RGB-D领域那样获得真实(ground truth)的深度信息。较早的尝试Depth2Action[280]使用现成的深度估计器从视频中提取深度信息,并将其用于动作识别。

Essentially, multi-stream networks are a way of multi-modality learning, using different cues as input signals to help video action recognition. We will discuss more on multi-modality learning in section 5.12.

本质上,多流网络是一种多模态学习的方式,它使用不同的线索作为输入信号来帮助视频动作识别。我们将在5.12节中讨论有关多模态学习的更多内容。

The rise of 3D CNNs(3D卷积神经网络的崛起)

Pre-computing optical flow is computationally intensive and storage demanding, which is not friendly for large-scale training or real-time deployment. A conceptually easy way to understand a video is as a 3D tensor with two spatial and one time dimension. Hence, this leads to the usage of 3D CNNs as a processing unit to model the temporal information in a video.

预计算光流的计算量大且存储要求高,这对于大规模训练或实时部署而言并不友好。从概念上讲,理解视频的一种简单方式是将其视为具有两个空间维度和一个时间维度的3D张量。因此,这催生了使用3D CNN作为处理单元来对视频中的时间信息进行建模。

The seminal work for using 3D CNNs for action recognition is [91]. While inspiring, the network was not deep enough to show its potential. Tran et al. [202] extended [91] to a deeper 3D network, termed C3D. C3D follows the modular design of [188], which could be thought of as a 3D version of VGG16 network. Its performance on standard benchmarks is not satisfactory, but shows strong generalization capability and can be used as a generic feature extractor for various video tasks [250].

使用3D CNN进行动作识别的开创之作是[91]。尽管具有启发性,但该网络不够深,无法展现其潜力。Tran等人[202]将[91]扩展到更深的3D网络,称为C3D。C3D遵循[188]的模块化设计,可以将其视为VGG16网络的3D版本。它在标准基准测试上的性能并不令人满意,但显示出强大的泛化能力,可以用作各种视频任务的通用特征提取器[250]。

However, 3D networks are hard to optimize. In order to train a 3D convolutional filter well, people need a large-scale dataset with diverse video content and action categories. Fortunately, there exists a dataset, Sports1M [99] which is large enough to support the training of a deep 3D network. However, the training of C3D takes weeks to converge. Despite the popularity of C3D, most users just adopt it as a feature extractor for different use cases instead of modifying/fine-tuning the network. This is partially the reason why two-stream networks based on 2D CNNs dominated the video action recognition domain from year 2014 to 2017.

但是,3D网络很难优化。为了很好地训练3D卷积核,人们需要具有多样的视频内容和动作类别的大规模数据集。幸运的是,存在一个数据集Sports1M[99],它足够大,可以支持深度3D网络的训练。但是,C3D的训练需要数周时间才能收敛。尽管C3D很流行,但大多数用户只是将其用作针对不同用例的特征提取器,而不是修改/微调网络。这在一定程度上解释了为什么2014年至2017年间基于2D CNN的双流网络在视频动作识别领域占据主导地位。

The situation changed when Carreira et al. [14] proposed I3D in year 2017. As shown in Figure 6, I3D takes a video clip as input, and forwards it through stacked 3D convolutional layers. A video clip is a sequence of video frames, usually 16 or 32 frames are used. The major contributions of I3D are: 1) it adapts mature image classification architectures to use for 3D CNN; 2) For model weights, it adopts a method developed for initializing optical flow networks in [217] to inflate the ImageNet pre-trained 2D model weights to their counterparts in the 3D model. Hence, I3D bypasses the dilemma that 3D CNNs have to be trained from scratch. With pre-training on a new large-scale dataset Kinetics400 [100], I3D achieved a 95.6% on UCF101 and 74.8% on HMDB51. I3D ended the era where different methods reported numbers on small-sized datasets such as UCF101 and HMDB51. Publications following I3D needed to report their performance on Kinetics400, or other large-scale benchmark datasets, which pushed video action recognition to the next level. In the next few years, 3D CNNs advanced quickly and became top performers on almost every benchmark dataset. We will review the 3D CNNs based literature in several categories below.

这一状况在Carreira等人[14]于2017年提出I3D后发生了改变。如图6所示,I3D将视频剪辑作为输入,并将其通过堆叠的3D卷积层进行前向传播。视频剪辑是一系列视频帧,通常使用16或32帧。I3D的主要贡献是:1)它将成熟的图像分类网络体系结构改造用于3D CNN;2)对于模型权重,它采用[217]中为初始化光流网络而开发的方法,将ImageNet预训练的2D模型权重膨胀(inflate)为3D模型中的对应权重。因此,I3D绕开了3D CNN必须从零开始训练的难题。通过在新的大规模数据集Kinetics400[100]上进行预训练,I3D在UCF101上达到了95.6%的准确度,在HMDB51上达到了74.8%的准确度。I3D结束了不同方法在小型数据集(例如UCF101和HMDB51)上报告性能的时代。I3D之后的论文需要在Kinetics400或其他大规模基准数据集上报告其性能,这将视频动作识别领域推向了新的高度。在接下来的几年中,3D CNN迅速发展,并成为几乎所有基准数据集上的佼佼者。我们将在以下几个类别中回顾基于3D CNN的文献。
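下面用一个极简的示意代码说明上文提到的权重膨胀(inflation)思想:把2D卷积核沿时间维重复t次并除以t,使3D卷积核在面对静止("boring")视频时的初始响应,与预训练2D卷积核对单张图像的响应一致。函数名等均为示意性假设。

```python
import torch

def inflate_conv_weight(w2d: torch.Tensor, time_dim: int) -> torch.Tensor:
    """把形状为(C_out, C_in, k, k)的2D卷积核膨胀为(C_out, C_in, t, k, k)的3D卷积核:
    沿时间维重复t次并除以t,使其对静止视频的初始响应与原2D卷积核对单帧图像一致。"""
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return w3d / time_dim

# 用法示例:把一个7x7的预训练卷积核膨胀为3x7x7
w2d = torch.randn(64, 3, 7, 7)
w3d = inflate_conv_weight(w2d, time_dim=3)   # 形状为(64, 3, 3, 7, 7)
```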

We want to point out that 3D CNNs are not replacing two-stream networks, and they are not mutually exclusive. They just use different ways to model the temporal relationship in a video. Furthermore, the two-stream approach is a generic framework for video understanding, instead of a specific method. As long as there are two networks, one for spatial appearance modeling using RGB frames, the other for temporal motion modeling using optical flow, the method may be categorized into the family of two-stream networks. In [14], they also build a temporal stream with I3D architecture and achieved even higher performance, 98.0% on UCF101 and 80.9% on HMDB51. Hence, the final I3D model is a combination of 3D CNNs and two-stream networks. However, the contribution of I3D does not lie in the usage of optical flow.

我们要指出的是,3D CNN并不能替代双流网络,而且它们也不是互斥的,它们只是使用不同的方式来建模视频中的时序关系。此外,双流方法是用于视频理解的通用框架,而不是一种特定的方法。只要有两个网络,一个使用RGB帧进行空间外观建模,另一个使用光流进行时间运动建模,该方法就可以归入双流网络的范畴。在[14]中,他们还使用I3D架构构建了一个时间流,并获得了更高的性能:在UCF101上准确度为98.0%,在HMDB51上准确度为80.9%。因此,最终的I3D模型是3D CNN和双流网络的结合。但是,I3D的贡献并不在于光流的使用。

Mapping from 2D to 3D CNNs(映射2D CNN至3D CNN)

2D CNNs enjoy the benefit of pre-training brought by the large-scale of image datasets such as ImageNet [30] and Places205 [270], which cannot be matched even with the largest video datasets available today. On these datasets numerous efforts have been devoted to the search for 2D CNN architectures that are more accurate and generalize better. Below we describe the efforts to capitalize on these advances for 3D CNNs.

2D CNN可以享受大规模图像数据集(如ImageNet[30]和Places205[270])带来的预训练优势,即使当今最大的视频数据集也无法与之匹敌。在这些数据集上,人们投入了大量努力来寻找更准确、泛化性更好的2D CNN架构。下面我们将介绍为了让3D CNN利用这些进展所做的努力。

ResNet3D [74] directly took 2D ResNet [76] and replaced all the 2D convolutional filters with 3D kernels. They believed that by using deep 3D CNNs together with large-scale datasets one can exploit the success of 2D CNNs on ImageNet. Motivated by ResNeXt [238], Chen et al. [20] presented a multi-fiber architecture that slices a complex neural network into an ensemble of lightweight networks (fibers), which facilitates information flow between fibers while reducing the computational cost. Inspired by SENet [81], STCNet [33] proposes to integrate channel-wise information inside a 3D block to capture both spatial-channels and temporal-channels correlation information throughout the network.

ResNet3D[74]直接采用2D ResNet[76],并用3D卷积核替换了所有2D卷积核。他们认为,通过将深度3D CNN与大规模数据集结合使用,人们可以充分利用2D CNN在ImageNet上取得的成功。受ResNeXt[238]的启发,Chen等人[20]提出了一种多纤维(multi-fiber)架构,该架构将复杂的神经网络切分为数个轻量级网络(纤维)的集合,在促进纤维之间信息流动的同时降低了计算成本。受SENet[81]的启发,STCNet[33]提出在3D块内部整合逐通道(channel-wise)信息,以捕获整个网络中空间通道和时间通道的相关信息。

Unifying 2D and 3D CNNs(统一2D和3D CNN)

To reduce the complexity of 3D network training, P3D [169] and R2+1D [204] explore the idea of 3D factorization. To be specific, a 3D kernel (e.g., 3 x 3 x 3) can be factorized to two separate operations, a 2D spatial convolution (e.g., 1 x 3 x 3) and a 1D temporal convolution (e.g., 3 x 1 x 1). The differences between P3D and R2+1D are how they arrange the two factorized operations and how they formulate each residual block. Trajectory convolution [268] follows this idea but uses deformable convolution for the temporal component to better cope with motion.

为了降低3D网络训练的复杂性,P3D[169]和R2+1D[204]探索了3D分解的思想。具体而言,可以将一个3D卷积核(例如3 x 3 x 3)分解为两个单独的操作:一个2D空间卷积(例如1 x 3 x 3)和一个1D时间卷积(例如3 x 1 x 1)。P3D和R2+1D之间的区别在于它们如何安排这两个分解后的操作,以及如何构造每个残差块。轨迹卷积(Trajectory convolution)[268]遵循了这一思想,但对时间分量使用可变形卷积,以更好地应对运动。
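下面给出一个极简的PyTorch示意代码,展示上述"(2+1)D"分解:用一个1x3x3的空间卷积加一个3x1x1的时间卷积来替代完整的3x3x3卷积。其中mid_channels的取值仅为示意(R2+1D中会按参数量匹配的原则选取),并非原论文的具体实现。

```python
import torch
import torch.nn as nn

class Factorized3DConv(nn.Module):
    """用1x3x3的空间卷积加3x1x1的时间卷积替代完整的3x3x3卷积("(2+1)D"分解)。"""
    def __init__(self, in_channels: int, out_channels: int, mid_channels: int):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (N, C, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

# 用法示例:作用在8帧、56x56的特征图上
out = Factorized3DConv(64, 64, mid_channels=144)(torch.randn(1, 64, 8, 56, 56))
```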

Another way of simplifying 3D CNNs is to mix 2D and 3D convolutions in a single network. MiCTNet [271] integrates 2D and 3D CNNs to generate deeper and more informative feature maps, while reducing training complexity in each round of spatio-temporal fusion. ARTNet [213] introduces an appearance-and-relation network by using a new building block. The building block consists of a spatial branch using 2D CNNs and a relation branch using 3D CNNs. S3D [239] combines the merits from approaches mentioned above. It first replaces the 3D convolutions at the bottom of the network with 2D kernels, and find that this kind of top-heavy network has higher recognition accuracy. Then S3D factorizes the remaining 3D kernels as P3D and R2+1D do, to further reduce the model size and training complexity. A concurrent work named ECO [283] also adopts such a top-heavy network to achieve online video understanding.

简化3D CNN的另一种方法是在单个网络中混合2D和3D卷积。MiCTNet[271]集成了2D和3D CNN,以生成更深且信息更丰富的特征图,同时降低每一轮时空融合的训练复杂性。ARTNet[213]通过使用新的结构块引入了外观与关系网络(appearance-and-relation network)。该结构块由使用2D CNN的空间分支和使用3D CNN的关系分支组成。S3D[239]结合了上述方法的优点。它首先用2D卷积核替换了网络底部的3D卷积,并发现这种头重脚轻的网络具有更高的识别准确度。然后,S3D像P3D和R2+1D一样分解其余的3D卷积核,以进一步减小模型大小和训练复杂性。一项同期的名为ECO[283]的工作也采用了这种头重脚轻的网络来实现在线视频理解。

Long-range temporal modeling(长时序信息建模)

In 3D CNNs, long-range temporal connection may be achieved by stacking multiple short temporal convolutions, e.g., 3 x 3 x 3 filters. However, useful temporal information may be lost in the later stages of a deep network, especially for frames far apart.

在3D CNN中,可以通过堆叠多个短时序卷积(例如 3 x 3 x 3卷积核)来实现长时序连接。但是,有用的时间信息可能会在深度网络的后期阶段丢失,尤其是对于相距较远的帧而言。

In order to perform long-range temporal modeling, LTC [206] introduces and evaluates long-term temporal convolutions over a large number of video frames. However, limited by GPU memory, they have to sacrifice input resolution to use more frames. After that, T3D [32] adopted a densely connected structure [83] to keep the original temporal information as complete as possible to make the final prediction. Later, Wang et al. [219] introduced a new building block, termed non-local. Non-local is a generic operation similar to self-attention [207], which can be used for many computer vision tasks in a plug-and-play manner. As shown in Figure 6, they used a spacetime non-local module after later residual blocks to capture the long-range dependence in both space and temporal domain, and achieved improved performance over baselines without bells and whistles. Wu et al. [229] proposed a feature bank representation, which embeds information of the entire video into a memory cell, to make context-aware prediction. Recently, V4D [264] proposed video-level 4D CNNs, to model the evolution of long-range spatio-temporal representation with 4D convolutions.

为了进行长跨度时序建模,LTC[206]引入并评估了作用于大量视频帧上的长时序卷积。但是,受GPU显存限制,他们必须牺牲输入分辨率才能使用更多帧。此后,T3D[32]采用密集连接结构[83]来尽可能完整地保留原始时序信息,以便进行最终预测。后来,Wang等人[219]引入了一个新的结构块,称为non-local。non-local是一种类似于自注意力机制[207]的通用操作,可以以即插即用的方式用于许多计算机视觉任务。如图6所示,他们在较后的残差块之后使用时空non-local模块,以同时捕获空间和时间范围内的长距离依赖关系,并且在没有花里胡哨的情况下实现了优于基线的性能。Wu等人[229]提出了一种特征库表示,将整个视频的信息嵌入到一个记忆单元中,以进行上下文感知的预测。最近,V4D[264]提出了视频级4D CNN,以使用4D卷积对长跨度时空表示的演化进行建模。
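下面是一个简化的时空non-local块示意代码(embedded Gaussian形式):每个时空位置对所有其他位置计算相似度并加权聚合,再通过残差连接加回原特征。为简洁起见省略了[219]中可选的池化与批归一化等细节,通道数等均为示意。

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """简化的时空non-local块(embedded Gaussian形式):
    每个时空位置对所有其他位置做注意力加权聚合,并以残差方式加回输入。"""
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)
        self.out = nn.Conv3d(inter, channels, kernel_size=1)

    def forward(self, x):                                # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)     # (N, THW, C/2)
        k = self.phi(x).flatten(2)                       # (N, C/2, THW)
        v = self.g(x).flatten(2).transpose(1, 2)         # (N, THW, C/2)
        attn = torch.softmax(q @ k, dim=-1)              # 所有时空位置两两之间的相似度
        y = (attn @ v).transpose(1, 2).reshape(n, -1, t, h, w)
        return x + self.out(y)                           # 残差连接

y = NonLocalBlock(64)(torch.randn(1, 64, 4, 14, 14))
```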

Enhancing 3D efficiency(提高3D卷积的效率)

In order to further improve the efficiency of 3D CNNs (i.e., in terms of GFLOPs, model parameters and latency), many variants of 3D CNNs begin to emerge.

为了进一步提高3D CNN的效率(就GFLOP、模型参数和延迟而言),出现了许多3D CNN的变体。

Motivated by the development in efficient 2D networks, researchers started to adopt channel-wise separable convolution and extend it for video classification [111, 203]. CSN [203] reveals that it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions, and is able to obtain state-of-the-art performance while being 2 to 3 times faster than the previous best approaches. These methods are also related to multi-fiber networks [20] as they are all inspired by group convolution.

受高效2D网络发展的推动,研究人员开始采用逐通道可分离卷积(channel-wise separable convolution)并将其扩展到视频分类[111,203]。CSN[203]表明,通过分离通道交互和时空交互来分解3D卷积是一种良好实践,它能够在比以前最佳方法快2至3倍的同时取得SOTA的性能。这些方法也与多纤维网络[20]有关,因为它们都受到组卷积(group convolution)的启发。

Recently, Feichtenhofer et al. [45] proposed SlowFast, an efficient network with a slow pathway and a fast pathway. The network design is partially inspired by the biological Parvo- and Magnocellular cells in the primate visual systems. As shown in Figure 6, the slow pathway operates at low frame rates to capture detailed semantic information, while the fast pathway operates at high temporal resolution to capture rapidly changing motion. In order to incorporate motion information such as in two-stream networks, SlowFast adopts a lateral connection to fuse the representation learned by each pathway. Since the fast pathway can be made very lightweight by reducing its channel capacity, the overall efficiency of SlowFast is largely improved. Although SlowFast has two pathways, it is different from the two-stream networks [187], because the two pathways are designed to model different temporal speeds, not spatial and temporal modeling. There are several concurrent papers using multiple pathways to balance the accuracy and efficiency [43].

最近,Feichtenhofer等人[45]提出了SlowFast,一种具有慢速路径和快速路径的高效网络。其网络设计部分受到灵长类动物视觉系统中小细胞(Parvocellular)和大细胞(Magnocellular)的启发。如图6所示,慢速路径以低帧率运行以捕获详细的语义信息,而快速路径以高时间分辨率运行以捕获快速变化的运动。为了像双流网络那样融入运动信息,SlowFast采用横向连接来融合每条路径学习到的表示。由于可以通过减小通道容量使快速路径变得非常轻量,SlowFast的整体效率得到了极大提高。尽管SlowFast有两条路径,但它不同于双流网络[187],因为这两条路径旨在建模不同的时间速度,而不是分别进行空间建模和时间建模。还有几篇同期论文使用多条路径来平衡准确率和效率[43]。
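下面给出一个极简的示意代码,说明SlowFast的两个要点:两条路径以不同的帧率采样输入,以及通过时间维上带步长的卷积把快速路径的特征横向融合进慢速路径。其中alpha、卷积核大小和通道倍数均为示意性假设,并非[45]中的具体配置。

```python
import torch
import torch.nn as nn

def slowfast_inputs(video: torch.Tensor, alpha: int = 8):
    """由同一帧序列(N, C, T, H, W)构造两条路径的输入:
    快速路径保留全部帧,慢速路径每alpha帧取一帧。"""
    fast = video
    slow = video[:, :, ::alpha]
    return slow, fast

class LateralConnection(nn.Module):
    """用时间维上带步长的卷积把快速路径特征变换到慢速路径的时间长度,
    再与慢速路径特征按通道拼接(融合方式之一,尺寸均为示意)。"""
    def __init__(self, fast_channels: int, alpha: int = 8):
        super().__init__()
        self.transform = nn.Conv3d(fast_channels, 2 * fast_channels,
                                   kernel_size=(5, 1, 1), stride=(alpha, 1, 1),
                                   padding=(2, 0, 0))

    def forward(self, slow_feat, fast_feat):
        return torch.cat([slow_feat, self.transform(fast_feat)], dim=1)

# 用法示例:32帧输入,慢速路径得到4帧,快速路径保留32帧
slow, fast = slowfast_inputs(torch.randn(1, 3, 32, 224, 224))
```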

Following this line, Feichtenhofer [44] introduced X3D that progressively expand a 2D image classification architecture along multiple network axes, such as temporal duration, frame rate, spatial resolution, width, bottleneck width, and depth. X3D pushes the 3D model modification/factorization to an extreme, and is a family of efficient video networks to meet different requirements of target complexity. With similar spirit, A3D [276] also leverages multiple network configurations. However, A3D trains these configurations jointly and during inference deploys only one model. This makes the model at the end more efficient. In the next section, we will continue to talk about efficient video modeling, but not based on 3D convolutions.

遵循这一思路,Feichtenhofer[44]引入了X3D。X3D沿多个网络维度(例如时间长度、帧率、空间分辨率、宽度、瓶颈宽度(bottleneck width)和深度)逐步扩展一个2D图像分类体系结构。X3D将3D模型的修改/分解推到了极致,形成了一系列高效的视频网络,可满足不同目标复杂度的要求。本着类似的思想,A3D[276]也利用了多种网络配置。但是,A3D联合训练这些配置,并且在推理期间仅部署一个模型,这使得最终的模型更加高效。在下一节中,我们将继续讨论高效的视频建模,但不基于3D卷积。

Efficient Video Modeling(提高视频建模效率)

With the increase of dataset size and the need for deployment, efficiency becomes an important concern.

随着数据集大小的增加和部署的需求,效率成为一个重要的问题。

If we use methods based on two-stream networks, we need to pre-compute optical flow and store them on local disk. Taking Kinetics400 dataset as an illustrative example, storing all the optical flow images requires 4.5TB disk space. Such a huge amount of data would make I/O become the tightest bottleneck during training, leading to a waste of GPU resources and longer experiment cycle. In addition, pre-computing optical flow is not cheap, which means all the two-stream networks methods are not real-time.

如果使用基于双流网络的方法,则需要预先计算光流并将其存储在本地磁盘上。以Kinetics400数据集为例,存储所有光流图像需要4.5TB的磁盘空间。如此庞大的数据量会使I/O成为训练期间最严重的瓶颈,从而导致GPU资源的浪费和更长的实验周期。另外,预计算光流的成本并不低,这意味着所有双流网络方法都无法做到实时。

If we use methods based on 3D CNNs, people still find that 3D CNNs are hard to train and challenging to deploy. In terms of training, a standard SlowFast network trained on Kinetics400 dataset using a high-end 8-GPU machine takes 10 days to complete. Such a long experimental cycle and huge computing cost makes video understanding research only accessible to big companies/labs with abundant computing resources. There are several recent attempts to speed up the training of deep video models [230], but these are still expensive compared to most image-based computer vision tasks. In terms of deployment, 3D convolution is not as well supported as 2D convolution for different platforms. Furthermore, 3D CNNs require more video frames as input which adds additional IO cost.

如果我们使用基于3D CNN的方法,人们仍然会发现3D CNN很难训练并且难以部署。在训练方面,使用高端的拥有8个GPU的机器在Kinetics400数据集上训练标准SlowFast的网络需要10天才能完成。如此长的训练周期和巨大的计算成本,使得视频理解研究只有拥有大量计算资源的大公司/实验室才能进行。最近有几种尝试来加快深度视频模型的训练速度[230],但是与大多数基于图像的计算机视觉任务相比,这些方法仍然昂贵。在部署方面,在不同平台下对3D卷积的支持程度仍不如2D卷积。此外,3D CNN需要更多的视频帧作为输入,这增加了额外的IO成本。

Hence, starting from year 2018, researchers started to investigate other alternatives to see how they could improve accuracy and efficiency at the same time for video action recognition. We will review recent efficient video modeling methods in several categories below.

因此,从2018年开始,研究人员开始研究其他替代方案,以了解它们如何同时提高视频动作识别的准确性和效率。我们将从以下几个方面分别回顾最近有效的视频建模方法。

Flow-mimic approaches(流模拟方法)

One of the major drawbacks of two-stream networks is the need for optical flow. Pre-computing optical flow is computationally expensive, storage demanding, and not end-to-end trainable for video action recognition. It is appealing if we can find a way to encode motion information without using optical flow, at least during inference time.

双流网络的主要缺点之一是对光流的需求。预计算光流的计算成本是昂贵的,且需要存储空间,并且不能端到端地用于视频动作识别训练。有吸引力的是,至少在预测推理期间,我们可以找到一种无需使用光流即可对运动信息进行编码的方法。

[146] and [35] are early attempts at learning to estimate optical flow inside a network for video action recognition. Although these two approaches do not need optical flow during inference, they require optical flow during training in order to train the flow estimation network. Hidden two-stream networks [278] proposed MotionNet to replace the traditional optical flow computation. MotionNet is a lightweight network that learns motion information in an unsupervised manner, and when concatenated with the temporal stream, is end-to-end trainable. Thus, hidden two-stream CNNs [278] only take raw video frames as input and directly predict action classes without explicitly computing optical flow, regardless of whether it is the training or inference stage. PAN [257] mimics the optical flow features by computing the difference between consecutive feature maps. Following this direction, [197, 42, 116, 164] continue to investigate end-to-end trainable CNNs to learn optical-flow-like features from data. They derive such features directly from the definition of optical flow [255]. MARS [26] and D3D [191] used knowledge distillation to combine two-stream networks into a single stream, e.g., by tuning the spatial stream to predict the outputs of the temporal stream. Recently, Kwon et al. [110] introduced the MotionSqueeze module to estimate motion features. The proposed module is end-to-end trainable and can be plugged into any network, similar to [278].

[146]和[35]是用于学习估算网络内部的光流以进行视频动作识别的早期尝试。尽管这两种方法在推理过程中不需要光流,但是它们在训练过程中仍需要光流以训练光流估计网络。Hidden two-stream networks[278]提出了MotionNet来代替传统的光流计算。MotionNet是一种轻量级的网络,用于以无监督的方式学习运动信息,并且与时间流连接时,是端到端可训练的。因此,Hidden two-stream networks[278]仅采用原始视频帧作为输入即可直接预测动作类别,而无需显式计算光流,即不论是训练阶段还是预测推理阶段。PAN[257]通过计算连续特征图之间的差异来模拟光流特征。遵循这个方向,[197、42、116、164]继续研究端到端的可训练CNN,以从数据中学习类似光流的特征。他们直接从光流的定义中推导了这些特征[255]。MARS[26]和D3D[191]使用知识蒸馏(knowledge distillation)法将双流网络合并为单个流,例如,通过调整空间流以预测时间流的输出。最近,Kwon等人[110]引入了MotionSqueeze模块来估计运动特征。该模块是端到端可训练的,并且可以插入到任何网络中去,类似于[278]。
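
As a rough illustration of the flow-mimic idea (in the spirit of PAN's feature differences, not its exact module), the sketch below derives a motion-like cue simply by subtracting the per-frame feature maps of neighboring frames; the tensor shapes are assumptions.

```python
import torch

def feature_difference_motion(frame_feats: torch.Tensor) -> torch.Tensor:
    """Approximate motion cues from per-frame CNN feature maps.

    frame_feats: (N, T, C, H, W) features from any 2D backbone.
    Returns:     (N, T-1, C, H, W) differences between neighboring frames,
                 which can feed a temporal head in place of optical flow.
    """
    return frame_feats[:, 1:] - frame_feats[:, :-1]

motion = feature_difference_motion(torch.randn(4, 8, 256, 14, 14))
print(motion.shape)  # torch.Size([4, 7, 256, 14, 14])
```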

Temporal modeling without 3D convolution(不使用3D卷积建立时序模型)

A simple and natural choice to model temporal relationship between frames is using 3D convolution. However, there are many alternatives to achieve this goal. Here, we will review some recent work that performs temporal modeling without 3D convolution.

针对帧之间的时序关系,使用3D卷积是一种简单且自然的建模方法。但是,有许多替代方法可以实现此目标。在这里,我们将回顾一些在不进行3D卷积的情况下执行时间建模的最新工作。

Lin et al. [128] introduce a new method, termed temporal shift module (TSM). TSM extends the shift operation [228] to video understanding. It shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. In order to keep the spatial feature learning capacity, they put the temporal shift module inside the residual branch of a residual block. Thus all the information in the original activation is still accessible after the temporal shift through identity mapping. The biggest advantage of TSM is that it can be inserted into a 2D CNN to achieve temporal modeling with zero extra computation and zero extra parameters. Similar to TSM, TIN [182] introduces a temporal interlacing module to model the temporal convolution.

Lin等人[128]引入了一种新的方法,称为temporal shift module(TSM)。TSM将移位操作[228]扩展到视频理解领域。它沿着时间维度移动部分通道,从而促进相邻帧之间的信息交换。为了保持空间特征的学习能力,他们将时间偏移模块放在残差块的残差分支内。因此,经过时间偏移之后,仍可以通过恒等映射访问原始激活中的所有信息。TSM的最大优点是可以将其插入2D CNN中,以零额外计算和零额外参数的方式实现时间建模。类似于TSM,TIN[182]引入了一个时间交织模块来对时间卷积进行建模。
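
The core of TSM is a channel-wise shift along the time axis. Below is a minimal sketch of that operation (the 1/8 forward and 1/8 backward shift proportions are common choices but should be treated as assumptions); in the full model it sits inside the residual branch of a 2D ResNet block.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels along the temporal dimension.

    x: (N, T, C, H, W) per-frame features of a clip. 1/shift_div of the
    channels move one step forward in time, another 1/shift_div move one
    step backward, and the rest stay put. Zero-padding at the clip boundary
    keeps the shape unchanged, so the op adds no parameters and no FLOPs.
    """
    n, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # keep the rest
    return out

shifted = temporal_shift(torch.randn(2, 8, 64, 56, 56))
```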

There are several recent 2D CNN approaches using attention to perform long-term temporal modeling [92, 122, 132, 133]. STM [92] proposes a channel-wise spatio-temporal module to present the spatio-temporal features and a channel-wise motion module to efficiently encode motion features. TEA [122] is similar to STM, but inspired by SENet [81], TEA uses motion features to recalibrate the spatio-temporal features to enhance the motion pattern. Specifically, TEA has two components: motion excitation and multiple temporal aggregation; the first one handles short-range motion modeling and the second one efficiently enlarges the temporal receptive field for long-range temporal modeling. They are complementary and both light-weight, thus TEA is able to achieve competitive results with previous best approaches while keeping FLOPs as low as many 2D CNNs. Recently, TEINet [132] also adopts attention to enhance temporal modeling. Note that the above attention-based methods are different from non-local [219], because they use channel attention while non-local uses spatial attention.

最近有几种使用注意力进行长序时间建模的2D CNN方法[92,122,132,133]。STM[92]提出了一个时空通道模块来表示时空特征,以及一个运动通道模块来有效地编码运动特征。TEA[122]与STM类似,但受SENet[81]的启发,TEA使用运动特征重新校准时空特征以增强运动模式。具体来说,TEA具有两个部分:运动激励和多时间聚集,第一个部分处理短距离运动建模,第二个部分有效地扩大时间感受野,用于长序时间建模。它们是互补的,而且都是轻量级的,因此TEA能够在将FLOPs保持在与许多2D CNN相当的水平的同时,取得与此前最佳方法相竞争的结果。最近,TEINet[132]也采用注意力机制来增强时间建模。请注意,上述基于注意力的方法与non-local方法[219]有所不同,因为它们使用通道注意力,而non-local方法则使用空间注意力。

Miscellaneous(杂项)

In this section, we are going to show several other directions that are popular for video action recognition in the last decade.

在本节中,我们将展示在过去十年中流行于视频动作识别的其他几个方向。

Trajectory-based methods(基于轨迹的方法)

While CNN-based approaches have demonstrated their superiority and gradually replaced the traditional hand-crafted methods, the traditional local feature pipeline still has its merits which should not be ignored, such as the usage of trajectory.

尽管基于CNN的方法已经证明了其优越性,并逐步取代了传统的手工提取特征方法,但传统的局部特征途径仍然具有其不可忽视的优点,例如使用基于轨迹的方法。

Inspired by the good performance of trajectory-based methods [210], Wang et al. [214] proposed to conduct trajectory-constrained pooling to aggregate deep convolutional features into effective descriptors, which they term TDD. Here, a trajectory is defined as a path tracking down pixels in the temporal dimension. This new video representation shares the merits of both hand-crafted features and deep-learned features, and became one of the top performers on both UCF101 and HMDB51 datasets in the year 2015. Concurrently, Lan et al. [113] incorporated both Independent Subspace Analysis (ISA) and dense trajectories into the standard two-stream networks, and show the complementarity between data-independent and data-driven approaches. Instead of treating CNNs as a fixed feature extractor, Zhao et al. [268] proposed trajectory convolution to learn features along the temporal dimension with the help of trajectories.

受到基于轨迹的方法[210]的良好性能的启发,Wang等人[214]提出进行轨迹约束池化,以将深度卷积特征聚合为有效的描述子,称之为TDD。在此,轨迹被定义为在时间维度上追踪像素的路径。这种新的视频表示方法同时具有手工提取特征和深度学习特征的优点,并在2015年成为UCF101和HMDB51数据集上表现最好的方法之一。同时,Lan等人[113]将Independent Subspace Analysis(ISA)和密集轨迹法合并到标准的双流网络中,并显示出了数据独立方法和数据驱动方法之间的互补性。不同于将CNN视为固定的特征提取器,Zhao等人[268]提出了轨迹卷积,以借助轨迹学习时间维度上的特征。

Rank pooling(等级池化)

There is another way to model temporal information inside a video, termed rank pooling (a.k.a learning-to-rank). The seminal work in this line starts from VideoDarwin [53], that uses a ranking machine to learn the evolution of the appearance over time and returns a ranking function. The ranking function should be able to order the frames of a video temporally, thus they use the parameters of this ranking function as a new video representation. VideoDarwin [53] is not a deep learning based method, but achieves comparable performance and efficiency.

还有另一种在视频中建立时序信息模型的方法,称为rank pooling(也称为learning-to-rank)。该系列中的开创性工作始于VideoDarwin[53],它使用排名机器来学习随时间变化的表现并返回排名函数。排名函数应该能够在时间上对视频帧进行排序,因此它们使用此排序功能的参数作为新的视频表现形式。VideoDarwin[53]并不是基于深度学习的方法,但是可以实现可比的性能和效率。

To adapt rank pooling to deep learning, Fernando [54] introduces a differentiable rank pooling layer to achieve end-to-end feature learning. Following this direction, Bilen et al. [9] apply rank pooling on the raw image pixels of a video producing a single RGB image per video, termed dynamic images. Another concurrent work by Fernando [51] extends rank pooling to hierarchical rank pooling by stacking multiple levels of temporal encoding. Finally, [22] propose a generalization of the original ranking formulation [53] using subspace representations and show that it leads to significantly better representation of the dynamic evolution of actions, while being computationally cheap.

为了使rank pooling适应深度学习,Fernando[54]引入了可微分的rank pooling层来实现端到端的特征学习。沿着这个方向,Bilen等人[9]在视频的原始图像像素上应用rank pooling,每个视频产生单张RGB图像,称为动态图像(dynamic images)。Fernando[51]的另一项同期工作通过堆叠多个级别的时间编码,将rank pooling扩展为分层rank pooling。最后,[22]提出了使用子空间表示法对原始排名公式[53]的推广,并表明它可以显著更好地表示动作的动态演变,且计算量小。
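
To give a flavor of rank pooling, the sketch below follows the spirit of VideoDarwin: fit a linear function whose scores increase with frame order and keep its parameters as the video descriptor. A plain least-squares regression onto frame indices stands in for the ranking machine here, which is a simplification of the original formulation.

```python
import numpy as np

def rank_pool(frame_feats: np.ndarray) -> np.ndarray:
    """Rank-pooling-style video descriptor (simplified sketch).

    frame_feats: (T, D) per-frame features. Frames are smoothed with a
    running mean over time, then a parameter vector w is fit so that
    w @ v_t roughly increases with the frame index t; w is returned as
    the representation of the whole video.
    """
    t, _ = frame_feats.shape
    smoothed = np.cumsum(frame_feats, axis=0) / np.arange(1, t + 1)[:, None]
    targets = np.arange(1, t + 1, dtype=np.float64)    # desired temporal order
    w, *_ = np.linalg.lstsq(smoothed, targets, rcond=None)
    return w                                           # shape (D,)

descriptor = rank_pool(np.random.randn(30, 512))
```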

Compressed video action recognition(压缩视频动作识别)

Most video action recognition approaches use raw videos (or decoded video frames) as input. However, there are several drawbacks of using raw videos, such as the huge amount of data and high temporal redundancy. Video compression methods usually store one frame by reusing contents from another frame (i.e., I-frame) and only store the difference (i.e., P-frames and B-frames) due to the fact that adjacent frames are similar. Here, the I-frame is the original RGB video frame, and P-frames and B-frames include the motion vector and residual, which are used to store the difference. Motivated by the developments in the video compression domain, researchers started to adopt compressed video representations as input to train effective video models.

大多数视频动作识别方法使用原始视频(或解码的视频帧)作为输入。但是,使用原始视频存在一些缺点,例如,大量的数据和较高的时间冗余。视频压缩方法通常通过重用来自另一帧(即,I帧)的内容来存储一帧,并且由于相邻帧是相似的事实而仅存储差异(即,P帧和B帧)。在此,I帧是原始的RGB视频帧,P帧和B帧包含运动矢量和残差,用于存储差异。受视频压缩领域发展的推动,研究人员开始采用压缩视频作为输入,以训练有效的视频模型。

Since the motion vector has a coarse structure and may contain inaccurate movements, Zhang et al. [256] adopted knowledge distillation to help the motion-vector-based temporal stream mimic the optical-flow-based temporal stream. However, their approach required extracting and processing each frame. They obtained comparable recognition accuracy with standard two-stream networks, but were 27 times faster. Wu et al. [231] used a heavyweight CNN for the I-frames and lightweight CNNs for the P-frames. This required that the motion vectors and residuals for each P-frame be referred back to the I-frame by accumulation. DMC-Net [185] is a follow-up work to [231] using an adversarial loss. It adopts a lightweight generator network to help the motion vector capture fine motion details, instead of knowledge distillation as in [256]. A recent paper, SCSampler [106], also adopts compressed video representations for sampling salient clips, and we will discuss it in section 3.5.4. As yet none of the compressed approaches can deal with B-frames due to the added complexity.

由于运动矢量具有粗糙的结构并且可能包含不准确的运动,因此Zhang等人[256]采用知识蒸馏来帮助基于运动矢量的时间流模仿基于光流的时间流。但是,他们的方法需要提取和处理每个帧。他们获得了与标准双流网络相当的识别精度,但速度提高了27倍。Wu等人[231]在I帧上使用了重量级的CNN,在P帧上使用了轻量级的CNN。这就要求每个P帧的运动矢量和残差通过累加回溯到I帧。DMC-Net[185]是[231]的后续工作,使用了对抗损失(adversarial loss)。它采用轻量级的生成网络来帮助运动矢量捕获精细的运动细节,而不是像[256]中那样使用知识蒸馏。最近的论文SCSampler[106]也采用压缩视频表示来采样显著片段,我们将在3.5.4节中讨论它。迄今为止,由于复杂性的增加,没有一种压缩方法可以处理B帧。

Frame/Clip sampling(帧/片采样)

Most of the aforementioned deep learning methods treat every video frame/clip equally for the final prediction. However, discriminative actions only happen in a few moments, and most of the other video content is irrelevant or weakly related to the labeled action category. There are several drawbacks of this paradigm. First, training with a large proportion of irrelevant video frames may hurt performance. Second, such uniform sampling is not efficient during inference.

上述大多数深度学习方法在最终预测中对每个视频帧/片段一视同仁。但是,具有判别性的动作只发生在少数时刻,其他大多数视频内容与标注的动作类别无关或仅弱相关。这种范式有几个缺点。首先,使用大量不相关的视频帧进行训练可能会损害性能。其次,这种均匀采样在推理过程中效率不高。

Partially inspired by how humans understand a video using just a few glimpses over the entire video [251], many methods were proposed to sample the most informative video frames/clips, both to improve performance and to make the model more efficient during inference.

受到人类只需对整个视频看几眼就能理解视频[251]的部分启发,许多方法被提出来采样信息量最大的视频帧/片段,以提高性能并使模型在推理过程中更加高效。

KVM [277] is one of the first attempts to propose an end-to-end framework to simultaneously identify key volumes and do action classification. Later, [98] introduced AdaScan, which predicts the importance score of each video frame in an online fashion, which they term adaptive temporal pooling. Both of these methods achieve improved performance, but they still adopt the standard evaluation scheme which does not show efficiency gains during inference. Recent approaches focus more on efficiency [41, 234, 8, 106]. AdaFrame [234] follows [251, 98] but uses a reinforcement learning based approach to search for more informative video clips. Concurrently, [8] uses a teacher-student framework, i.e., a see-it-all teacher can be used to train a compute-efficient see-very-little student. They demonstrate that the efficient student network can reduce the inference time by 30% and the number of FLOPs by approximately 90% with negligible performance drop. Recently, SCSampler [106] trains a lightweight network to sample the most salient video clips based on compressed video representations, and achieves state-of-the-art performance on both the Kinetics400 and Sports1M datasets. They also empirically show that such saliency-based sampling is not only efficient, but also enjoys higher accuracy than using all the video frames.

KVM[277]是最早提出端到端框架、同时识别关键卷(key volumes)并进行动作分类的尝试之一。后来,[98]提出了AdaScan,它以在线方式预测每个视频帧的重要性得分,他们将其称为自适应时间池化。这两种方法均提升了性能,但是它们仍然采用标准的评估方案,并未体现推理效率的提升。最近的方法更关注效率[41、234、8、106]。AdaFrame[234]沿袭了[251,98]的思路,但使用基于强化学习的方法来搜索信息量更大的视频片段。同时,[8]使用teacher-student框架,即用一个"看全部"(see-it-all)的教师网络来训练一个计算高效、只"看很少"(see-very-little)的学生网络。他们证明,高效的学生网络可以将推理时间减少30%,将FLOPs减少约90%,而性能下降可以忽略不计。最近,SCSampler[106]训练了一个轻量级的网络,基于压缩视频表示对最显著的视频片段进行采样,并在Kinetics400和Sports1M数据集上均实现了最先进的性能。他们还通过实验表明,这种基于显著性的采样不仅效率高,而且比使用所有视频帧具有更高的准确性。
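
The common recipe shared by these samplers can be sketched as: score every candidate clip with a cheap model, then run the heavy classifier only on the top-k clips. The scorer, classifier, and k below are illustrative placeholders rather than any particular paper's design.

```python
import torch
import torch.nn as nn

def classify_with_salient_clips(clips, scorer, classifier, k=3):
    """clips: (num_clips, C, T, H, W) candidate clips from one video."""
    with torch.no_grad():
        saliency = scorer(clips).squeeze(-1)       # one score per clip
    top_idx = saliency.topk(k).indices             # keep the k most salient
    logits = classifier(clips[top_idx])            # heavy model on k clips only
    return logits.mean(dim=0)                      # average their predictions

# Toy usage with stand-in networks (global average pooling + a linear head).
class PoolHead(nn.Module):
    def __init__(self, out_dim):
        super().__init__()
        self.fc = nn.Linear(3, out_dim)
    def forward(self, x):                          # x: (B, C, T, H, W)
        return self.fc(x.mean(dim=[2, 3, 4]))

pred = classify_with_salient_clips(torch.randn(10, 3, 16, 112, 112),
                                   scorer=PoolHead(1),
                                   classifier=PoolHead(400), k=3)
```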

Visual tempo(视觉速度)

Visual tempo is a concept describing how fast an action goes. Many action classes have different visual tempos. In most cases, the key to distinguishing them is their visual tempo, as they might share high similarities in visual appearance, such as walking, jogging and running [248]. There are several papers exploring different temporal rates (tempos) for improved temporal modeling [273, 147, 82, 281, 45, 248]. Initial attempts usually capture the video tempo through sampling raw videos at multiple rates and constructing an input-level frame pyramid [273, 147, 281]. Recently, SlowFast [45], as we discussed in section 3.3.4, utilizes the characteristics of visual tempo to design a two-pathway network for a better accuracy and efficiency trade-off. CIDC [121] proposed directional temporal modeling along with a local backbone for video temporal modeling. TPN [248] extends the tempo modeling to the feature level and shows consistent improvement over previous approaches.

视觉速度(visual tempo)是描述动作进行快慢的概念。许多动作类别具有不同的视觉速度。在大多数情况下,区分它们的关键是视觉速度,因为它们在视觉外观上可能具有高度相似性,例如步行,慢跑和跑步[248]。有几篇论文探讨了不同的时间速率(速度)以改进时间建模[273,147,82,281,45,248]。最初的尝试通常是通过以多种速率采样原始视频并构建输入级帧金字塔[273、147、281]来捕获视频速度的。最近,如我们在3.3.4节中讨论的,SlowFast[45]利用视觉速度的特性来设计双路径网络,以实现更好的精度和效率的权衡。CIDC[121]提出了定向时间建模以及用于视频时间建模的局部主干网络。TPN[248]将速度建模扩展到了特征级别,并显示出与以前方法相比的持续改进。
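
The input-level frame pyramid used by the early tempo-aware methods can be sketched as sampling the same video at several temporal strides, so that each level sees a different visual tempo; the strides and clip length below are illustrative.

```python
import torch

def frame_pyramid(video: torch.Tensor, strides=(1, 2, 4), num_frames=8):
    """video: (T, C, H, W) decoded frames; returns one clip per stride."""
    clips = []
    for s in strides:
        idx = (torch.arange(num_frames) * s).clamp(max=video.shape[0] - 1)
        clips.append(video[idx])                  # same start, coarser tempo
    return clips

levels = frame_pyramid(torch.randn(64, 3, 224, 224))
print([c.shape for c in levels])                  # three clips of 8 frames
```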

We would like to point out that visual tempo is also widely used in self-supervised video representation learning [6, 247, 16] since it can naturally provide supervision signals to train a deep network. We will discuss more details on self-supervised video representation learning in section 5.13.

我们想指出的是,视觉速度也广泛用于自我监督的视频表示学习中[6,247,16],因为它可以自然地提供监督信号来训练深度网络。我们将在5.13节中讨论有关自我监督视频表示学习的更多详细信息。

Evaluation and Benchmarking(评估和基准测试)

In this section, we compare popular approaches on benchmark datasets. To be specific, we first introduce standard evaluation schemes in section 4.1. Then we divide common benchmarks into three categories: scene-focused (UCF101, HMDB51 and Kinetics400 in section 4.2), motion-focused (Sth-Sth V1 and V2 in section 4.3) and multi-label (Charades in section 4.4). In the end, we present a fair comparison among popular methods in terms of both recognition accuracy and efficiency in section 4.5.

在本节中,我们将在一些基准数据集上比较测试一些流行方法。具体来说,我们首先在4.1节中介绍标准评估模式。然后,我们将常见基准分为三类:以场景为中心的(第4.2节中的UCF101,HMDB51和Kinetics400),以运动为中心的(第4.3节中的Sth-Sth V1和V2)和多标签的(第4.4节中的Charades)。最后,我们将在第4.5节中就识别准确性和效率两方面对流行方法进行公平比较。

Evaluation scheme(评估模式)

During model training, we usually randomly pick a video frame/clip to form mini-batch samples. However, for evaluation, we need a standardized pipeline in order to perform fair comparisons.

在模型训练期间,我们通常会随机选择一个视频帧/剪辑以形成小批量样本。但是,为了进行评估,我们需要一个标准化的管道以便进行公平的比较。

For 2D CNNs, a widely adopted evaluation scheme is to evenly sample 25 frames from each video following [187, 217]. For each frame, we perform ten-crop data augmentation by cropping the 4 corners and 1 center, flipping them horizontally and averaging the prediction scores (before softmax operation) over all crops of the samples, i.e., this means we use 250 frames per video for inference.

对于2D CNN,广泛采用的评估方案(由[187,217]提出)是从每个视频中均匀采样25帧。对于每一帧,我们通过裁剪4个角和1个中心并将它们水平翻转来执行十裁剪(ten-crop)数据增广,并对所有裁剪样本的预测得分(在softmax操作之前)取平均,即每个视频使用250帧进行推理。
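
A sketch of this 25-frame, ten-crop protocol using torchvision's TenCrop transform is given below; the model is a placeholder 2D frame-level classifier and the frames are assumed to be already decoded with the shorter side larger than 224.

```python
import torch
from torchvision import transforms

def evaluate_2d(model, frames):                 # frames: (T, C, 256, 320), say
    """25 frames x 10 crops per frame, prediction scores averaged."""
    idx = torch.linspace(0, frames.shape[0] - 1, 25).long()  # 25 even frames
    ten_crop = transforms.TenCrop(224)          # 4 corners + center, + flips
    scores = []
    with torch.no_grad():
        for f in frames[idx]:
            crops = torch.stack(ten_crop(f))    # (10, C, 224, 224)
            scores.append(model(crops))         # pre-softmax scores per crop
    return torch.cat(scores).mean(dim=0)        # average over 250 views
```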

For 3D CNNs, a widely adopted evaluation scheme termed 30-view strategy is to evenly sample 10 clips from each video following [219]. For each video clip, we perform three-crop data augmentation. To be specific, we scale the shorter spatial side to 256 pixels and take three crops of 256 x 256 to cover the spatial dimensions and average the prediction scores.

对于3D CNN,一种广泛采用的评估方案称为30视图(30-view)策略(由[219]提出),即从每个视频中均匀采样10个剪辑。对于每个视频剪辑,我们执行三裁剪(three-crop)数据增广。具体来说,我们将较短的空间边缩放到256像素,并采用三个256 x 256的裁剪块覆盖空间维度,然后对预测得分取平均。
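
The 30-view protocol can be sketched in the same way: 10 clips sampled evenly in time, three 256 x 256 crops taken along the longer spatial side, and all 30 predictions averaged. The helper below assumes the frames were already resized so the shorter side is exactly 256 and the video has at least 32 frames; the 32-frame clip length is an assumption.

```python
import torch

def thirty_view_inference(model, video):       # video: (C, T, 256, W >= 256)
    """10 temporal clips x 3 spatial crops, averaged (the '30-view' scheme)."""
    c, t, h, w = video.shape
    starts = torch.linspace(0, t - 32, 10).long()      # 10 clips of 32 frames
    offsets = [0, (w - 256) // 2, w - 256]             # left, center, right
    scores = []
    with torch.no_grad():
        for s in starts.tolist():
            clip = video[:, s:s + 32]
            for o in offsets:
                view = clip[:, :, :, o:o + 256].unsqueeze(0)  # (1,C,32,256,256)
                scores.append(model(view))
    return torch.stack(scores).mean(dim=0)             # average the 30 views
```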

However, the evaluation schemes are not fixed. They are evolving and adapting to new network architectures and different datasets. For example, TSM [128] only uses two clips per video for small-sized datasets [190, 109], and performs three-crop data augmentation for each clip despite its being a 2D CNN. We will mention any deviations from the standard evaluation pipeline.

但是,评估方案不是固定的。它们在不断演变,以适应新的网络体系结构和不同的数据集。例如,对于小型数据集[190、109],TSM[128]每个视频仅使用两个剪辑,并且尽管其是2D CNN,仍对每个剪辑执行三裁剪数据增广。我们将标出与标准评估方法的任何差异。

In terms of evaluation metric, we report accuracy for single-label action recognition, and mAP (mean average precision) for multi-label action recognition.

在评估指标方面,我们报告了单标签动作识别的准确率,以及多标签动作识别的mAP(总平均精度)。
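
For the multi-label case, mAP is just the mean over classes of the per-class average precision; a sketch with scikit-learn, assuming a binary label matrix and a score matrix of shape (num_videos, num_classes), is:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(labels: np.ndarray, scores: np.ndarray) -> float:
    """labels: binary (num_videos, num_classes); scores: same shape, real-valued."""
    per_class_ap = [average_precision_score(labels[:, c], scores[:, c])
                    for c in range(labels.shape[1])]
    return float(np.mean(per_class_ap))

mAP = mean_average_precision(np.random.randint(0, 2, (100, 157)),
                             np.random.rand(100, 157))
```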

Scene-focused datasets(以场景为中心的数据集)

Here, we compare recent state-of-the-art approaches on scene-focused datasets: UCF101, HMDB51 and Kinetics400. The reason we call them scene-focused is because most action videos in these datasets are short, and can be recognized by static scene appearance alone as shown in Figure 4.

在这里,我们比较了针对以场景为中心的数据集:UCF101,HMDB51和Kinetics400上的SOTA研究成果。之所以将它们称为“以场景为中心”,是因为这些数据集中的大多数动作视频都很短,并且仅通过静态场景外观就可以识别出来,如图4所示。

Table 2. Results of widely adopted methods on three scene-focused datasets. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet, S: Sports1M and K: Kinetics400. NL represents non local.
表2:在三个以场景为中心的数据集上被广泛采用的方法的结果。Pre-train指对模型进行预训练的数据集。I:ImageNet,S:Sports1M,K:Kinetics400。NL代表non local。

Method Pre-train Flow Backbone Venue UCF101 HMDB51 Kinetics400
DeepVideo[99] I - AlexNet CVPR 2014 65.4 - -
Two-stream [187] I X CNN-M NeurIPS 2014 88.0 59.4 -
LRCN[37] I X CNN-M CVPR 2015 82.3 - -
TDD[214] I X CNN-M CVPR 2015 90.3 63.2 -
Fusion[50] I X VGG16 CVPR 2016 92.5 65.4 -
TSN[218] I X BN-Inception ECCV 2016 94.0 68.5 73.9
TLE[36] I X BN-Inception CVPR 2017 95.6 71.1 -
___ ___ ___ ___ ___ ___ ___ ___
C3D[202] S - VGG16-like ICCV 2015 82.3 56.8 59.5
I3D[14] I,K - BN-Inception-like CVPR 2017 95.6 74.8 71.1
P3D[169] S - ResNet50-like ICCV 2017 88.6 - 71.6
ResNet3D[74] K - ResNeXt101-like CVPR 2018 94.5 70.2 65.1
R2+1D[204] K - ResNet34-like CVPR 2018 96.8 74.5 72.0
NL I3D[219] I - ResNet101-like CVPR 2018 - - 77.7
S3D[239] I,K - BN-Inception-like ECCV 2018 96.8 75.9 74.7
SlowFast[45] - - ResNet101-NL-like ICCV 2019 - - 79.8
X3D-XXL[44] - - ResNet-like CVPR 2020 - - 80.4
TPN[248] - - ResNet101-like CVPR 2020 - - 78.9
CIDC[121] - - ResNet50-like ECCV 2020 97.9 75.2 75.5
___ ___ ___ ___ ___ ___ ___ ___
Hidden TSN[278] I - BN-Inception ACCV 2018 93.2 66.8 72.8
OFF[197] I - BN-Inception CVPR 2018 96.0 74.2 -
TSM[128] I - ResNet50 ICCV 2019 95.9 73.5 74.1
STM[92] I,K - ResNet50-like ICCV 2019 96.2 72.2 73.7
TEINet[132] I,K - ResNet50-like AAAI 2020 96.7 72.1 76.2
TEA[122] I,K - ResNet50-like CVPR 2020 96.9 73.3 76.1
MSNet[110] I,K - ResNet50-like ECCV 2020 - 77.4 76.4

Following the chronology, we first present results for early attempts of using deep learning and the two-stream networks at the top of Table 2. We make several observations. First, without motion/temporal modeling, the performance of DeepVideo [99] is inferior to all other approaches. Second, it is helpful to transfer knowledge from traditional methods (non-CNN-based) to deep learning. For example, TDD [214] uses trajectory pooling to extract motion-aware CNN features. TLE [36] embeds global feature encoding, which is an important step in traditional video action recognition pipeline, into a deep network.

按照时间顺序,我们首先在表2的顶部给出使用深度学习的早期尝试和双流网络的结果,并得出以下几点观察。首先,在没有运动/时间建模的情况下,DeepVideo[99]的性能不如所有其他方法。其次,将知识从传统方法(基于非CNN的方法)迁移到深度学习是有帮助的。例如,TDD[214]使用轨迹池化提取运动感知的CNN特征。TLE[36]将全局特征编码(传统视频动作识别流程中的重要一步)嵌入到深度网络中。

We then compare 3D CNNs based approaches in the middle of Table 2. Despite training on a large corpus of videos, C3D [202] performs inferior to concurrent work [187, 214, 217], possibly due to difficulties in optimization of 3D kernels. Motivated by this, several papers - I3D [14], P3D [169], R2+1D [204] and S3D [239] factorize 3D convolution filters to 2D spatial kernels and 1D temporal kernels to ease the training. In addition, I3D introduces an inflation strategy to avoid training from scratch by bootstrapping the 3D model weights from well-trained 2D networks. By using these techniques, they achieve comparable performance to the best two-stream network methods [36] without the need for optical flow. Furthermore, recent 3D models obtain even higher accuracy, by using more training samples [203], additional pathways [45], or architecture search [44].

然后,我们在表2的中间比较了基于3D CNN的方法。尽管对大量视频进行了训练,但C3D[202]的性能不及一些同时的工作[187、214、217],这可能是由于3D卷积核难以优化。因此,几篇论文-I3D[14],P3D[169],R2+1D[204]和S3D[239]将3D卷积核分解为2D空间核和1D时间核,以简化训练。此外,I3D引入了一种膨胀策略,通过将来自训练有素的2D网络的权重导入3D模型来避免从头开始进行训练。通过使用这些技术,它们不需要光流就可以达到与最佳双流网络方法相当的性能[36]。此外,最近的3D模型通过使用更多的训练样本[203],更多途径[45]或体系结构搜索[44]获得了更高的准确性。

Finally, we show recent efficient models in the bottom of Table 2. We can see that these methods are able to achieve higher recognition accuracy than two-stream networks (top), and comparable performance to 3D CNNs (middle). Since they are 2D CNNs and do not use optical flow, these methods are efficient in terms of both training and inference. Most of them are real-time approaches, and some can do online video action recognition [128]. We believe 2D CNN plus temporal modeling is a promising direction due to the need of efficiency. Here, temporal modeling could be attention based, flow based or 3D kernel based.

最后,我们在表2的底部显示了最近有效的模型。我们可以看到,这些方法能够实现比两流网络更高的识别精度(顶部),并且具有与3D CNN相当的性能(中间)。由于它们是2D CNN,并且不使用光流,因此这些方法在训练和推理方面都是有效的。其中大多数是实时方法,有些可以进行在线视频动作识别[128]。由于效率的需要,我们认为2D CNN和时序建模是一个有前途的方向。在这里,时序建模可以是基于注意力,基于流或基于3D卷积的。

Motion-focused datasets(以运动为中心的数据集)

In this section, we compare the recent state-of-the-art approaches on the 20BN-Something-Something (Sth-Sth) datasets. We report top-1 accuracy on both V1 and V2. The Sth-Sth datasets focus on humans performing basic actions with daily objects. Different from scene-focused datasets, the background scene in the Sth-Sth datasets contributes little to the final action class prediction. In addition, there are classes such as “Pushing something from left to right” and “Pushing something from right to left”, which require strong motion reasoning.

在本节中,我们将在20BN-Something-Something(Sth-Sth)数据集上对最新的技术进行比较。我们同时报告了V1和V2的top1准确性。Sth-Sth数据集关注人类对日常物体执行基本动作的情况。与以场景为重点的数据集不同,Sth-Sth数据集中的背景场景对最终动作类预测的贡献很小。另外,还有诸如“从左向右推动”和“从右向左推动”之类的类,它们需要强大的运动推理能力。

Table 3. Results of widely adopted methods on Something-Something V1 and V2 datasets. We only report numbers without using optical flow. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet and K: Kinetics400. Views means the number of temporal clips multiplied by the number of spatial crops, e.g., 30 means 10 temporal clips with 3 spatial crops per clip.
表3:在Something-Something V1和V2数据集中采用众多方法的结果。我们仅报告不使用光流的结果。Pre-train指对模型进行预训练的数据集。I:ImageNet,K:Kinetics400。View表示时间片段的数量乘以空间裁剪,例如30表示10个时间片段,每个片段具有3个空间裁剪。

Method Pre-train Backbone Frames x Views Venue V1 Top1 V2 Top1
TSN[218] I BN-Inception 8 x 1 ECCV 2016 19.7 -
I3D[14] I,K ResNet50-like 32 x 6 CVPR 2017 41.6 -
NL I3D[219] I,K ResNet50-like 32 x 6 CVPR 2018 44.4 -
NL I3D + GCN[220] I,K ResNet50-like 32 x 6 ECCV 2018 46.1 -
ECO[283] K BNIncep+ResNet18 16 x 1 ECCV 2018 41.4 -
TRN[269] I BN-Inception 8 x 1 ECCV 2018 42.0 48.8
STM[92] I ResNet50-like 8 x 30 ICCV 2019 49.2 -
STM[92] I ResNet50-like 16 x 30 ICCV 2019 50.7 -
TSM[128] K ResNet50 8 x 1 ICCV 2019 45.6 59.1
TSM[128] K ResNet50 16 x 1 ICCV 2019 47.2 63.4
bLVNet-TAM[43] I BLNet-like 8 x 2 NeurIPS 2019 46.4 59.1
bLVNet-TAM[43] I BLNet-like 16 x 2 NeurIPS 2019 48.4 61.7
TEA[122] I ResNet50-like 8 x 1 CVPR 2020 48.9 -
TEA[122] I ResNet50-like 16 x 1 CVPR 2020 51.9 -
TSM + TPN[248] K ResNet50-like 8 x 1 CVPR 2020 49.0 62.0
MSNet[110] I ResNet50-like 8 x 1 ECCV 2020 50.9 63.0
MSNet[110] I ResNet50-like 16 x 1 ECCV 2020 52.1 64.7
TIN[182] K ResNet50-like 16 x 1 AAAI 2020 47.0 60.1
TEINet[132] I ResNet50-like 8 x 1 AAAI 2020 47.4 61.3
TEINet[132] I ResNet50-like 16 x 1 AAAI 2020 49.9 62.1

By comparing the previous work in Table 3, we observe that using longer input (e.g., 16 frames) is generally better. Moreover, methods that focus on temporal modeling [128, 122, 92] work better than stacked 3D kernels [14]. For example, TSM [128], TEA [122] and MSNet [110] insert an explicit temporal reasoning module into 2D ResNet backbones and achieves state-of-the-art results. This implies that the Sth-Sth dataset needs strong temporal motion reasoning as well as spatial semantics information.

通过比较表3中的先前工作,我们观察到使用更长的输入(例如16帧)通常更好。此外,专注于时序建模的方法[128、122、92]比堆叠的3D卷积[14]可以更好地工作。例如,TSM[128],TEA[122]和MSNet[110]将显式的时序推理模块插入2D ResNet后端模型中,并获得最新的结果。这意味着Sth-Sth数据集需要强大的时间运动推理以及空间语义信息。

Multi-label datasets(多标签数据集)

In this section, we first compare the recent state-of-the-art approaches on the Charades dataset [186] and then list some recent work that uses assembled models or additional object information on Charades. Comparing the previous work in Table 4, we make the following observations. First, 3D models [229, 45] generally perform better than 2D models [186, 231] and 2D models with optical flow input. This indicates that spatio-temporal reasoning is critical for long-term complex concurrent action understanding. Second, longer input helps with the recognition [229], probably because some actions require long-term features to recognize. Third, models with strong backbones that are pre-trained on larger datasets generally have better performance [45]. This is because Charades is a medium-scale dataset which doesn't contain enough diversity to train a deep model.

在本节中,我们首先比较Charades数据集[186]上的SOTA成果,然后列出一些在Charades上使用组合模型(assembled models)或额外物体信息的最新工作。比较表4中的先前工作,我们得出以下观察结果。首先,3D模型[229,45]通常比2D模型[186,231]以及有光流输入的2D模型有更好的表现。这表明时空推理对于长期复杂的并发动作理解至关重要。其次,较长的输入有助于识别[229],这可能是因为某些动作需要长期特征才能识别。第三,采用在更大数据集上预训练的强后端模型通常有更好的表现[45]。这是因为Charades是一个中等规模的数据集,没有足够的多样性来训练深度模型。

Table 4. Charades evaluation using mAP, calculated using the officially provided script. NL: non-local network. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet, K400: Kinetics400 and K600: Kinetics600.
表4:使用mAP进行的Charades评估,使用正式提供的脚本进行计算。NL:non-local网络。Pre-train指示对模型进行预训练的数据集。I:ImageNet,K400:Kinetics400和K600:Kinetics600。

Method Extra-information Backbone Pre-train Venue mAP
2D CNN[186] - AlexNet I ECCV 2016 11.2
Two-stream[186] flow VGG16 I ECCV 2016 22.4
ActionVLAD[63] - VGG16 I CVPR 2017 21.0
CoViAR[231] - ResNet50-like - CVPR 2018 21.9
MultiScale TRN[269] - BN-Inception-like I ECCV 2018 25.2
___ ___ ___ ___ ___ ___
I3D[14] - BN-Inception-like K400 CVPR 2017 32.9
STRG[220] - ResNet101-NL-like K400 ECCV 2018 39.7
LFB[229] - ResNet101-NL-like K400 CVPR 2019 42.5
TC[84] - ResNet101-NL-like K400 ICCV 2019 41.1
HAF[212] IDT + flow BN-Inception-like K400 ICCV 2019 43.1
SlowFast[45] - ResNet-like K400 ICCV 2019 42.5
SlowFast[45] - ResNet-like K600 ICCV 2019 45.2
___ ___ ___ ___ ___ ___
Action-Genome[90] person + object ResNet-like - CVPR 2020 60.1
AssembleNet++[177] flow + object ResNet-like - ECCV 2020 59.9

Recently, researchers explored an alternative direction for complex concurrent action recognition by assembling models [177] or providing additional human-object interaction information [90]. These papers significantly outperformed previous literature that only fine-tunes a single model on Charades. It demonstrates that exploring spatio-temporal human-object interactions and finding a way to avoid overfitting are the keys to concurrent action understanding.

最近,研究人员通过assembling model[177]或提供其他人对物体的交互信息[90]探索了复杂的并发动作识别的替代方向。这些论文大大优于以前的文献,后者仅对Charades上的单个模型进行了微调。它表明,探索时空人与物体之间的相互作用并找到避免过度拟合的方法是同时进行动作理解的关键。

Speed comparison(速度比较)

To deploy a model in real-life applications, we usually need to know whether it meets the speed requirement before we can proceed. In this section, we evaluate the approaches mentioned above to perform a thorough comparison in terms of (1) number of parameters, (2) FLOPS, (3) latency and (4) frames per second.

要在现实生活中的应用程序中部署模型,我们通常需要先知道模型是否满足速度要求,然后才能继续进行。在本节中,我们按以下指标(1)参数数量,(2)FLOPS,(3)延迟和(4)每秒帧数,对上述方法进行全面比较。

We present the results in Table 5. Here, we use the models in the GluonCV video action recognition model zoo since all these models are trained using the same data, the same data augmentation strategy and under the same 30-view evaluation scheme, thus ensuring a fair comparison. All the timings are done on a single Tesla V100 GPU with 105 repeated runs, where the first 5 runs are ignored as warm-up. We always use a batch size of 1. In terms of model input, we use the settings suggested in the original papers.

Table 5. Comparison on both efficiency and accuracy. Top: 2D models and bottom: 3D models. FLOPS means floating point operations per second. FPS indicates how many video frames can the model process per second. Latency is the actual running time to complete one network forward given the input. Acc is the top-1 accuracy on Kinetics400 dataset. TSN, I3D, I3D-slow families are pretrained on ImageNet. R2+1D, SlowFast and TPN families are trained from scratch.
表5:在效率和准确性两方面的比较。上方:2D模型,下方:3D模型。FLOPS表示每秒的浮点运算。FPS表示模型每秒可以处理多少个视频帧。延迟是在给定输入的情况下完成一次网络前向传播的实际运行时间。Acc是Kinetics400数据集上的top-1准确率。TSN,I3D,I3D-slow系列在ImageNet上进行了预训练。R2+1D,SlowFast和TPN系列是从头开始训练的。

Model Input FLOPS(G) # of params(M) FPS Latency(s) Acc(%)
TSN-ResNet18[218] 3x224x224 3.671 21.49 151.96 0.0066 69.85
TSN-ResNet34[218] 3x224x224 1.819 11.382 264.01 0.0038 66.73
TSN-ResNet50[218] 3x224x224 4.110 24.328 114.05 0.0088 70.88
TSN-ResNet101[218] 3x224x224 7.833 43.320 59.56 0.0167 72.25
TSN-ResNet152[218] 3x224x224 11.558 58.963 36.93 0.0271 72.45
___ ___ ___ ___ ___ ___ ___
I3D-ResNet50[14] 3x32x224x224 33.275 28.863 1719.50 0.0372 74.87
I3D-ResNet101[14] 3x32x224x224 51.864 52.574 1137.74 0.0563 75.10
I3D-ResNet50-NL[219] 3x32x224x224 47.737 38.069 1403.16 0.0456 75.17
I3D-ResNet101-NL[219] 3x32x224x224 66.326 61.780 999.94 0.0640 75.81
R2+1D-ResNet18[204] 3x16x112x112 40.645 31.505 804.31 0.0398 71.72
R2+1D-ResNet34[204] 3x16x112x112 75.400 61.832 503.17 0.0636 72.63
R2+1D-ResNet50[204] 3x16x112x112 65.543 53.950 667.06 0.0480 74.92
R2+1D-ResNet152*[204] 3x32x112x112 252.900 118.227 546.19 0.1172 81.34
CSN-ResNet152*[203] 3x32x224x224 74.758 29.704 435.77 0.1469 83.18
I3D-Slow-ResNet50[45] 3x8x224x224 41.919 32.454 1702.60 0.0376 74.41
I3D-Slow-ResNet50[45] 3x16x224x224 83.838 32.454 1406.00 0.0455 76.36
I3D-Slow-ResNet50[45] 3x32x224x224 167.675 32.454 860.74 0.0744 77.89
I3D-Slow-ResNet101[45] 3x8x224x224 85.675 60.359 1114.22 0.0574 76.15
I3D-Slow-ResNet101[45] 3x16x224x224 171.348 60.359 876.20 0.0730 77.11
I3D-Slow-ResNet101[45] 3x32x224x224 342.696 60.359 541.16 0.1183 78.57
SlowFast-ResNet50-4x16[45] 3x32x224x224 27.820 34.480 1396.45 0.0458 75.25
SlowFast-ResNet50-8x8[45] 3x32x224x224 50.583 34.566 1297.24 0.0493 76.66
SlowFast-ResNet101-8x8[45] 3x32x224x224 96.794 62.827 889.62 0.0719 76.95
TPN-ResNet50[248] 3x8x224x224 50.457 71.800 1350.39 0.0474 77.04
TPN-ResNet50[248] 3x16x224x224 99.929 71.800 1128.39 0.0567 77.33
TPN-ResNet50[248] 3x32x224x224 198.874 71.800 716.89 0.0893 78.90
TPN-ResNet101[248] 3x8x224x224 94.366 99.705 942.61 0.0679 78.10
TPN-ResNet101[248] 3x16x224x224 187.594 99.705 754.00 0.0849 79.39
TPN-ResNet101[248] 3x32x224x224 374.048 99.705 479.77 0.1334 79.70

我们将结果展示在表5中。在这里,我们使用GluonCV视频动作识别模型库中的模型,因为所有这些模型都是使用相同的数据,相同的数据增广策略和相同的30视图评估方案训练的,因此比较是公平的。所有计时均在单个Tesla V100 GPU上进行,共重复运行105次,其中前5次作为预热被忽略。我们始终使用大小为1的批次。在模型输入方面,我们使用原始论文中建议的设置。
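
A sketch of this timing protocol (batch size 1, warm-up runs discarded, GPU synchronization around each forward pass) might look like the following; the model, input shape, and frames-per-clip value are placeholders.

```python
import time
import torch

def benchmark(model, input_shape, runs=105, warmup=5, frames_per_clip=32):
    """Return (average latency in seconds, frames processed per second)."""
    model = model.cuda().eval()
    x = torch.randn(1, *input_shape, device="cuda")    # batch size 1
    timings = []
    with torch.no_grad():
        for i in range(runs):
            torch.cuda.synchronize()
            start = time.time()
            model(x)
            torch.cuda.synchronize()                   # wait for the GPU
            if i >= warmup:                            # drop warm-up runs
                timings.append(time.time() - start)
    latency = sum(timings) / len(timings)
    return latency, frames_per_clip / latency
```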

As we can see in Table 5, if we compare latency, 2D models are much faster than all other 3D variants. This is probably why most real-world video applications still adopt frame-wise methods. Secondly, as mentioned in [170, 259], FLOPS is not strongly correlated with the actual inference time (i.e., latency). Third, if comparing performance, most 3D models give similar accuracy around 75%, but pretraining with a larger dataset can significantly boost the performance. This indicates the importance of training data and partially suggests that self-supervised pre-training might be a promising way to further improve existing methods.

如表5所示,如果比较延迟,则2D模型要比所有其他3D变体快得多。这可能就是为什么大多数现实世界的视频应用程序仍采用逐帧方法的原因。其次,如[170,259]中所述,FLOPS与实际推理时间(即延迟)没有强烈的相关性。第三,如果比较性能,大多数3D模型可提供大约75%的相近精度,但是使用更大的数据集进行预训练可以显著提高性能。这表明了训练数据的重要性,并在一定程度上表明自我监督预训练可能是进一步改进现有方法的有前途的方式。

Discussion and Future Work(讨论与未来工作)

We have surveyed more than 200 deep learning based methods for video action recognition since year 2014. Despite the performance on benchmark datasets plateauing, there are many active and promising directions in this task worth exploring.

我们调查了自2014年以来的200多种基于深度学习的视频动作识别方法。尽管在基准数据集上的性能已趋于饱和,但该任务中仍有许多活跃且有前景的方向值得探索。

Analysis and insights(分析与见解)

More and more methods have been developed to improve video action recognition; at the same time, there are some papers summarizing these methods and providing analysis and insights. Huang et al. [82] perform an explicit analysis of the effect of temporal information for video understanding. They try to answer the question “how important is the motion in the video for recognizing the action”. Feichtenhofer et al. [48, 49] provide an amazing visualization of what two-stream models have learned in order to understand how these deep representations work and what they are capturing. Li et al. [124] introduce the concept of representation bias of a dataset, and find that current datasets are biased towards static representations. Experiments on such biased datasets may lead to erroneous conclusions, which is indeed a big problem that limits the development of video action recognition. Recently, Piergiovanni et al. introduced the AViD [165] dataset to cope with data bias by collecting data from diverse groups of people. These papers provide great insights to help fellow researchers to understand the challenges, open problems and where the next breakthrough might reside.

越来越多的方法被开发出来改善视频动作识别,同时,有一些论文总结了这些方法并提供了分析和见解。黄等人[82]对时间信息对视频理解的效果进行了明确的分析。他们试图回答“视频中的动作对于识别动作有多重要”的问题。Feichtenhofer等人[48,49]提供了一个令人惊奇的可视化视图,说明了双流模型学到了什么,以便了解这些深层表示的工作原理以及它们所捕获的内容。Li等人[124]介绍了一个概念,数据集的表示偏差,并发现当前的数据集偏向静态表示。在这种有偏见的数据集上进行实验可能会得出错误的结论,这确实是一个很大的问题,限制了视频动作识别的发展。最近,Piergiovanni等人引入了AViD[165]数据集,以通过从不同人群中收集数据来应对数据偏差。这些论文提供了深刻的见解,可以帮助研究人员了解挑战,未解决的问题以及下一个可能突破的地方。

Data augmentation(数据增广)

Numerous data augmentation methods have been proposed in image recognition domain, such as mixup [258], cutout [31], CutMix [254], AutoAugment [27], FastAutoAug [126], etc. However, video action recognition still adopts basic data augmentation techniques introduced before year 2015 [217, 188], including random resizing, random cropping and random horizontal flipping. Recently, SimCLR [17] and other papers have shown that color jittering and random rotation greatly help representation learning. Hence, an investigation of using different data augmentation techniques for video action recognition is particularly useful. This may change the data pre-processing pipeline for all existing methods.

在图像识别领域,已经提出了许多数据增广方法,例如mixup[258]、cutout[31]、CutMix[254]、AutoAugment[27]、FastAutoAug[126]等。但是,视频动作识别仍采用2015年之前引入的基本数据增广技术[217,188],包括随机调整大小,随机裁剪和随机水平翻转。最近,SimCLR[17]和其他论文表明,色彩抖动和随机旋转极大地帮助了表示学习。因此,研究将不同的数据增广技术用于视频动作识别特别有用。这可能会改变所有现有方法的数据预处理流程。
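
One practical subtlety when porting image augmentations to video is that the same random parameters are usually applied to every frame of a clip; the sketch below shows that pattern for the basic random crop and horizontal flip mentioned above (crop size and clip shape are illustrative).

```python
import random
import torch

def augment_clip(clip: torch.Tensor, crop: int = 224) -> torch.Tensor:
    """clip: (T, C, H, W) with H, W >= crop; one crop/flip for the whole clip."""
    t, c, h, w = clip.shape
    top = random.randint(0, h - crop)          # sample the crop position once
    left = random.randint(0, w - crop)
    clip = clip[:, :, top:top + crop, left:left + crop]
    if random.random() < 0.5:                  # flip all frames or none
        clip = torch.flip(clip, dims=[-1])
    return clip

aug = augment_clip(torch.randn(16, 3, 256, 340))
```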

Video domain adaptation(视频域适配)

Domain adaptation (DA) has been studied extensively in recent years to address the domain shift problem. Despite the accuracy on standard datasets getting higher and higher, the generalization capability of current video models across datasets or domains is less explored. There is early work on video domain adaptation [193, 241, 89, 159]. However, this literature focuses on small-scale video DA with only a few overlapping categories, which may not reflect the actual domain discrepancy and may lead to biased conclusions. Chen et al. [15] introduce two larger-scale datasets to investigate video DA and find that aligning temporal dynamics is particularly useful. Pan et al. [152] adopt co-attention to solve the temporal misalignment problem. Very recently, Munro et al. [145] explore a multi-modal self-supervision method for fine-grained video action recognition and show the effectiveness of multi-modality learning in video DA. Shuffle and Attend [95] argues that aligning features of all sampled clips results in a sub-optimal solution due to the fact that not all clips include relevant semantics. Therefore, they propose to use an attention mechanism to focus more on informative clips and discard the non-informative ones. In conclusion, video DA is a promising direction, especially for researchers with less computing resources.

近年来,人们对域适配(Domain adaptation, DA)进行了广泛的研究,以解决领域偏移问题。尽管标准数据集上的准确性越来越高,但当前视频模型跨数据集或跨域的泛化能力却很少被研究。早期已有一些有关视频域适配的工作[193,241,89,159]。但是,这些文献集中在只有几个重叠类别的小规模视频DA上,这可能无法反映实际的域差异,并可能导致有偏差的结论。Chen等人[15]引入了两个更大规模的数据集来研究视频DA,并发现对齐时间动态性特别有用。Pan等人[152]采用共同注意力解决时间错位问题。最近,Munro等人[145]探索了一种用于细粒度视频动作识别的多模式自我监督方法,并展示了多模式学习在视频DA中的有效性。Shuffle and Attend[95]认为,由于并非所有剪辑都包含相关语义,对齐所有采样剪辑的特征会导致次优解。因此,他们建议使用注意力机制,将更多的注意力集中在信息丰富的剪辑上,而丢弃信息量少的剪辑。总之,视频DA是一个有前途的方向,特别是对于计算资源较少的研究人员而言。

Neural architecture search(神经结构搜索)

Neural architecture search (NAS) has attracted great interest in recent years and is a promising research direction. However, given its greedy need for computing resources, only a few papers have been published in this area [156, 163, 161, 178]. The TVN family [161], which jointly optimizes parameters and runtime, can achieve competitive accuracy with human-designed contemporary models, and run much faster (within 37 to 100 ms on a CPU and 10 ms on a GPU per 1 second video clip). AssembleNet [178] and AssembleNet++ [177] provide a generic approach to learn the connectivity among feature representations across input modalities, and show surprisingly good performance on Charades and other benchmarks. AttentionNAS [222] proposed a solution for spatio-temporal attention cell search. The found cell can be plugged into any network to improve the spatio-temporal features. All previous papers do show their potential for video understanding. Recently, some efficient ways of searching architectures have been proposed in the image recognition domain, such as DARTS [130], Proxyless NAS [11], ENAS [160], one-shot NAS [7], etc. It would be interesting to combine efficient 2D CNNs and efficient searching algorithms to perform video NAS at a reasonable cost.

神经结构搜索(Neural architecture search, NAS)近年来引起了极大的兴趣,是一个有前途的研究方向。但是,由于其对计算资源的巨大需求,该领域仅发表了几篇论文[156,163,161,178]。TVN系列[161]联合优化参数和运行时间,可以达到与人工设计的现代模型相竞争的准确性,并且运行速度更快(每1秒视频片段在CPU上需37至100ms,在GPU上需10ms)。AssembleNet[178]和AssembleNet++[177]提供了一种通用方法来学习跨输入模态的特征表示之间的连通性,并在Charades和其他基准测试中表现出令人惊讶的良好性能。AttentionNAS[222]提出了一种用于时空注意力单元搜索的解决方案。找到的单元可以插入任何网络以改善时空特征。以前的所有论文的确显示了其对视频理解的潜力。最近,在图像识别领域已经提出了一些高效的架构搜索方法,例如DARTS[130],Proxyless NAS[11],ENAS[160],one-shot NAS[7]等。将高效的2D CNN与高效的搜索算法相结合,以合理的成本进行视频NAS,将是一个有趣的方向。

Efficient model development(高效的模型开发)

Despite their accuracy, it is difficult to deploy deep learning based methods for video understanding problems in terms of real-world applications. There are several major challenges: (1) most methods are developed in offline settings, which means the input is a short video clip, not a video stream in an online setting; (2) most methods do not meet the real-time requirement; (3) incompatibility of 3D convolutions or other non-standard operators on non-GPU devices (e.g., edge devices). Hence, the development of efficient network architecture based on 2D convolutions is a promising direction. The approaches proposed in the image classification domain can be easily adapted to video action recognition, e.g. model compression, model quantization, model pruning, distributed training [68, 127], mobile networks [80, 265], mixed precision training, etc. However, more effort is needed for the online setting since the input to most action recognition applications is a video stream, such as surveillance monitoring. We may need a new and more comprehensive dataset for benchmarking online video action recognition methods. Lastly, using compressed videos might be desirable because most videos are already compressed, and we have free access to motion information.

尽管它们具有较高的准确性,但在现实应用中部署基于深度学习的视频理解方法仍然很困难。存在几个主要挑战:(1)大多数方法是在离线设置下开发的,这意味着输入是一个简短的视频剪辑,而不是在线设置中的视频流;(2)大多数方法不符合实时性要求;(3)3D卷积或其他非标准算子在非GPU设备(例如边缘设备)上的兼容性不佳。因此,基于2D卷积的高效网络架构的发展是一个有前途的方向。在图像分类领域中提出的方法可以很容易地适配到视频动作识别中,例如模型压缩,模型量化,模型剪枝,分布式训练[68、127],移动端网络[80、265],混合精度训练等。但是,由于大多数动作识别应用的输入是视频流(例如视频监控),因此在线设置还需要更多的努力。我们可能需要一个新的,更全面的数据集来对在线视频动作识别方法进行基准测试。最后,使用压缩视频可能是可取的,因为大多数视频已被压缩,并且我们可以免费获取运动信息。

New datasets(新数据集)

Data is as important as, if not more important than, model development for machine learning. For video action recognition, most datasets are biased towards spatial representations [124], i.e., most actions can be recognized by a single frame inside the video without considering the temporal movement. Hence, a new dataset in terms of long-term temporal modeling is required to advance video understanding. Furthermore, most current datasets are collected from YouTube. Due to copyright/privacy issues, the dataset organizer often only releases the YouTube id or video link for users to download and not the actual video. The first problem is that downloading the large-scale datasets might be slow for some regions. In particular, YouTube recently started to block massive downloading from a single IP. Thus, many researchers may not even get the dataset to start working in this field. The second problem is, due to region limitation and privacy issues, some videos are not accessible anymore. For example, the original Kinetics400 dataset has over 300K videos, but at this moment, we can only crawl about 280K videos. On average, we lose 5% of the videos every year. It is impossible to do fair comparisons between methods when they are trained and evaluated on different data.

对于机器学习而言,数据与模型开发同样重要,甚至更为重要。对于视频动作识别,大多数数据集偏向于空间表示[124],即,大多数动作可以通过视频内的单个帧来识别,而无需考虑时间上的运动。因此,就长期时间建模而言,需要新的数据集来推进视频理解。此外,当前大多数数据集都是从YouTube收集的。由于版权/隐私问题,数据集组织者通常仅发布YouTube ID或视频链接供用户下载,而不发布实际视频。第一个问题是在某些地区下载大规模数据集可能会很慢。特别是,YouTube最近开始阻止从单个IP进行大量下载。因此,许多研究人员甚至可能无法获得该数据集以开始在该领域中工作。第二个问题是,由于地区限制和隐私问题,一些视频不再可用。例如,原始的Kinetics400数据集包含超过30万个视频,但是目前,我们只能抓取约28万个视频。平均而言,我们每年损失5%的视频。当使用不同的数据进行训练和评估时,不可能在方法之间进行公平的比较。

Video adversarial attack(视频对抗攻击)

Adversarial examples have been well studied on image models. [199] first shows that an adversarial sample, computed by inserting a small amount of noise on the original image, may lead to a wrong prediction. However, limited work has been done on attacking video models. This task usually considers two settings, a white-box attack [86, 119, 66, 21] where the adversary can always get the full access to the model including exact gradients of a given input, or a black-box one [93, 245, 226], in which the structure and parameters of the model are blocked so that the attacker can only access the (input, output) pair through queries. Recent work ME-Sampler [260] leverages the motion information directly in generating adversarial videos, and is shown to successfully attack a number of video classification models using many fewer queries. In summary, this direction is useful since many companies provide APIs for services such as video classification, anomaly detection, shot detection, face detection, etc. In addition, this topic is also related to detecting DeepFake videos. Hence, investigating both attacking and defending methods is crucial to keeping these video services safe.

对抗样本在图像模型上已得到充分研究。[199]首先表明,通过在原始图像上加入少量噪声而计算出的对抗样本可能会导致错误的预测。但是,在攻击视频模型方面所做的工作有限。此任务通常考虑两种设置:白盒攻击[86,119,66,21],在这种攻击中,攻击者始终可以完全访问模型,包括给定输入的精确梯度;或者黑盒攻击[93,245,226],其中模型的结构和参数对攻击者不可见,攻击者只能通过查询访问(输入,输出)对。最近的工作ME-Sampler[260]直接利用运动信息来生成对抗视频,并被证明可以使用少得多的查询成功地攻击多种视频分类模型。总而言之,该方向很有用,因为许多公司为诸如视频分类,异常检测,镜头检测,面部检测等服务提供API。此外,该主题还与检测DeepFake视频有关。因此,研究攻击和防御方法对于确保这些视频服务的安全至关重要。

Zero-shot action recognition(零镜头动作识别)

Zero-shot learning (ZSL) has been trending in the image understanding domain, and has recently been adapted to video action recognition. Its goal is to transfer the learned knowledge to classify previously unseen categories. Due to (1) the expensive data sourcing and annotation and (2) the fact that the set of possible human actions is huge, zero-shot action recognition is a very useful task for real-world applications. There are many early attempts [242, 88, 243, 137, 168, 57] in this direction. Most of them follow a standard framework, which is to first extract visual features from videos using a pre-trained network, and then train a joint model that maps the visual embedding to a semantic embedding space. In this manner, the model can be applied to new classes by finding the test class whose embedding is the nearest neighbor of the model's output. A recent work, URL [279], proposes to learn a universal representation that generalizes across datasets. Following URL [279], [10] present the first end-to-end ZSL action recognition model. They also establish a new ZSL training and evaluation protocol, and provide an in-depth analysis to further advance this field. Inspired by the success of pre-training followed by zero-shot transfer in the NLP domain, we believe ZSL action recognition is a promising research topic.

零镜头学习(Zero-shot learning, ZSL)在图像理解领域已成为趋势,并且最近已被引入视频动作识别。它的目标是迁移学到的知识,对以前未见过的类别进行分类。由于(1)数据获取和标注成本高昂,以及(2)人类可能进行的动作种类非常多,零镜头动作识别对于现实应用是非常有用的任务。在这个方向上有许多早期尝试[242、88、243、137、168、57]。它们中的大多数遵循一个标准框架:首先使用预训练网络从视频中提取视觉特征,然后训练一个将视觉嵌入映射到语义嵌入空间的联合模型。通过这种方式,可以通过找到嵌入与模型输出最接近的测试类别,将模型应用于新类别。最近的工作URL[279]提出学习一种可以跨数据集泛化的通用表示。在URL[279]之后,[10]提出了第一个端到端的ZSL动作识别模型。他们还建立了新的ZSL训练和评估协议,并提供了深入的分析以进一步推进该领域。受NLP领域中先预训练再零镜头迁移的成功所鼓舞,我们认为ZSL动作识别是一个有前途的研究主题。
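
The standard ZSL recipe described above boils down to a nearest-neighbor lookup in the shared embedding space; in the sketch below, random vectors stand in for the output of the visual-to-semantic mapping network and for the class-name (semantic) embeddings.

```python
import numpy as np

def zero_shot_classify(video_emb: np.ndarray, class_embs: np.ndarray) -> int:
    """Pick the unseen class whose semantic embedding is closest (cosine).

    video_emb:  (D,) mapped embedding of the test video.
    class_embs: (num_unseen_classes, D) embeddings of the class names.
    """
    v = video_emb / np.linalg.norm(video_emb)
    c = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ v))               # index of the predicted class

pred = zero_shot_classify(np.random.randn(300), np.random.randn(20, 300))
```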

Weakly-supervised video action recognition(弱监督的视频动作识别)

Building a high-quality video action recognition dataset [190, 100] usually requires multiple laborious steps: 1) first sourcing a large amount of raw videos, typically from the internet; 2) removing videos irrelevant to the categories in the dataset; 3) manually trimming the video segments that have actions of interest; 4) refining the categorical labels. Weakly-supervised action recognition explores how to reduce the cost for curating training data.

建立高质量的视频动作识别数据集[190,100]通常需要多个费力的步骤:1)首先通常从互联网上获取大量原始视频;2)删除与数据集中的类别无关的视频;3)手动修剪具有感兴趣动作的视频片段;4)细化分类标签。弱监督的动作识别探索了如何减少策划训练数据的成本。

The first direction of research [19, 60, 58] aims to reduce the cost of sourcing videos and accurate categorical labeling. They design training methods that use training data such as action-related images or partially annotated videos, gathered from publicly available sources such as the Internet. Thus this paradigm is also referred to as webly-supervised learning [19, 58]. Recent work on omni-supervised learning [60, 64, 38] also follows this paradigm but features bootstrapping on unlabeled videos by distilling the models' own inference results.

研究的第一个方向[19,60,58]旨在降低视频获取和精确分类标注的成本。他们设计的训练方法使用与动作相关的图像或部分标注的视频等训练数据,这些数据是从可公开获取的来源(例如互联网)收集的。因此,这种范式也被称为网络监督学习(webly-supervised learning)[19,58]。最近关于全方位监督学习(omni-supervised learning)[60,64,38]的工作也遵循了这一范式,但其特点是通过提炼模型自身的推理结果,在未标注的视频上进行自举。

The second direction aims at removing trimming, the most time-consuming part of annotation. UntrimmedNet [216] proposed a method to learn an action recognition model on untrimmed videos with only categorical labels [149, 172]. This task is also related to weakly-supervised temporal action localization, which aims to automatically generate the temporal span of the actions. Several papers propose to simultaneously [155] or iteratively [184] learn models for these two tasks.

第二个方向旨在省去修剪(trimming)这一标注中最耗时的部分。UntrimmedNet[216]提出了一种仅使用类别标签在未修剪视频上学习动作识别模型的方法[149,172]。此任务还与弱监督时序动作定位有关,后者旨在自动生成动作的时间跨度。几篇论文提出同时[155]或迭代地[184]学习这两个任务的模型。

Fine-grained video action recognition(细粒度的视频动作识别)

Popular action recognition datasets, such as UCF101 [190] or Kinetics400 [100], mostly comprise actions happening in various scenes. However, models learned on these datasets could overfit to contextual information irrelevant to the actions [224, 227, 24]. Several datasets have been proposed to study the problem of fine-grained action recognition, which could examine the models’ capacities in modeling action specific information. These datasets comprise fine-grained actions in human activities such as cooking [28, 108, 174], working [103] and sports [181, 124]. For example, FineGym [181] is a recent large dataset annotated with different moves and sub-actions in gymnastic videos.

流行的动作识别数据集,例如UCF101[190]或Kinetics400[100],主要包含发生在各种场景中的动作。但是,在这些数据集上学习的模型可能会过拟合与动作无关的上下文信息[224、227、24]。已经提出了一些数据集来研究细粒度动作识别问题,它们可以检验模型对动作特定信息建模的能力。这些数据集包括人类活动中的细粒度动作,例如烹饪[28,108,174],工作[103]和运动[181,124]。例如,FineGym[181]是一个近期的大型数据集,标注了体操视频中的不同动作和子动作。

Egocentric action recognition(以自我为中心的动作识别)

Recently, large-scale egocentric action recognition [29, 28] has attracted increasing interest with the emerging of wearable cameras devices. Egocentric action recognition requires a fine understanding of hand motion and the interacting objects in the complex environment. A few papers leverage object detection features to offer fine object context to improve egocentric video recognition [136, 223, 229, 180]. Others incorporate spatio-temporal attention [192] or gaze annotations [131] to localize the interacting object to facilitate action recognition. Similar to third-person action recognition, multi-modal inputs (e.g., optical flow and audio) have been demonstrated to be effective in egocentric action recognition [101].

最近,随着可穿戴相机设备的出现,大规模的以自我为中心的动作识别(egocentric action recognition)[29,28]引起了越来越多的兴趣。以自我为中心的动作识别需要对复杂环境中的手部动作和交互物体有细致的理解。一些论文利用目标检测特征来提供细粒度的物体上下文,以改善以自我为中心的视频识别[136,223,229,180]。另一些工作则结合时空注意力[192]或注视标注[131]来定位交互物体以促进动作识别。与第三人称动作识别类似,多模式输入(例如光流和音频)已被证明在以自我为中心的动作识别中是有效的[101]。

Multi-modality(多模式)

Multi-modal video understanding has attracted increasing attention in recent years [55, 3, 129, 167, 154, 2, 105]. There are two main categories for multi-modal video understanding. The first group of approaches use multimodalities such as scene, object, motion, and audio to enrich the video representations. In the second group, the goal is to design a model which utilizes modality information as a supervision signal for pre-training models [195, 138, 249, 62, 2].

近年来,多模式视频理解已引起越来越多的关注[55,3,129,167,154,2,105]。多模式视频理解有两个主要类别。第一组方法使用场景,对象,运动和音频等多模式来丰富视频表示。在第二组中,目标是设计一个模型,该模型利用模态信息作为预训练模型的监督信号[195、138、249、62、2]。

Multi-modality for comprehensive video understanding. Learning a robust and comprehensive representation of video is extremely challenging due to the complexity of semantics in videos. Video data often includes variations in different forms including appearance, motion, audio, text or scene [55, 129, 166]. Therefore, utilizing these multi-modal representations is a critical step in understanding video content more efficiently. The multi-modal representations of video can be approximated by gathering representations of various modalities such as scene, object, audio, motion, appearance and text. Ngiam et al. [148] was an early attempt that suggested using multiple modalities to obtain better features. They utilized videos of lips and their corresponding speech for multi-modal representation learning. Miech et al. [139] proposed a mixture-of-embedding-experts model to combine multiple modalities including motion, appearance, audio, and face features and learn the shared embedding space between these modalities and text. Roig et al. [175] combine multiple modalities such as action, scene, object and acoustic event features in a pyramidal structure for action recognition. They show that adding each modality improves the final action recognition accuracy. Both CE [129] and MMT [55] follow a similar research line to [139] where the goal is to combine multiple modalities to obtain a comprehensive representation of video for joint video-text representation learning. Piergiovanni et al. [166] utilized textual data together with video data to learn a joint embedding space. Using this learned joint embedding space, the method is capable of doing zero-shot action recognition. This line of research is promising due to the availability of strong semantic extraction models and also the success of transformers on both vision and language tasks.

多模式的全面视频理解。由于视频语义的复杂性,学习鲁棒而全面的视频表示非常具有挑战性。视频数据通常包括不同形式的变化,包括外观,运动,音频,文本或场景[55、129、166]。因此,利用这些多模式表示形式是更有效地理解视频内容的关键步骤。视频的多模式表示可以通过收集各种模式(例如场景,对象,音频,运动,外观和文本)的表示来近似。Ngiam等人[148]是较早的尝试,建议使用多种模态来获得更好的特征。他们利用嘴唇的视频及其相应的语音进行多模式表示学习。Miech等人[139]提出了一种混合嵌入专家(mixture-of-embedding-experts)模型,以结合包括运动,外观,音频和面部特征在内的多种模态,并学习这些模态与文本之间的共享嵌入空间。Roig等人[175]在金字塔结构中组合了动作,场景,物体和声音事件特征等多种模式,以进行动作识别。他们表明,添加每种模式都可以提高最终动作识别的准确性。CE[129]和MMT[55]都遵循与[139]类似的研究思路,其目标是结合多种模式以获得视频的全面表示,进行联合的视频文本表示学习。Piergiovanni等人[166]利用文本数据和视频数据来学习联合嵌入空间。使用该学习到的联合嵌入空间,该方法能够执行零镜头动作识别。由于强大的语义提取模型的可用性以及Transformer在视觉和语言任务上的成功,这一研究方向是有前途的。

Multi-modality for self-supervised video representation learning. Most videos contain multiple modalities such as audio or text/caption. These modalities are a great source of supervision for learning video representations [3, 144, 154, 2, 162]. Korbar et al. [105] incorporated the natural synchronization between audio and video as a supervision signal in their contrastive learning objective for self-supervised representation learning. In multi-modal self-supervised representation learning, the dataset plays an important role. VideoBERT [195] collected 310K cooking videos from YouTube. However, this dataset is not publicly available. Similar to BERT, VideoBERT used a “masked language model” training objective and also quantized the visual representations into “visual words”. Miech et al. [140] introduced the HowTo100M dataset in 2019. This dataset includes 136M clips from 1.22M videos with their corresponding text. They collected the dataset from YouTube with the aim of obtaining instructional videos (how to perform an activity). In total, it covers 23.6K instructional tasks. MIL-NCE [138] used this dataset for self-supervised cross-modal representation learning. They tackled the problem of visually misaligned narrations by considering multiple positive pairs in the contrastive learning objective. ActBERT [275] utilized the HowTo100M dataset for pre-training of the model in a self-supervised way. They incorporated visual, action, text and object features for cross-modal representation learning. Recently AVLnet [176] and MMV [2] considered three modalities, visual, audio and language, for self-supervised representation learning. This research direction is also increasingly getting more attention due to the success of contrastive learning on many vision and language tasks and the access to an abundance of unlabeled multi-modal video data on platforms such as YouTube, Instagram or Flickr. The top section of Table 6 compares multi-modal self-supervised representation learning methods. We will discuss more work on video-only representation learning in the next section.

面向自我监督视频表示学习的多模态学习。大多数视频包含多种模态,例如音频或文本/字幕,这些模态是学习视频表示的重要监督来源[3, 144, 154, 2, 162]。Korbar等人[105]将音频与视频之间的天然同步作为监督信号,纳入其自我监督表示学习的对比学习目标中。在多模态自我监督表示学习中,数据集起着重要作用。VideoBERT[195]从YouTube收集了31万个烹饪视频,但该数据集并未公开。与BERT类似,VideoBERT采用"掩码语言模型"训练目标,并将视觉表示量化为"视觉单词"。Miech等人[140]在2019年提出了HowTo100M数据集,该数据集包含来自122万个视频的1.36亿个剪辑及其对应文本。他们从YouTube收集这些数据,目的是获取教学视频(如何完成某项活动),总共涵盖2.36万个教学任务。MIL-NCE[138]使用该数据集进行自我监督的跨模态表示学习,通过在对比学习目标中考虑多个正样本对来解决旁白与画面在视觉上错位的问题。ActBERT[275]利用HowTo100M数据集以自我监督的方式对模型进行预训练,结合视觉、动作、文本和物体特征进行跨模态表示学习。最近,AVLnet[176]和MMV[2]考虑了视觉、音频和语言三种模态进行自我监督表示学习。由于对比学习在许多视觉和语言任务上的成功,以及YouTube、Instagram、Flickr等平台上存在大量可获取的未标注多模态视频数据,这一研究方向也日益受到关注。表6的顶部比较了多模态自我监督表示学习方法。我们将在下一节讨论更多仅用视频的表示学习工作。
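To illustrate how multiple positive pairs enter the contrastive objective, the MIL-NCE loss of [138] can be written roughly as follows (our notation, simplified):

```latex
\mathcal{L}_{\text{MIL-NCE}} \;=\; -\sum_{i}
\log \frac{\sum_{(x,y)\in\mathcal{P}_i} e^{\,f(x)^{\top} g(y)}}
{\sum_{(x,y)\in\mathcal{P}_i} e^{\,f(x)^{\top} g(y)}
 \;+\; \sum_{(x',y')\in\mathcal{N}_i} e^{\,f(x')^{\top} g(y')}}
```

Here f and g embed video clips and narrations, \mathcal{P}_i is a set of candidate positive clip-narration pairs for clip i (e.g., temporally neighboring narrations that may or may not be visually aligned), and \mathcal{N}_i is a set of negative pairs. Summing over several candidate positives, rather than relying on a single one, is what makes the objective robust to misaligned narrations.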

Table 6. Comparison of self-supervised video representation learning methods. The top section shows multi-modal video representation learning approaches and the bottom section shows video-only representation learning methods. From left to right, we show the self-supervised training setting, e.g., dataset, modalities, resolution, and architecture. The rightmost columns show action recognition results (linear evaluation and fine-tuning) on UCF101 and HMDB51, which measure the quality of the self-supervised pre-trained model. HTM: HowTo100M, YT8M: YouTube8M, AS: AudioSet, IG-K: IG-Kinetics, K400: Kinetics400, and K600: Kinetics600.
表6:自我监督视频表示学习方法的比较。上半部分为多模态视频表示学习方法,下半部分为仅用视频的表示学习方法。从左到右,我们给出了自我监督训练的设置,例如数据集、模态、分辨率和网络结构。最右侧各列给出了在UCF101和HMDB51两个数据集上的动作识别结果(线性评估与微调),用以衡量自我监督预训练模型的质量。HTM:HowTo100M,YT8M:YouTube8M,AS:AudioSet,IG-K:IG-Kinetics,K400:Kinetics400,K600:Kinetics600。

Multi-modal video representation learning methods (top section):

| Method | Dataset | Video | Audio | Text | Size | Backbone | Venue | UCF101 Linear | UCF101 FT | HMDB51 Linear | HMDB51 FT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AVTS [105] | K400 | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2018 | - | 86.2 | - | 52.3 |
| AVTS [105] | AS | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2018 | - | 89.1 | - | 58.1 |
| CBT [194] | K600+ | X | - | X | 112 | S3D | arXiv 2019 | 54.0 | 79.5 | 29.5 | 44.6 |
| MIL-NCE [138] | HTM | X | - | X | 224 | S3D | CVPR 2020 | 82.7 | 91.3 | 53.1 | 61.0 |
| ELO [162] | YT8M | X | X | - | 224 | R(2+1)D-50 | CVPR 2020 | - | 93.8 | 64.5 | 67.4 |
| XDC [3] | K400 | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2020 | - | 86.8 | - | 52.6 |
| XDC [3] | AS | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2020 | - | 93.0 | - | 63.7 |
| XDC [3] | IG65M | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2020 | - | 94.6 | - | 66.5 |
| XDC [3] | IG-K | X | X | - | 224 | R(2+1)D-18 | NeurIPS 2020 | - | 95.5 | - | 68.9 |
| AVID [144] | AS | X | X | - | 224 | R(2+1)D-50 | arXiv 2020 | - | 91.5 | - | 64.7 |
| GDT [154] | K400 | X | X | - | 112 | R(2+1)D-18 | arXiv 2020 | - | 89.3 | - | 60.0 |
| GDT [154] | AS | X | X | - | 112 | R(2+1)D-18 | arXiv 2020 | - | 92.5 | - | 66.1 |
| GDT [154] | IG65M | X | X | - | 112 | R(2+1)D-18 | arXiv 2020 | - | 95.2 | - | 72.8 |
| MMV [2] | AS+HTM | X | X | X | 200 | S3D | NeurIPS 2020 | 89.6 | 92.5 | 62.6 | 69.6 |
| MMV [2] | AS+HTM | X | X | X | 200 | TSM-50x2 | NeurIPS 2020 | 91.8 | 95.2 | 67.1 | 75.0 |

Video-only representation learning methods (bottom section):

| Method | Dataset | Video | Audio | Text | Size | Backbone | Venue | UCF101 Linear | UCF101 FT | HMDB51 Linear | HMDB51 FT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OPN [115] | UCF101 | X | - | - | 227 | VGG | ICCV 2017 | - | 59.6 | - | 23.8 |
| 3D-RotNet [94] | K400 | X | - | - | 112 | R3D | arXiv 2018 | - | 62.9 | - | 33.7 |
| ST-Puzzle [102] | K400 | X | - | - | 224 | R3D | AAAI 2019 | - | 63.9 | - | 33.7 |
| VCOP [240] | UCF101 | X | - | - | 112 | R(2+1)D | CVPR 2019 | - | 72.4 | - | 30.9 |
| DPC [71] | K400 | X | - | - | 128 | R-2D3D | ICCVW 2019 | - | 75.7 | - | 35.7 |
| SpeedNet [6] | K400 | X | - | - | 224 | S3D-G | CVPR 2020 | - | 81.1 | - | 48.8 |
| MemDPC [72] | K400 | X | - | - | 224 | R-2D3D | ECCV 2020 | 54.1 | 86.1 | 30.5 | 54.5 |
| CoCLR [73] | K400 | X | - | - | 128 | S3D | NeurIPS 2020 | 74.5 | 87.9 | 46.1 | 54.6 |
| CVRL [167] | K400 | X | - | - | 224 | R3D-50 | arXiv 2020 | - | 92.2 | - | 66.7 |
| CVRL [167] | K600 | X | - | - | 224 | R3D-50 | arXiv 2020 | - | 93.4 | - | 68.0 |

Self-supervised video representation learning(自我监督的视频表示学习)

Self-supervised learning has attracted more attention recently because it can leverage a large amount of unlabeled data by designing a pretext task that obtains free supervisory signals from the data itself. It first emerged in image representation learning. On images, a first stream of papers aimed at designing pretext tasks that complete missing information, such as image coloring [262] and image reordering [153, 61, 263]. A second stream of papers used instance discrimination [235] as the pretext task and contrastive losses [235, 151] for supervision; they learn visual representations by modeling the visual similarity of object instances without class labels [235, 75, 201, 18, 17].

自我监督学习最近受到越来越多的关注,因为它能够通过设计预置任务从数据本身获取无需人工标注的监督信号,从而利用大量未标注数据。它最早出现在图像表示学习中。在图像上,第一类论文旨在设计补全缺失信息的预置任务,例如image coloring[262]和image reordering[153, 61, 263]。第二类论文使用instance discrimination[235]作为预置任务,并使用contrastive losses[235, 151]作为监督,它们通过在没有类别标签的情况下对物体实例之间的视觉相似性进行建模来学习视觉表示[235, 75, 201, 18, 17]。
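For reference, the contrastive objective underlying most instance discrimination methods is typically an InfoNCE-style loss; a generic form (our notation, not tied to any specific paper) is:

```latex
\mathcal{L}_{q} \;=\; -\log
\frac{\exp(q^{\top} k^{+}/\tau)}
{\exp(q^{\top} k^{+}/\tau) \;+\; \sum_{j=1}^{K} \exp(q^{\top} k_{j}^{-}/\tau)}
```

where q and k^{+} are embeddings of two augmented views of the same instance, the k_{j}^{-} are embeddings of other instances, and \tau is a temperature. Minimizing this loss pulls views of the same instance together while pushing different instances apart, which is exactly the similarity modeling described above.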

Self-supervised learning is also viable for videos. Compared with images, videos have an additional axis, the temporal dimension, which we can use to craft pretext tasks. Information-completion tasks for this purpose include predicting the correct order of shuffled frames [141, 52] and video clips [240]. Jing et al. [94] focus only on the spatial dimension by predicting the rotation angles of rotated video clips. Combining temporal and spatial information, several tasks have been introduced to solve a space-time cubic puzzle, anticipate future frames [208], forecast long-term motions [134], and predict motion and appearance statistics [211]. RSPNet [16] and visual tempo [247] exploit the relative speed between video clips as a supervision signal.

自我监督学习对视频同样可行。与图像相比,视频多了一个轴,即时间维度,我们可以利用它来设计预置任务。为此目的的信息补全任务包括预测被打乱的帧[141, 52]和视频剪辑[240]的正确顺序。Jing等人[94]仅关注空间维度,通过预测旋转后的视频剪辑的旋转角度进行学习。结合时间和空间信息,研究者还提出了若干任务,用于求解时空立方拼图、预测未来帧[208]、预测长期运动[134]以及预测运动和外观统计量[211]。RSPNet[16]和visual tempo[247]利用视频剪辑之间的相对速度作为监督信号。
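As a concrete example of how such pretext labels can be generated without any annotation, the sketch below shuffles the clips of a video and uses the permutation index as the classification target, in the spirit of clip-order prediction [240]; the function name, the choice of 3 clips, and the clip length are illustrative assumptions rather than the exact setup of any cited paper.

```python
import itertools
import random
import torch

# All 6 permutations of 3 clips; the permutation index serves as the pseudo-label.
PERMUTATIONS = list(itertools.permutations(range(3)))

def clip_order_sample(video, clip_len=8):
    """Split a video tensor (T, C, H, W) into 3 consecutive clips,
    shuffle them, and return (shuffled_clips, permutation_label)."""
    clips = [video[i * clip_len:(i + 1) * clip_len] for i in range(3)]
    label = random.randrange(len(PERMUTATIONS))
    shuffled = torch.stack([clips[j] for j in PERMUTATIONS[label]])
    return shuffled, label  # shuffled: (3, clip_len, C, H, W)

video = torch.randn(24, 3, 112, 112)      # 24 RGB frames at 112x112
clips, label = clip_order_sample(video)
# A 3D CNN would encode each clip and a classifier would predict `label` (0..5).
```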

The added temporal axis also provides flexibility in designing instance discrimination pretext tasks [67, 167]. Inspired by the decoupling of 3D convolutions into spatially and temporally separable convolutions [239], Zhang et al. [266] proposed to decouple video representation learning into two sub-tasks: spatial contrast and temporal contrast. Recently, Han et al. [72] proposed memory-augmented dense predictive coding for self-supervised video representation learning. They split each video into several blocks, and the embedding of a future block is predicted from a combination of condensed representations stored in memory.

增加的时间轴也为设计实例判别类预置任务[67, 167]提供了灵活性。受将3D卷积解耦为空间和时间可分离卷积[239]的启发,Zhang等人[266]提出将视频表示学习解耦为两个子任务:空间对比和时间对比。最近,Han等人[72]提出了用于自我监督视频表示学习的记忆增强的密集预测编码(memory augmented dense predictive coding)。他们将每个视频分成若干块,并通过组合记忆中的压缩表示来预测未来块的嵌入。
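A minimal sketch of the predictive-coding idea (predict the embedding of a future block from past blocks and score it contrastively against other videos' blocks) is shown below; it deliberately omits the memory bank of [72], and all names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockPredictor(nn.Module):
    """Sketch: aggregate past block embeddings with a GRU and
    predict the embedding of the next block."""
    def __init__(self, dim=256):
        super().__init__()
        self.agg = nn.GRU(dim, dim, batch_first=True)
        self.pred = nn.Linear(dim, dim)

    def forward(self, past_blocks):            # (B, N_past, dim)
        _, h = self.agg(past_blocks)
        return self.pred(h[-1])                # (B, dim) predicted future embedding

def predictive_contrastive_loss(pred, future, temperature=0.1):
    """Positive: the true future block of the same video.
    Negatives: future blocks of other videos in the batch."""
    pred = F.normalize(pred, dim=-1)
    future = F.normalize(future, dim=-1)
    logits = pred @ future.t() / temperature
    targets = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, targets)

past = torch.randn(4, 5, 256)     # 5 past block embeddings per video
future = torch.randn(4, 256)      # embedding of the true next block
loss = predictive_contrastive_loss(BlockPredictor()(past), future)
```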

The temporal continuity in videos also inspires researchers to design pretext tasks around correspondence. Wang et al. [221] proposed to learn representations by performing cycle-consistency tracking: specifically, they track the same object backward and then forward through consecutive video frames, and use the inconsistency between the start and end points as the loss function. TCC [39] is a concurrent work; instead of tracking local objects, it uses cycle-consistency to perform frame-wise temporal alignment as a supervision signal. [120] is a follow-up to [221] that utilizes both object-level and pixel-level correspondence across video frames. Recently, long-range temporal correspondence was modeled as a random walk on a graph to help learn video representations in [87].

视频中的时间连续性也启发研究人员围绕对应关系(correspondence)设计其他预置任务。Wang等人[221]提出通过循环一致性跟踪来学习表示:具体来说,他们在连续的视频帧中先向后再向前跟踪同一物体,并将起点和终点之间的不一致作为损失函数。TCC[39]是一项同期工作,它不跟踪局部物体,而是使用循环一致性进行逐帧时间对齐,并将其作为监督信号。[120]是[221]的后续工作,同时利用了视频帧之间的物体级和像素级对应关系。最近,[87]将长程时间对应关系建模为图上的随机游走,以帮助学习视频表示。
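The cycle-consistency idea of [221] can be summarized, in simplified form and in our own notation, as tracking a patch backward in time and then forward again, and penalizing the gap between where the patch started and where the forward track ends:

```latex
\mathcal{L}_{\text{cycle}} \;=\;
\big\lVert\, p_{t} \;-\; \mathcal{T}_{t-k \rightarrow t}\!\big(\mathcal{T}_{t \rightarrow t-k}(p_{t})\big) \,\big\rVert^{2}
```

where p_t is the location of a patch in frame t and \mathcal{T}_{t \rightarrow t-k} denotes the learned tracker applied backward over k frames. A perfectly consistent tracker returns to the starting point, so this objective requires no manual labels.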

We compare video-only self-supervised representation learning methods in the bottom section of Table 6. A clear trend can be observed: recent papers achieve much better linear evaluation accuracy, and fine-tuning accuracy comparable to supervised pre-training. This suggests that self-supervised learning could be a promising direction towards learning better video representations.

我们在表6的下半部分比较了仅用视频的自我监督表示学习方法。可以观察到一个明显的趋势:最近的论文取得了明显更好的线性评估精度,以及与有监督预训练相当的微调精度。这表明自我监督学习可能是学习更好视频表示的一个有前途的方向。

Conclusion(总结)

In this survey, we present a comprehensive review of more than 200 recent deep learning based approaches to video action recognition. Although this is not an exhaustive list, we hope the survey serves as an easy-to-follow tutorial for those seeking to enter the field, and as an inspiring discussion for those seeking to find new research directions.

在本综述中,我们对200多种基于深度学习的视频动作识别方法进行了全面回顾。尽管这并不是一个详尽的列表,但我们希望该综述对正寻求进入该领域的研究者来说是一个易于理解的教程,对于那些寻求新的研究方向的研究者来说是一个启发性的讨论。

Acknowledgement(致谢)

We would like to thank Peter Gehler, Linchao Zhu and Thomas Brady for constructive feedback and fruitful discussions.

我们要特别感谢Peter Gehler,Linchao Zhu和Thomas Brady的建设性反馈和富有成果的讨论。