CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

Bang Yang1,2, Tong Zhang2, and Yuexian Zou1,2(B)
1 ADSPLAB, Shenzhen Graduate School, Peking University, Shenzhen, China
{yangbang,zouyx}@pku.edu.cn
2 Peng Cheng Laboratory, Shenzhen, China
zhangt02@pcl.ac.cn

Abstract. For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to encode the video content, and then a task-oriented network is fine-tuned from scratch to cope with caption generation. This paper first investigates the impact of the recently proposed CLIP (Contrastive Language-Image Pre-training) on video captioning. Through an empirical study on INP vs. CLIP, we identify the potential deficiencies of INP and explore the key factors for accurate description generation. The results show that the INP-based model struggles to capture the semantics of concepts and is sensitive to irrelevant background information. By contrast, the CLIP-based model significantly improves the caption quality and highlights the importance of concept-aware representation learning. With these findings, we propose Dual Concept Detection (DCD) to further inject concept knowledge into the model during training. DCD is an auxiliary task that requires a caption model to learn both the correspondence between video content and concepts and the co-occurrence relations between concepts. Experiments on MSR-VTT and VATEX demonstrate the effectiveness of DCD, and the visualization results further reveal the necessity of learning concept-aware representations.

Keywords: Video captioning · Representation learning · Concept detection

1 Introduction

Video captioning aims to describe video content with fluent sentences. Given the difficulties of learning effective video representations from limited data [37,38], mainstream video captioning methods adopt the Encoder-Decoder framework [36] with a "pre-training and fine-tuning" paradigm, where ImageNet Pre-training (INP) is usually used to help encode the video content, and a task-oriented network is fine-tuned from scratch to cope with caption generation. However, using INP across discrepant tasks may bring limited benefit [9].

Recent advances in video captioning [4,17,20,21,39,42,43] are built upon the default use of INP and, meanwhile, their performance gradually becomes saturated on MSR-VTT [38]. Thus, immediate questions arise: is INP causing a performance bottleneck for video captioning? If so, why? To answer these questions, we turn our attention to CLIP (Contrastive Language-Image Pre-training) [28], which has drawn great attention in the community due to its strong zero-shot transfer ability to various vision tasks.

Fig. 1. ImageNet Pre-training (INP) vs. Contrastive Language-Image Pre-training (CLIP). When using CLIP rather than INP to help encode the video content, performance is greatly improved (left), which can be attributed to the better learning of concept-aware representations (right). The right part visualizes Grad-CAM [30] for the ground-truth caption "spongebob squarepants blows bubbles". The caption below each example is the model's actual prediction.
As a branch of vision-language pre-training research [16,32], CLIP is unbounded by a fixed set of labels, and its pre-training data, i.e., 400M noisy image-text pairs crawled from the Internet, is larger than ImageNet by an order of magnitude. We therefore hypothesize that CLIP has great potential for video captioning.

In this paper, we carry out an empirical study on INP vs. CLIP to shed new light on the potential deficiencies of INP for caption generation and to explore the key to prompting accurate video captioning. Figure 1 gives a snapshot of our experimental study. We can see from the curves that captioning performance is significantly improved when using CLIP rather than INP to help encode the video content. This performance gap can be interpreted through the visualized example shown on the right, where INP diverts the video caption model's focus away from the critical regions of the concepts "spongebob squarepants" and "bubbles", whereas CLIP's results are just the opposite. As a result, the empirical study shows that concept-aware representation learning does matter to accurate caption generation.

Based on the above finding, we propose Dual Concept Detection (DCD) to spur video caption models to learn concept-aware video and text representations during training. Specifically, DCD requires the caption model to infer relevant concepts based on partial information from either videos or text descriptions. To achieve that, the model has to build the correspondence between video content and concepts and learn the concepts' co-occurrence relations.

To summarize, we make the following contributions. (1) We carry out an empirical study on INP vs. CLIP for video captioning. The results reveal the deficiencies of INP and suggest the importance of concept-aware representation learning in prompting accurate captioning. (2) Motivated by the success of CLIP for video captioning, we introduce Dual Concept Detection, an auxiliary task that can be jointly trained with video caption models to strengthen their learning of concept-aware representations during training. (3) Experiments on MSR-VTT [38] and VATEX [37] verify the effectiveness of our approach, and ablation studies clearly show what leads to the improvements.

2 Related Work

Video Captioning. The "pre-training and fine-tuning" paradigm has been commonly used from the very first neural-network-based methods [35,36] to the present day. Recent focuses in video captioning include, but are not limited to, (1) learning fine-grained representations via graph neural models [4,25,43] or hierarchical encoders [39], (2) distilling knowledge mutually [20] or from external models [43], and (3) introducing new paradigms like non-autoregressive captioning [21,40] and open-book captioning [42]. However, recent advanced methods pay less attention to the effect of pre-training models and thus suffer from a potential performance bottleneck.

Vision-Language Pre-training (VLP). Unlike single-modal understanding, VLP aims to bridge vision and language by modeling their interactions. Generally, existing VLP methods can be categorized into two groups: (1) learning vision-language joint representations [13,16,18,22,32] and (2) learning visual representations from natural language supervision [6,28,29]. For the former, a deep cross-modal Transformer [33] is usually used to fuse multi-modal inputs and learn contextualized features. Among the latter, CLIP [28] drew great attention because of its strong zero-shot transfer ability.
Unlike the focused tasks in recent works [23,31], we analyze the effect of CLIP on video captioning.

Multi-task Learning (MTL). The goal of MTL is to learn shared representations from multiple related tasks to improve the generalization performance of all tasks [3]. For video captioning, Pasunuru and Bansal [27] proposed to train a video caption model with two directed-generation tasks, whose supervision signals were obtained from external datasets. By contrast, more attempts have been made to construct auxiliary tasks by deriving additional supervision signals from the original annotations, e.g., predicting mined latent topics [5] or extracted attributes [12,15,41] solely based on the input videos. The auxiliary task proposed in this paper instead takes either video content or textual descriptions as input.

3 On INP vs. CLIP for Video Captioning

This section aims to investigate the potential deficiencies of INP and explore the key to generating accurate descriptions. We organize this section as follows. We first briefly review video captioning in Sect. 3.1. Then, we introduce a Transformer baseline for video captioning in Sect. 3.2, where we will show how to integrate INP or CLIP models with the baseline. Finally, based on the experimental setup in Sect. 3.3, we present our analysis in Sect. 3.4.

Fig. 2. Pipeline of video captioning at the training stage. In Sect. 3, we will review (a) and focus on the encoder part of video caption models, in which conventional methods usually use ImageNet pre-training models to encode the video content. In Sect. 4, we will elaborate on our proposed Dual Concept Detection (DCD) shown in (b), where "SS" denotes sparse sampling.

3.1 Overview of Video Captioning

As shown in Fig. 2(a), conventional video captioning methods build upon the Encoder-Decoder framework. Formally, given a sequence of sampled video frames (or snippets) V = {v_1, v_2, ..., v_N} of length N, the caption encoder aims to encode V into video features F ∈ R^{d_h × N}:

F = Encoder(V).    (1)

After the encoding stage, the decoding stage generates a caption Y = {y_1, y_2, ..., y_T} of length T in a typical autoregressive manner. Specifically, at the t-th time step, previously generated words Y
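To make Eq. (1) and the autoregressive decoding stage concrete, below is a minimal PyTorch sketch of such an Encoder-Decoder captioner. It is an illustrative approximation under assumed settings (feature dimension, layer counts, vocabulary size, greedy decoding), not the paper's exact Transformer baseline from Sect. 3.2; positional encodings and the end-of-sentence stopping criterion are omitted for brevity.

```python
import torch
import torch.nn as nn


class CaptionModel(nn.Module):
    """Minimal Encoder-Decoder captioner in the spirit of Fig. 2(a) and Eq. (1).

    Hyper-parameters below are illustrative assumptions, not the paper's
    actual configuration.
    """

    def __init__(self, d_in=2048, d_h=512, vocab_size=10000, max_len=30):
        super().__init__()
        self.max_len = max_len
        self.frame_proj = nn.Linear(d_in, d_h)  # project frame features to d_h
        enc_layer = nn.TransformerEncoderLayer(d_h, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_h, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_h)  # word embeddings
        self.lm_head = nn.Linear(d_h, vocab_size)   # predicts the next word

    def encode(self, frame_feats):
        # frame_feats: (B, N, d_in) features of N sampled frames, e.g.
        # extracted by an INP- or CLIP-pre-trained image encoder.
        return self.encoder(self.frame_proj(frame_feats))  # F: (B, N, d_h)

    def decode_step(self, memory, prev_tokens):
        # prev_tokens: (B, t) previously generated words; a causal mask
        # keeps the decoding autoregressive.
        tgt = self.embed(prev_tokens)
        t = prev_tokens.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden[:, -1])  # logits for the next word y_t

    @torch.no_grad()
    def generate(self, frame_feats, bos_id=1):
        memory = self.encode(frame_feats)  # Eq. (1): F = Encoder(V)
        tokens = torch.full((frame_feats.size(0), 1), bos_id, dtype=torch.long)
        for _ in range(self.max_len):  # greedy decoding, one word per step
            next_tok = self.decode_step(memory, tokens).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens


# Usage: two videos, 8 sampled frames each, 2048-d frame features.
model = CaptionModel()
caption_ids = model.generate(torch.randn(2, 8, 2048))
print(caption_ids.shape)  # torch.Size([2, 31])
```

Under this sketch, switching between INP and CLIP only changes which pre-trained image encoder produces `frame_feats`; the captioning network itself is unchanged, which is what makes the INP vs. CLIP comparison in Sect. 3 a controlled one.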