Consensus-Guided Keyword Targeting for Video Captioning

Puzhao Ji1, Bang Yang1,2, Tong Zhang2, and Yuexian Zou1,2(B)

1 ADSPLAB, School of ECE, Peking University, Shenzhen, China
zouyx@pku.edu.cn
2 Peng Cheng Laboratory, Shenzhen, China

Abstract. Mainstream video captioning models (VCMs) are trained in a fully supervised manner that relies heavily on large-scale, high-quality video-caption pairs. Unfortunately, an examination of the corpora of benchmark datasets shows that human-labeled annotations carry many defects, such as variation in caption length and quality for a single video and word imbalance across captions. Such defects may have a significant impact on model training. In this study, we propose to reduce the adverse impact of annotations and encourage VCMs to learn high-quality captions and more informative words via a Consensus-Guided Keyword Targeting (CGKT) training strategy. Specifically, CGKT first re-weights each training caption using a consensus-based metric, CIDEr. Second, CGKT assigns larger weights to informative, uncommonly used words based on their frequency. Extensive experiments on MSVD and MSR-VTT show that the proposed CGKT can easily work with three VCMs to achieve significant CIDEr improvements. Moreover, compared with the conventional cross-entropy objective, our CGKT facilitates the generation of more comprehensive and higher-quality captions.

Keywords: Video captioning · Annotation quality · Consensus guidance · Keyword targeting

1 Introduction

Video captioning (VC) aims to automatically describe video content with natural language sentences that accurately depict its key information, i.e., scenes, actions, and objects. In recent years, mainstream VC methods have followed the popular encoder-decoder framework in a fully supervised manner [1-3], in which the encoder is responsible for video comprehension whereas the decoder is responsible for description generation.

With the emergence of human-annotated VC datasets like MSVD [4] and MSR-VTT [5], many VC methods have been proposed, devoted mainly to model design [1,2,7-9] or multimodal fusion [3,10,11]. Since mainstream VCMs are trained in a fully supervised fashion, the quality of annotations has an important influence on their performance. However, the annotation quality of crowd-sourced VC datasets is rarely explored.

Fig. 1. Illustration of the annotation quality of the MSR-VTT dataset. In (a), we quantitatively measure the annotation quality of 130,260 captions in the training split with standard automatic metrics and find that the sentence quality of the training corpus is uneven, with more than 60.6% of sentences scoring below the average CIDEr; in (b), we mark problems in the annotations, including oversimplified sentence structure and content mismatch between sentence and video.

To better understand the annotation quality of VC datasets, we first quantify sentence-level annotation quality via standard automatic metrics, including BLEU@4 [12], METEOR [13], and ROUGE-L [14].
Given that multiple ground-truth (GT) captions are annotated for each video, we follow a leave-one-out procedure, i.e., treating one GT caption as the prediction and the rest as the references, to calculate per-sentence scores. As shown in Fig. 1(a), we quantify the quality of the 130,260 captions in the MSR-VTT training set and sort them by their CIDEr scores. We find that the sentence quality of the training corpus is uneven and that more than 60.6% of the sentences score below the average CIDEr. Figure 1(b) further reveals possible noise in the annotations of a specific MSR-VTT video, such as oversimplified sentence structure and content mismatch between sentence and video. Next, we turn our attention to word-level annotation quality. By analyzing word frequencies in the MSR-VTT corpus, we find that high-frequency words are common or functional words (e.g., "person" and "is"), whereas words that correspond to details of the video content are far less frequent (e.g., "frying" and "dough"). Such an imbalanced word distribution poses great challenges to model training [16].

All these findings suggest that existing VC datasets suffer from uneven sentence-level quality and word-level imbalance, which adversely affect the performance of VCMs. However, the widely adopted training objective, i.e., the cross-entropy loss, treats training samples equally at both the sentence and word levels regardless of their uneven quality. We therefore propose an improved training objective, named Consensus-Guided Keyword Targeting (CGKT), to address the uneven sentence quality and imbalanced word distribution in VC datasets. Specifically, CGKT comprises two parts: a Consensus-Guided Loss (CGL) and a Keyword Targeting Loss (KTL), which account for sentence- and word-level re-weighting, respectively. In our implementation, CGL takes CIDEr, a consensus-based captioning metric, as the criterion to encourage or penalize the learning of each training caption in the corpus. Focusing on word-level rewards, KTL attaches extra weight to uncommon words while leaving their common counterparts without additional emphasis. Notably, CGKT is a training algorithm that requires no modification to the model architecture, so it can be easily integrated into existing video captioning models. We also note that many methods have been developed in other fields to mitigate the negative impact of imbalance, e.g., re-sampling strategies [17,18] and gradient re-weighting [19]. Nevertheless, these methods are designed for unstructured class labels and do not perform as well as our proposed CGKT on the structured data (i.e., sentences) of the VC task, as we demonstrate later.

The contributions of this paper are three-fold: (1) We analyze the annotation quality of MSR-VTT and find that various problems exist in the data. (2) To train a better video captioning model, we propose an alternative training objective named Consensus-Guided Keyword Targeting (CGKT) that encourages the model to learn high-quality sentences and keywords. (3) Extensive experiments on the MSVD and MSR-VTT datasets show that various baselines trained with CGKT achieve up to an 8.2% improvement in CIDEr and describe more distinctive and accurate details of video content.
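For reference, the leave-one-out scoring used above can be reproduced with a few lines of code. The sketch below is illustrative only: it assumes the pycocoevalcap package for the CIDEr computation and a hypothetical corpus dictionary mapping each video ID to its list of tokenized GT captions.

```python
# Leave-one-out CIDEr scoring: each caption is treated as the prediction,
# the remaining captions of the same video as the references.
from pycocoevalcap.cider.cider import Cider

def leave_one_out_cider(corpus):
    """corpus: dict video_id -> list of tokenized caption strings (hypothetical format)."""
    candidates, references = {}, {}
    for vid, captions in corpus.items():
        if len(captions) < 2:            # need at least one reference left over
            continue
        for q, caption in enumerate(captions):
            key = f"{vid}#{q}"                                  # one entry per (video, caption)
            candidates[key] = [caption]                         # held-out caption as "prediction"
            references[key] = captions[:q] + captions[q + 1:]   # the rest as references
    # compute_score returns (corpus average, per-entry scores)
    _, per_caption = Cider().compute_score(references, candidates)
    return dict(zip(candidates.keys(), per_caption))
```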
2 Related Work

2.1 Video Captioning

Early VC works extract fixed content such as the verb, subject, and object, and then populate this content into a predefined template [6]. With fixed predefined templates and limited hand-crafted grammar rules, these methods can hardly generate flexible and accurate descriptions. Benefiting from the rise of deep neural networks, sequence-learning-based methods [1,24] that adopt an encoder-decoder framework are widely used to describe video content with flexibility. Recent VC works have been devoted to the design of the captioning model [7,8,24], including adding a soft attention mechanism to the encoder so that the model can focus on significant frames or regions, proposing a module that chooses keyframes as inputs to reduce redundant visual information, or using a reconstructor architecture to leverage the backward flow from sentence to video during caption generation. In addition, extra semantic information such as objects, syntactic parts, and latent topics has been used to assist caption generation [3,9,11]. However, all the above methods focus on multimodal feature modeling or syntactic structure reasoning and lack analysis of, and remedies for, the uneven quality and imbalance problems in video captioning datasets.

2.2 Video Captioning Datasets

MSVD and MSR-VTT are two datasets widely used in VC works. MSVD was proposed by Chen et al. [4] in 2011 and is the first open-domain VC dataset. Its authors used Amazon's Mechanical Turk (AMT) to collect and annotate the data. In the collection stage, crowd-sourced annotators were asked to find a clip on YouTube that contains a single, unambiguous event and submit its link. In the annotation stage, annotators were required to watch the video and write descriptive sentences in any language within 80 s; the final English annotations were then obtained through translation. However, the authors did not review or filter the annotated sentences, which left a large number of poor-quality sentences in the final annotations. Another mainstream VC dataset, MSR-VTT, was proposed by Xu et al. [5] in 2016; it also used AMT to collect and annotate data. In the collection stage, the authors searched 20 keywords on video websites to collect video data and asked annotators to find, within 20 s, the clip in each video closest to the keywords. MSR-VTT likewise did not review or filter the sentences written by crowd-sourced annotators. These two mainstream datasets serve as benchmarks for current VC works [3,9,10,25], but no effort has yet been made to identify and analyze their quality issues.

3 Method

3.1 Encoder-Decoder Framework

As shown in Fig. 2, our model follows the encoder-decoder framework and is supervised by the CGKT loss. In the encoding phase, the video V is sampled into frames, which are fed into 2D-CNN and 3D-CNN models to obtain the RGB feature F_r and the motion feature F_m. These multi-modal features are passed to the multi-modality fusion module to obtain the decoder input F. In the decoding stage, the decoder generates the description Ŷ = {ŷ_1, ŷ_2, ..., ŷ_T}, where T is the length of the caption and θ denotes the parameters to be learned.

Fig. 2. The framework of our Consensus-Guided Keyword Targeting captioning model. Human annotations in green and red indicate high-quality and poor labels, respectively. With the CGKT loss, our model focuses on higher-quality sentences and more representative words. (Color figure online)
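To make the framework in Fig. 2 concrete, the following is a minimal PyTorch sketch of the encoder-decoder skeleton with a concatenate-and-project fusion module and a single-layer LSTM decoder. It assumes pre-extracted clip-level RGB and motion features; the class name, default dimensions, and teacher-forcing interface are illustrative choices rather than the authors' released code.

```python
import torch
import torch.nn as nn

class FusionCaptioner(nn.Module):
    """Encoder-decoder skeleton: fuse RGB/motion features, decode words with an LSTM."""
    def __init__(self, rgb_dim=2048, motion_dim=1536, hidden=512, embed=300, vocab=10000):
        super().__init__()
        self.fusion = nn.Linear(rgb_dim + motion_dim, hidden)   # multi-modality fusion module
        self.embedding = nn.Embedding(vocab, embed)
        self.decoder = nn.LSTM(embed + hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab)

    def forward(self, feat_rgb, feat_motion, captions):
        # feat_rgb: (B, rgb_dim), feat_motion: (B, motion_dim) pooled clip features
        fused = torch.tanh(self.fusion(torch.cat([feat_rgb, feat_motion], dim=-1)))
        words = self.embedding(captions[:, :-1])                 # teacher forcing on GT words
        ctx = fused.unsqueeze(1).expand(-1, words.size(1), -1)   # repeat video context per step
        hidden_states, _ = self.decoder(torch.cat([words, ctx], dim=-1))
        return self.classifier(hidden_states)                    # (B, T-1, vocab) logits
```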
3.2 Consensus-Guided Loss

We introduce the Consensus-Guided Loss (CGL) to mitigate the uneven quality problem at the sentence level. Existing video captioning methods treat annotations of different quality equally, which leads to noisy and biased supervisory signals and weak generalization. To mitigate this problem, we adopt an annotation quality score, namely the CIDEr score, as a weight that guides the model to learn more from higher-quality annotations.

We denote the caption annotations of the i-th video clip as Y^i = {Y_1^i, Y_2^i, ..., Y_Q^i}, where Q is the number of captions. Before training, we take E_q^i, the set of captions in Y^i excluding Y_q^i, as references to measure the quality score G(Y_q^i) of caption Y_q^i. Any video captioning metric can be used to compute the consensus score; we use CIDEr [15] as the quality score in CGL.

The n-grams of Y_q^i are denoted as ω^{iq} = {ω_1^{iq}, ω_2^{iq}, ..., ω_M^{iq}}, where M is the number of n-grams in Y_q^i. To compute the consensus score of an annotation, we first compute the Term Frequency-Inverse Document Frequency (TF-IDF) weight of each n-gram in Y_q^i:

t(\omega_m^{iq}) = \frac{h(\omega_m^{iq})}{\sum_{\omega_j^{iq} \in \omega^{iq}} h(\omega_j^{iq})} \log\!\left( \frac{|\mathcal{V}|}{\sum_{V_p \in \mathcal{V}} \min\big(1, \sum_l h_l^p(\omega_m^{iq})\big)} \right)   (1)

where h(ω_m^{iq}) is the number of occurrences of ω_m^{iq} in the reference caption Y_q^i, \mathcal{V} is the set of all videos, and h_l^p(ω_m^{iq}) is the number of times the n-gram ω_m^{iq} appears in caption Y_l^p. Using the average cosine similarity between the target caption and the reference captions, we obtain the score for n-grams of length n:

\mathrm{CIDEr}_n(Y_q^i, E_q^i) = \frac{1}{Q-1} \sum_{j \neq q} \frac{t^n(Y_q^i) \cdot t^n(Y_j^i)}{\lVert t^n(Y_q^i)\rVert\, \lVert t^n(Y_j^i)\rVert}   (2)

where t^n(Y_q^i) is the weighting vector formed by all n-grams of length n in Y_q^i. We then sum the scores over n up to N = 4 to obtain the final consensus score of the annotation:

G(Y_q^i) = \sum_{n=1}^{N} \mathrm{CIDEr}_n(Y_q^i, E_q^i)   (3)

After re-weighting every annotation with its consensus score G, we form the following Consensus-Guided Loss (CGL):

L_{cg}(Y_q^i, \hat{Y}) = -G(Y_q^i)\, \frac{1}{T} \sum_{t=1}^{T} \log p_\theta(\hat{y}_t = y_t^{iq} \mid \hat{y}_{1:t-1}, F)   (4)

3.3 Keyword Targeting Loss

The Keyword Targeting Loss (KTL) is designed to alleviate the imbalance problem at the word level. The traditional video captioning objective treats all words equally, which makes captioning models insensitive to words that are less frequent but carry more precise information. To address this, KTL oversamples these less frequent words and makes the model focus on learning from them. We count the frequency of each word in the dataset and, for each annotation, select the K words with the lowest frequency as keywords. The keywords of caption Y_q^i are denoted as KW^{iq} = {y_{t_1}^{iq}, y_{t_2}^{iq}, ..., y_{t_K}^{iq}}, and the Keyword Targeting Loss is

L_{kt}(KW^{iq}, \hat{Y}) = -\frac{\beta}{K} \sum_{j=t_1}^{t_K} \log p_\theta(\hat{y}_j = y_j^{iq} \mid \hat{y}_{1:j-1}, F)   (5)

where the value of β is 0.02 times the current epoch number during training.

3.4 Consensus-Guided Keyword Targeting Captioning Model

We combine the CGL L_{cg} and the KTL L_{kt} to obtain our Consensus-Guided Keyword Targeting loss:

L = L_{kt} + L_{cg}   (6)
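The objective in Eqs. (4)-(6) amounts to a re-weighted token-level cross-entropy and can be implemented compactly. The PyTorch sketch below assumes the per-caption CIDEr weights computed earlier and 0/1 masks marking keyword and non-padding positions; the helper names, tensor layout, and batch-mean reduction are our own assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def select_keywords(caption_tokens, word_freq, K=5):
    """Pick the K least frequent words of a caption as its keywords (Sect. 3.3)."""
    return sorted(caption_tokens, key=lambda w: word_freq.get(w, 0))[:K]

def cgkt_loss(logits, targets, consensus, keyword_mask, pad_mask, epoch, K=5):
    """
    logits: (B, T, V) decoder outputs; targets: (B, T) ground-truth word ids
    consensus: (B,) leave-one-out CIDEr weights G(Y); epoch: current epoch (for beta)
    keyword_mask / pad_mask: (B, T) 0/1 float masks for keyword / non-pad positions
    """
    log_p = F.log_softmax(logits, dim=-1)
    token_nll = -log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)     # -log p(y_t)

    lengths = pad_mask.sum(dim=1).clamp(min=1.0)
    cgl = consensus * (token_nll * pad_mask).sum(dim=1) / lengths        # Eq. (4)

    beta = 0.02 * epoch
    ktl = beta * (token_nll * keyword_mask).sum(dim=1) / K               # Eq. (5)

    return (cgl + ktl).mean()                                            # Eq. (6)
```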
A CGKT captioning model is formed simply by replacing the training objective with the CGKT loss, without changing the network structure or re-extracting features. We implement three CGKT captioning models using the architectures of S2VT [1], AttLSTM [24], and Semantic [20].

Given the RGB feature F_r and the motion feature F_m, the S2VT model concatenates them and uses a single fully connected layer as the multi-modality fusion module; its decoder is a single-layer LSTM. Here p_\theta^s is the distribution over the vocabulary produced by the S2VT model, and W_s and b_s are learnable parameters:

p_\theta^s(\hat{y}_t) = \mathrm{LSTM}(W_s [F_r; F_m] + b_s, h_{t-1}^s)   (7)

For the Semantic model, we use the architecture in [20] as the multi-modality fusion module and only replace the objective with the CGKT loss. AttLSTM also uses a single-layer LSTM decoder but fuses the multi-modal features with an additive attention module at each decoding time step. Here p_\theta^a is the distribution over the vocabulary produced by the AttLSTM model, and v_a, W_{a1}, and W_{a2} are learnable parameters:

p_\theta^a(\hat{y}_t) = \mathrm{LSTM}(v_a^\top \tanh(W_{a1} \hat{y}_{t-1} + W_{a2} [F_r; F_m]), h_{t-1}^a)   (8)

4 Experiments

4.1 Datasets and Metrics

We experiment on two widely used video captioning datasets: MSVD [4] and MSR-VTT [5]. Following convention [3,9,10], MSVD is split into 1,200, 100, and 670 videos for training, validation, and testing. Following the official instructions [5], we split MSR-VTT into 6,513/497/2,990 videos for training/validation/testing. To evaluate the performance of our method, we employ four commonly used metrics: BLEU@4 [12], METEOR [13], ROUGE-L [14], and CIDEr [15].

4.2 Implementation Details

We uniformly sample 32 frames from each video and feed them to a ResNeXt [22] model pre-trained on the ImageNet ILSVRC2012 dataset, taking the output of its last convolutional layer as the 2048-dimensional RGB feature. The 1536-dimensional dynamic temporal feature is extracted by an ECO [21] model pre-trained on the Kinetics400 dataset. For the MSR-VTT dataset, we apply the pre-trained GloVe word embeddings to obtain 300-dimensional vectors for the 20 video categories. The hidden size of all LSTMs is 512, and the embedding layer dimension is 300. We employ Adam as the optimizer. Following the setting in [20], the initial learning rate is 1e-4 and is decayed by a factor of 0.316 every 20 epochs for MSR-VTT. The number of keywords is set to 5. The batch size is 64 for both MSR-VTT and MSVD. We select the model with the best performance on the validation set for testing. S2VT and AttLSTM use beam search with a beam size of 4 for caption generation; following [20], the Semantic baseline does not use beam search.
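For completeness, a minimal sketch of this optimization schedule is given below; it reuses the cgkt_loss sketch from Sect. 3.4 and assumes a hypothetical data loader whose batches carry the pre-computed features, caption tensors, CIDEr weights, and masks, as well as an assumed total epoch budget.

```python
import torch

def train_with_cgkt(model, train_loader, num_epochs=50):
    """Training loop matching the schedule above (Adam, lr 1e-4, x0.316 every 20 epochs)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.316)
    for epoch in range(1, num_epochs + 1):
        for batch in train_loader:   # hypothetical batches of precomputed features and targets
            logits = model(batch["rgb"], batch["motion"], batch["caption"])
            loss = cgkt_loss(logits, batch["caption"][:, 1:], batch["cider"],
                             batch["keyword_mask"], batch["pad_mask"], epoch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```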
Table 1. Performance comparisons with different models and baselines on the test set of MSR-VTT in terms of BLEU@4, METEOR, ROUGE-L, and CIDEr. CE, Res152, IRV2, I3D, Ca, obj, C3D, and RL denote cross-entropy loss, 152-layer ResNet, Inception-ResNet-V2, Inflated 3D, category embedding vector, object features detected by Faster R-CNN, 3D ConvNets, and reinforcement-learning training strategies, respectively.

| Method           | Feature        | RL | BLEU@4 | METEOR | ROUGE_L | CIDEr |
|------------------|----------------|----|--------|--------|---------|-------|
| PickNet [7]      | Res152+Ca      | √  | 38.9   | 27.2   | 59.5    | 42.1  |
| RecNet [8]       | InceptionV4    |    | 39.1   | 26.6   | 59.3    | 42.7  |
| POS [10]         | IRV2+I3D+Ca    | √  | 41.3   | 28.7   | 62.1    | 53.4  |
| RMN [9]          | IRV2+I3D+obj   |    | 42.5   | 28.4   | 61.6    | 49.6  |
| SAAT [3]         | IRV2+C3D+Ca    | √  | 39.9   | 27.7   | 61.2    | 51.0  |
| HMN [25]         | IRV2+C3D+obj   |    | 43.5   | 29.0   | 62.7    | 51.5  |
| S2VT w/ CE       | ResNeXt+ECO+Ca |    | 40.8   | 27.6   | 59.7    | 49.8  |
| S2VT w/ CGKT     | ResNeXt+ECO+Ca |    | 41.4   | 27.7   | 60.2    | 52.0  |
| AttLSTM w/ CE    | ResNeXt+ECO+Ca |    | 40.6   | 27.7   | 59.6    | 49.6  |
| AttLSTM w/ CGKT  | ResNeXt+ECO+Ca |    | 41.2   | 27.6   | 60.3    | 51.7  |
| Semantic w/ CE   | ResNeXt+ECO+Ca |    | 41.3   | 27.0   | 60.0    | 49.3  |
| Semantic w/ CGKT | ResNeXt+ECO+Ca |    | 41.0   | 27.1   | 60.1    | 50.8  |

4.3 Quantitative Results

Comparison with the State of the Art. We compare our method with state-of-the-art models on the MSR-VTT dataset, including PickNet [7], RecNet [8], POS [10], RMN [9], SAAT [3], and HMN [25]. We adopt S2VT [1], AttLSTM [24], and Semantic [20] trained with cross-entropy loss as baselines. As shown in Table 1, the CIDEr scores of the three baselines improve by 4.4%, 4.2%, and 3.0% with the CGKT loss. Comparing the results obtained with cross-entropy and with CGKT as the training objective, we find that CGKT does alleviate the problem of uneven-quality annotations in MSR-VTT. Moreover, CGKT brings the results of the S2VT, AttLSTM, and Semantic models close to the state of the art. We also compare performance on the test set of MSVD. As shown in Table 2(a), our method outperforms the three baselines on BLEU@4, METEOR, ROUGE-L, and CIDEr, achieving improvements of 8.2%, 1.3%, and 4.0% in CIDEr for the three baselines.

Table 2. More performance comparison results. CE, FL, LS, and ΔC denote cross-entropy loss, focal loss, label smoothing, and relative improvement on CIDEr, respectively.

(a) Performance comparisons with different baselines on MSVD.

| Method           | BLEU@4 | METEOR | ROUGE_L | CIDEr | ΔC    |
|------------------|--------|--------|---------|-------|-------|
| S2VT w/ CE       | 45.8   | 35.6   | 70.1    | 79.2  |       |
| S2VT w/ CGKT     | 49.6   | 36.6   | 71.1    | 85.7  | +8.2% |
| AttLSTM w/ CE    | 49.5   | 36.5   | 71.1    | 88.2  |       |
| AttLSTM w/ CGKT  | 50.6   | 36.7   | 71.1    | 89.4  | +1.3% |
| Semantic w/ CE   | 51.0   | 37.1   | 71.2    | 89.4  |       |
| Semantic w/ CGKT | 51.1   | 37.4   | 71.7    | 93.0  | +4.0% |

(b) Performance comparisons with different loss functions on MSR-VTT.

| Method       | BLEU@4 | METEOR | ROUGE_L | CIDEr | ΔC    |
|--------------|--------|--------|---------|-------|-------|
| S2VT w/ CE   | 40.8   | 27.6   | 59.7    | 49.8  |       |
| S2VT w/ FL   | 40.7   | 27.8   | 59.9    | 50.6  | +1.6% |
| S2VT w/ LS   | 41.6   | 27.9   | 60.4    | 50.8  | +2.0% |
| S2VT w/ CGKT | 41.4   | 27.7   | 60.2    | 52.0  | +4.4% |

To further demonstrate the effectiveness of our method, we compare the CGKT loss with focal loss [19] and label smoothing [23]. Following the default settings in [19,23], we set α and γ in the focal loss to 0.25 and 2, and the smoothing parameter to 0.2. As shown in Table 2(b), CGKT outperforms the other methods on CIDEr. This is reasonable because our method mitigates the sentence-level uneven-quality issue while also focusing on learning from more representative words, whereas the other methods operate only at the word level.

Fig. 3. Performance comparisons between KTL and CGL for the three baselines on the test set of MSR-VTT.

Ablation Study. To investigate the effectiveness of KTL and CGL, we perform ablation experiments on the three models. As shown in Fig. 3, both KTL and CGL improve the baseline results, except for the Semantic model. We attribute this to the fact that Semantic [20] can already identify words carrying more precise semantic information through its Semantic Detection Network, so the extra KTL interferes with it. The combination of CGL and KTL does not further improve performance on AttLSTM; we believe the supervisory signal that combines the sentence and word levels makes it harder for the additive attention to learn effective attention weights. We also study the effects of different quality scores and keyword numbers. As shown in Table 3, using CIDEr as the quality score is the best choice, while 4 keywords achieve a better CIDEr than 5; since the average sentence length in MSR-VTT is 9.2 words, 5 keywords are presumably somewhat redundant.
Table 3. Performance of the ablated KTL and CGL with various quality scores and keyword numbers on MSR-VTT. B, M, R, and C denote BLEU@4, METEOR, ROUGE-L, and CIDEr used as the quality score. K@N denotes a keyword number of N.

| Ablation      | Method      | BLEU@4 | METEOR | ROUGE_L | CIDEr |
|---------------|-------------|--------|--------|---------|-------|
| Quality score | S2VT w/ B   | 39.8   | 26.8   | 58.9    | 46.2  |
|               | S2VT w/ M   | 41.2   | 27.6   | 60.1    | 50.4  |
|               | S2VT w/ R   | 41.7   | 27.7   | 60.2    | 51.0  |
|               | S2VT w/ C   | 41.3   | 27.9   | 60.2    | 51.2  |
| Keywords num  | S2VT w/ K@3 | 41.5   | 27.3   | 60.4    | 50.7  |
|               | S2VT w/ K@4 | 41.8   | 27.9   | 60.5    | 51.6  |
|               | S2VT w/ K@5 | 41.3   | 27.7   | 60.0    | 50.7  |
|               | S2VT w/ K@6 | 41.7   | 28.0   | 60.6    | 51.4  |

4.4 Qualitative Results

In this section, we investigate what content CGKT encourages the model to learn from the video. We provide several examples of video captioning results in Fig. 4. Comparing the captions generated by S2VT trained with cross-entropy loss and with our CGKT loss, we can see that the captions produced with our method have a more complete structure and contain more representative words. For example, for the upper-left video in Fig. 4, the model trained with CGKT generates "knocking on the wall", which is closer to the content of the video.

Fig. 4. Qualitative comparison of S2VT using cross-entropy and our CGKT as training objectives on the test sets of the MSVD and MSR-VTT datasets.

5 Conclusion

In this paper, we propose Consensus-Guided Keyword Targeting (CGKT) to alleviate the uneven sentence quality and imbalanced word distribution in video captioning datasets. Our approach re-weights training samples according to consensus-based sentence scores and word frequencies, thereby encouraging the captioning model to learn high-quality sentences and keywords. Experiments on MSVD and MSR-VTT demonstrate the effectiveness and versatility of CGKT, which can be easily integrated into various video captioning models to bring consistent improvements (especially on the CIDEr metric) and produce more accurate and detailed descriptions.

Acknowledgements. This paper was partially supported by NSFC (No. 62176008) and the Shenzhen Science and Technology Research Program (No. GXWD20201231165807007-20200814115301001).

References

1. Darrell, T., Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Saenko, K.: Sequence to sequence - video to text. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4534–4542 (2015)
2. Zhang, Y., Xu, J., Yao, T., Mei, T.: Learning multimodal attention LSTM networks for video captioning. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 537–545 (2017)
3. Wang, C., Zheng, Q., Tao, D.: Syntax-aware action targeting for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 13093–13102 (2020)
4. Malkarnenkar, G., et al.: YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2712–2719 (2013)
5. Yao, T., Xu, J., Mei, T., Rui, Y.: MSR-VTT: a large video description dataset for bridging video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5288–5296 (2016)
6. Kojima, A., Tamura, T., Fukunaga, K.: Natural language description of human activities from video images based on concept hierarchy of actions. Int. J. Comput. Vis. 50, 171–184 (2002). https://doi.org/10.1023/A:1020346032608
7. Chen, Y., Wang, S., Zhang, W., Huang, Q.: Less is more: picking informative frames for video captioning. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 367–384. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_22
8. Zhang, W., Wang, B., Ma, L., Liu, W.: Reconstruction network for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7622–7631 (2018)
9. Wang, M., Tan, G., Liu, D., Zha, Z.: Learning to discretely compose reasoning module networks for video captioning. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 745–752 (2020)
10. Zhang, W., Jiang, W., Wang, J., Wang, B., Ma, L., Liu, W.: Controllable video captioning with POS sequence guidance based on gated fusion network. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2641–2650 (2019)
11. Jin, Q., Chen, S., Chen, J., Hauptmann, A.: Video captioning with guidance of multimodal latent topics. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 1838–1846 (2017)
12. Ward, T., Papineni, K., Roukos, S., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
13. Denkowski, M.J., Lavie, A.: Meteor Universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)
14. Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Proceedings of the ACL Workshop: Text Summarization Branches Out, p. 10 (2004)
15. Zitnick, C.L., Vedantam, R., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
16. Johnson, J.M., Khoshgoftaar, T.M.: Survey on deep learning with class imbalance. J. Big Data 6(1), 1–54 (2019). https://doi.org/10.1186/s40537-019-0192-5
17. Shi, J., Feng, H., Ouyang, W., Pang, J., Chen, K., Lin, D.: Libra R-CNN: towards balanced learning for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 821–830 (2020)
18. Li, Y., Vasconcelos, N.: REPAIR: removing representation bias by dataset resampling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9564–9573 (2019)
19. Girshick, R.B., He, K., Lin, T., Goyal, P., Dollár, P.: Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42, 318–327 (2020)
20. Maye, A., Li, J., Chen, H., Ke, L., Hu, X.: A semantics assisted video captioning model trained with scheduled sampling. Front. Robot. AI 7, 475767 (2020)
21. Zolfaghari, M., Singh, K., Brox, T.: ECO: efficient convolutional network for online video understanding. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 713–730. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_43
22. Xie, S., Girshick, R.B., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5987–5995 (2017)
23. Ioffe, S., Shlens, J., Szegedy, C., Vanhoucke, V., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
24. Yao, L., et al.: Describing videos by exploiting temporal structure. In: International Conference on Computer Vision, pp. 4507–4515 (2015)
25. Ye, H., Li, G., Qi, Y., Wang, S., Huang, Q., Yang, M.: Hierarchical modular network for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2022)