Cross-modality Discrepant Interaction Network for RGB-D Salient Object Detection

Chen Zhang1,2, Runmin Cong1,2,4,∗, Qinwei Lin1, Lin Ma3, Feng Li1,2, Yao Zhao1,2, Sam Kwong4
1 Institute of Information Science, Beijing Jiaotong University, Beijing, China
2 Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China
3 Meituan, Beijing, China
4 City University of Hong Kong, Hong Kong, China
{chen.zhang,rmcong,qinweilin,l1feng,yzhao}@bjtu.edu.cn, forest.linma@gmail.com, cssamk@cityu.edu.hk
∗ Corresponding author.

ABSTRACT
The popularity and promotion of depth maps have brought new vigor and vitality into salient object detection (SOD), and a mass of RGB-D SOD algorithms have been proposed, mainly concentrating on how to better integrate cross-modality features from the RGB image and the depth map. For the cross-modality interaction in the feature encoder, existing methods either treat the RGB and depth modalities indiscriminately, or habitually utilize depth cues only as auxiliary information for the RGB branch. Different from them, we reconsider the status of the two modalities and propose a novel Cross-modality Discrepant Interaction Network (CDINet) for RGB-D SOD, which differentially models the dependence of the two modalities according to the feature representations of different layers. To this end, two components are designed to implement effective cross-modality interaction: 1) the RGB-induced Detail Enhancement (RDE) module leverages the RGB modality to enhance the details of the depth features in the low-level encoder stage; 2) the Depth-induced Semantic Enhancement (DSE) module transfers the object positioning and internal consistency of depth features to the RGB branch in the high-level encoder stage. Furthermore, we also design a Dense Decoding Reconstruction (DDR) structure, which constructs a semantic block by combining multi-level encoder features to upgrade the skip connection in the feature decoding. Extensive experiments on five benchmark datasets demonstrate that our network outperforms 15 state-of-the-art methods both quantitatively and qualitatively. Our code is publicly available at: https://rmcong.github.io/proj_CDINet.html.

CCS CONCEPTS
• Computing methodologies → Interest point and salient region detections.

KEYWORDS
Salient object detection, RGB-D images, discrepant interaction, dense decoding reconstruction

ACM Reference Format:
Chen Zhang, Runmin Cong, Qinwei Lin, Lin Ma, Feng Li, Yao Zhao, Sam Kwong. 2021. Cross-modality Discrepant Interaction Network for RGB-D Salient Object Detection. In Proceedings of the 29th ACM International Conference on Multimedia (MM ’21), October 20–24, 2021, Virtual Event, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3474085.3475364

Figure 1: Mode (a) is a unidirectional interaction which regards the depth stream as auxiliary information; Mode (b) represents an undifferentiated bidirectional interaction, and it treats the two modalities as equal; Mode (c) is the proposed mode which conducts discrepant cross-modality guidance. In (d), ATSA [39], ASIF [25] and Ours correspond to the methods of modes (a), (b) and (c) in a difficult scene, respectively.

1 INTRODUCTION
Inspired by the human visual attention mechanism, salient object detection (SOD) aims to detect the most attractive objects or regions in a given scene, and it has been successfully applied to abundant tasks [1, 6, 8, 11, 42].
In fact, in addition to the color appearance, texture detail, and physical size, people can also perceive the depth of field, thereby generating stereo perception through the binocular vision system. In recent years, thanks to the rapid development of consumer depth cameras such as the Microsoft Kinect, we are able to conveniently acquire a depth map to depict a scene. Compared with the RGB image, which provides rich color and texture information, the depth map can exhibit geometric structure, internal consistency, and illumination invariance. With the help of the depth map, an SOD model can better cope with some challenging scenes, such as low contrast and complex backgrounds. Therefore, for the past few years, the research on salient object detection for RGB-D images has received widespread attention.

As we all know, the RGB image and the depth map belong to different modalities, thus we need some sophisticated designs to make better use of the advantages of both to achieve RGB-D SOD. Therefore, the pivotal and hot issue in RGB-D SOD is how to better integrate cross-modality features. The limited expression ability of traditional models based on hand-crafted features [7, 9, 10, 12, 16, 22, 31] makes their performance always unsatisfactory, especially in complex scenes.
With the popularity of deep learning in recent years, plenty of powerful cross-modality integration methods based on convolutional neural networks (CNNs) have been proposed [4, 18, 21, 26, 28, 30, 40, 43]. For the information interaction between the RGB and depth modalities in the feature encoder, the existing mainstream methods can be roughly divided into two categories according to their interaction directions: (i) the unidirectional interaction mode shown in Figure 1(a), which uses the depth cues as auxiliary information to supplement the RGB branch [4, 28, 33]; (ii) the undifferentiated bidirectional interaction shown in Figure 1(b), which treats RGB and depth cues equally to achieve cross-modality interaction [18, 21, 30]. However, in this paper, we raise a new question: since these two modalities have their own strong points, can we design a discrepant interaction mode for RGB-D SOD based on their roles to make full use of the advantages of both? Observing Figure 1(d), we may be able to find the answer: (1) The depth map has relatively distinct details (such as boundaries) for describing the salient objects, which is beneficial for straightforwardly learning effective saliency-oriented features. However, in some special scenes, the depth map cannot distinguish different object instances at the same depth level only by virtue of its own characteristics, such as the bottle and fingers in Figure 1(d). At this point, the RGB branch can use its rich appearance detail information to enhance the depth feature learning. (2) More affluent semantic information can be extracted from the RGB image than from the depth map, but complex background interference or illumination variation may cause the detected salient objects to be flawed. By contrast, the depth features can provide better guidance for salient object positioning and internal consistency, thereby enhancing the semantic representation of the RGB modality.

Based on the above observations, we propose a cross-modality discrepant interaction network (CDINet) for RGB-D salient object detection, which clearly models the dependence of the two modalities according to the manifestations of features in different layers. Specifically, we first employ an RGB-induced detail enhancement (RDE) module in the first two layers of the encoding network, which supplements more detailed information to enhance the low-level depth features. Then, a depth-induced semantic enhancement (DSE) module is designed for high-level feature encoding, which utilizes the saliency positioning and internal consistency of high-level depth cues to enhance the RGB semantics. Thanks to this differentiated interaction mode, the RGB and depth branches can complement each other, give full play to their respective advantages, and finally generate more accurate semantic representations.

In addition, for an encoder-decoder network, with the deepening of the convolution process we can obtain a global semantic representation in the encoding stage, but some spatial details will be lost; therefore, only utilizing the supervision of the ground truth in the decoder stage cannot achieve a perfect reconstruction result. In order to highlight and restore the spatial domain information in the feature decoding, the existing SOD models introduce the encoder features through skip connections [28, 30]. However, they only introduce the information of the corresponding encoder layer through a direct addition or concatenation operation, which does not make full use of the encoder features of different layers. To tackle this problem, we propose a dense decoding reconstruction (DDR) structure, which generates a semantic block by densely connecting the higher-level encoding features to provide more comprehensive semantic guidance for the skip connection in the feature decoding. Furthermore, we can obtain a more accurate and complete saliency prediction.

As can be seen from the visualization results in Figure 1(d), the ATSA [39] method (unidirectional interaction) cannot well suppress the interference caused by the depth map, while the ASIF [25] method (undifferentiated bidirectional interaction) fails to completely detect the salient object. By contrast, our proposed CDINet can accurately detect the salient object with a complete structure and a clean background. The main contributions of this paper are summarized as follows:

• We propose an end-to-end Cross-modality Discrepant Interaction Network (CDINet), which differentially models the dependence of the two modalities according to the feature representations of different layers. Our network achieves competitive performance against 15 state-of-the-art methods on 5 RGB-D SOD datasets. Moreover, the inference speed for an image reaches 42 FPS.
• We design an RGB-induced Detail Enhancement (RDE) module to transfer detail supplement information from the RGB modality to the depth modality in the low-level encoder stage, and a Depth-induced Semantic Enhancement (DSE) module to assist the RGB branch in capturing clearer and fine-grained semantic attributes by utilizing the positioning accuracy and internal consistency of high-level depth features.

• We design a Dense Decoding Reconstruction (DDR) structure, which generates a semantic block by leveraging multiple high-level encoder features to upgrade the skip connection in the feature decoding.

2 RELATED WORK
RGB-D Salient Object Detection. In recent years, a large number of CNN-based RGB-D SOD models have shown extraordinary performance, and these works focus on how to design better integration strategies. However, most researchers are accustomed to treating the two modalities as equal, which we call the undifferentiated bidirectional interaction mode. For example, Fu et al. [18] introduced a siamese network for joint learning and designed a densely-cooperative fusion strategy to discover complementary features. Pang et al. [30] integrated the cross-modal features through a densely connected structure, and then established a hierarchical dynamic filtering network by using the fused features. Huang et al. [21] proposed a cross-modal refinement module to integrate cross-modal features, and then a multi-level fusion module was designed to fuse the features of each level following a bottom-up pathway. In addition, some methods take the depth map as auxiliary information for the RGB branch, forming the unidirectional interaction mode. Piao et al. [33] proposed a depth distiller to transfer the depth knowledge from the depth stream to the RGB stream. Liu et al. [28] designed a residual fusion module to integrate the depth decoding features into the RGB branch in the decoding stage. Chen et al. [4] considered that the depth map contains much less information than the RGB image, and then proposed a lightweight network to extract depth stream features. Different from the above methods, we reconsider the status of the RGB modality and the depth modality in the RGB-D SOD task, and propose a discrepant interaction structure to achieve more oriented and differentiated cross-modality interaction.

Attention Mechanism. As a form of efficient resource allocation, the attention mechanism has been widely applied to plenty of computer vision tasks. The spatial attention mechanism [38] makes the network pay attention to the area of interest. The channel attention mechanism [20] learns the importance of each channel. The self-attention mechanism [36] captures long-distance dependency relationships. There are also several works that develop mixed attention mechanisms, such as CBAM [37] and dual-attention [17]. In this paper, we employ spatial-wise and channel-wise attention in the RDE and DSE modules. Moreover, we focus more on the cross-modality application of attention, that is, the attention map generated by one modality is utilized to enhance the features of the other modality, so as to achieve more effective cross-modality guidance in the form of attention.

Skip Connection. Long-range skip connection is a measure to recover image details in pixel-level prediction tasks, and it has been equipped with almost all RGB-D SOD models.
For models where cross-modal interaction occurs in the encoder, the skip connection is presented as a direct feature-wise addition or concatenation, e.g., [28] and [30]. For other networks that fuse cross-modality features in the decoder, proprietary modules are often designed to incorporate the skip features (also known as side outputs). For example, Li et al. [26] proposed a cross-modality modulation and selection block to fuse side outputs in a coarse-to-fine way. Piao et al. [32] designed a depth refinement block to integrate complementary multi-level RGB and depth features. In this work, we boost performance with a simple but effective decoding structure that densely connects the higher-level encoder features to conduct the skip connection.

3 PROPOSED METHOD
3.1 Overview
Motivation. Previous studies have confirmed the positive effect of RGB and depth information interaction in SOD tasks [25, 26, 28]. In this paper, we seriously reconsider the status of the RGB modality and the depth modality in the RGB-D SOD task. Different from the previous unidirectional interaction shown in Figure 1(a) and the undifferentiated bidirectional interaction shown in Figure 1(b), we believe that the interaction of the two modalities should be carried out in a separate and discrepant manner. The low-level RGB features can help the depth features to distinguish different object instances at the same depth level, while the high-level depth features can further enrich the RGB semantics and suppress background interference. Therefore, a perfect RGB-D SOD model should give full play to the advantages of each modality, and simultaneously utilize the other modality to make up for its own weaknesses without causing interference. To this end, we propose a cross-modality discrepant interaction network to explicitly model the dependence of the two modalities in the encoder stage according to the feature representations of different layers, which selectively utilizes RGB features to supplement the details for the depth branch, and transfers the depth features to the RGB modality to enrich the semantic representations.

Architecture. Figure 2 illustrates the overall architecture of the proposed CDINet for the RGB-D SOD task. It is composed of three parts, i.e., the RGB-induced detail enhancement (RDE) module, the depth-induced semantic enhancement (DSE) module, and the dense decoding reconstruction (DDR) structure. On the whole, the network follows an encoder-decoder architecture, including two encoders for the RGB and depth modalities and one decoder. The two encoders both adopt the VGG16 [35] network, discarding the last pooling and fully-connected layers, as the backbone to extract the corresponding multi-level feature representations and achieve cross-modality information interaction. The extracted RGB and depth features from the backbone are denoted as f_r^i and f_d^i respectively, where r and d represent the RGB and depth branches, and i ∈ {1, 2, ..., 5} indexes the feature level. Specifically, in the low-level feature encoding stage (i.e., the first two layers of the backbone), we design an RDE module to transfer detail supplement information from the RGB modality to the depth modality, thereby enhancing the distinguishability of the depth features. For the high-level encoding features, the DSE module utilizes the positioning accuracy and internal consistency of depth features to assist the RGB branch in capturing clearer and fine-grained semantic attributes, thereby promoting object structure and background suppression. Besides, for the convolution-upsample decoding infrastructure, we upgrade the traditional skip connection by constructing a DDR structure, that is, utilizing higher-level skip connection features as guidance information to achieve more effective encoder information transmission. The prediction result generated by the last convolutional layer of the decoder is used as the final saliency output.

Figure 2: The overall pipeline of the proposed CDINet. Our CDINet follows an encoder-decoder architecture, which realizes the discrepant interaction and guidance of cross-modality information in the encoding stage. The framework mainly consists of three parts: 1) the RGB-induced detail enhancement module, which achieves depth feature enhancement by transmitting the detailed supplementary information of the RGB modality to the depth modality; 2) the depth-induced semantic enhancement module, in which depth features provide better positioning and internal consistency to enrich the semantic information of RGB features; 3) the dense decoding reconstruction structure, which densely encodes the encoder features of different layers to generate more valuable skip connection information, as shown in the top right box marked B of this figure. The backbone of our network in this figure is VGG16 [35], and the overall network can be trained efficiently as an end-to-end system.
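To make the data flow described above concrete, the following minimal PyTorch sketch shows one possible way to organize the two VGG16 encoders and the discrepant interaction. It is our illustration under stated assumptions, not the authors' released implementation: the exact stage slicing, the names DiscrepantEncoder and vgg16_stages, and the rde_modules/dse_modules arguments (standing for the RDE and DSE modules of Secs. 3.2 and 3.3) are ours.

```python
# Illustrative sketch (not the authors' released code) of the two-stream encoder, assuming torchvision.
import torch.nn as nn
import torchvision

def vgg16_stages():
    """Slice VGG16 'features' into five stages; the last max-pooling layer is dropped."""
    feats = torchvision.models.vgg16(pretrained=True).features
    cuts = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]
    return nn.ModuleList(nn.Sequential(*[feats[i] for i in range(a, b)]) for a, b in cuts)

class DiscrepantEncoder(nn.Module):
    def __init__(self, rde_modules, dse_modules):
        super().__init__()
        self.rgb_stages = vgg16_stages()
        self.depth_stages = vgg16_stages()   # depth map is copied to three channels (Sec. 4.2)
        self.rde = nn.ModuleList(rde_modules)  # two RDE modules for levels 1-2
        self.dse = nn.ModuleList(dse_modules)  # three DSE modules for levels 3-5

    def forward(self, rgb, depth):
        skips, f_r, f_d = [], rgb, depth
        for i in range(5):
            f_r = self.rgb_stages[i](f_r)
            f_d = self.depth_stages[i](f_d)
            if i < 2:
                # low levels: RGB details enhance the depth branch (Sec. 3.2)
                f_d = self.rde[i](f_r, f_d)
                skips.append(f_d)            # depth features serve as skip features
            else:
                # high levels: depth semantics enhance the RGB branch (Sec. 3.3)
                f_r = self.dse[i - 2](f_r, f_d)
                skips.append(f_r)            # enhanced RGB features serve as skip features
        return skips
```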
3.2 RGB-induced Detail Enhancement
Compared with the RGB image, the depth map puts aside complex texture information and can intuitively describe the shape and position of the salient objects. In this way, for the low-level encoder features that contain more detailed information (such as boundaries and shapes), depth features can provide more straightforward and instructive representations than RGB features, which is beneficial to the initial feature learning. However, depth information is not a panacea. For example, different object instances adjacent to each other may share the same depth value, such as the bottle and fingers in Figure 1(d), which makes them manifest as an indivisible object in the depth map. But in the corresponding RGB image, these objects can be distinguished by their color difference in most cases. Hence, these ambiguous regions burden the network training, and previous models have confirmed the difficulty of predicting such samples. To address this dilemma, we design an RGB-induced detail enhancement module to reinforce and supplement the depth modality through the RGB features in the low-level layers. By introducing the detail guidance of the RGB branch at an early stage, more information can be used in the feature feedforward process to handle these hard cases. The detailed architecture is shown in the bottom left of Figure 2.
To be specific, we first adopt two cascaded convolutional layers to fuse the underlying visual features of the two modalities. The first convolutional layer with a kernel size of 1 × 1 is used to reduce the number of feature channels, and the second convolutional layer with a kernel size of 3 × 3 achieves more comprehensive feature fusion, thereby generating the fusion feature pool f_pool^i:

f_pool^i = conv_3(conv_1([f_r^i, f_d^i])),    (1)

where i ∈ {1, 2} indexes the low-level encoder feature layer, [·, ·] denotes the channel-wise concatenation operation, and conv_n(·) is a convolutional layer with a kernel size of n × n. The advantage of generating f_pool^i, instead of transferring the RGB features directly to the depth branch, is that the common detail features of the two modalities can be enhanced and irrelevant features can be weakened in this process. Then, in order to cogently provide the useful information required by the depth features, we need to further filter the RGB features from the depth perspective. We use a series of operations on the depth features, including a max-pooling layer, two convolutional layers, and a sigmoid function, to generate a spatial attention mask as suggested by [38]. Note that for the two serial convolutional layers, we use a larger convolution kernel size (i.e., 7 × 7) to perceive the important detail regions in a large receptive field. Finally, we multiply the mask and the feature pool f_pool^i to reduce the introduction of irrelevant RGB features, thereby obtaining the required supplement information from the perspective of the depth modality. The entire process can be described as:

f_out^i = σ(conv_7(conv_7(maxpool(f_d^i)))) ⊗ f_pool^i + f_d^i,    (2)

where maxpool(·) and σ(·) denote the max-pooling operation along the channel dimension and the sigmoid function, respectively, and ⊗ represents element-wise multiplication. The features f_out^i (i ∈ {1, 2}) will be used as the input of the next layer in the depth branch. Note that, since the detail features in the depth branch are more intuitive and distinct, we choose them as the skip connection features of the first two layers for decoding.
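As one concrete reading of Eqs. (1)-(2), a minimal PyTorch sketch of the RDE module is given below. The channel configuration, the padding, and the use of single-channel 7 × 7 convolutions on the channel-wise max map are our assumptions rather than details confirmed by the released code.

```python
import torch
import torch.nn as nn

class RDE(nn.Module):
    """RGB-induced Detail Enhancement (sketch of Eqs. (1)-(2)); channel sizes are assumed."""
    def __init__(self, channels):
        super().__init__()
        # conv_1 reduces channels after concatenation, conv_3 fuses the two modalities (Eq. (1))
        self.conv1 = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # two 7x7 convolutions on the channel-wise max map produce the spatial mask (Eq. (2))
        self.mask = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=7, padding=3),
            nn.Conv2d(1, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, f_r, f_d):
        f_pool = self.conv3(self.conv1(torch.cat([f_r, f_d], dim=1)))  # Eq. (1)
        att = self.mask(f_d.max(dim=1, keepdim=True)[0])               # max-pooling along channels
        return att * f_pool + f_d                                      # Eq. (2)
```

With the VGG16 backbone, the first two stages output 64 and 128 channels, so channels would be 64 and 128 for the two RDE modules.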
3.3 Depth-induced Semantic Enhancement
In the high-level layers of the encoder stage, the learned features of the network contain more semantic information, such as categories and relationships. For an RGB image, because it contains rich color appearance and texture content, its semantic information is also more comprehensive than that of the depth modality. However, because of the relatively simple structure and data characteristics of the depth map, the learned high-level depth features have better salient object positioning, especially in the suppression of background regions, which is exactly what the RGB high-level semantics require. Therefore, we design the depth-induced semantic enhancement module in the high-level encoder stage to enrich the RGB semantic features with the help of the depth modality. Considering that simple fusion strategies (e.g., direct addition or concatenation) cannot effectively integrate cross-modality features, we employ two types of interactive patterns to roundly carry out cross-modality feature fusion, i.e., the attention level and the feature level. The detailed architecture is shown in the bottom right of Figure 2.

First, we learn an attention vector S_weight ∈ R^{1×h×w} from the depth features to guide the RGB modality to focus on the region of interest in a spatial attention [38] manner, where h and w represent the height and width of the feature map, respectively. On the one hand, it helps to reinforce salient regions that are already recognized. On the other hand, it also allows the RGB branch to focus on information that is being ignored or incorrectly emphasized. This process can be formulated as:

S_weight = σ(conv_3(maxpool(f_d^i))),    (3)
f_rs^i = S_weight ⊗ f_r^i,    (4)

where f_d^i denotes the high-level encoder features of the depth branch, f_r^i is the high-level RGB features from the backbone, f_rs^i represents the RGB encoder features enhanced by the spatial attention of the depth features f_d^i, and i ∈ {3, 4, 5} indexes the high-level encoder feature layer. In addition, high-level features usually have abundant channels, so we use the channel attention [20] to model the importance relationship of different channels and learn more discriminative features. Concretely, we learn a weight vector C_weight ∈ R^{c×1×1} through a global average pooling (GAP) layer, two fully connected (FC) layers, and a sigmoid function, in which c denotes the number of channels of the feature map. The final attention-level guidance is formulated as:

C_weight = σ(FC(GAP(f_rs^i))),    (5)
D_att^i = C_weight ⊗ f_rs^i,    (6)

where D_att^i denotes the attention-level RGB enhanced features. As for the guidance at the feature level, we use the pixel-wise addition operation to directly fuse the features of the two modalities, which can strengthen the internal response of salient objects and obtain better internal consistency. It should be noted that we use cascaded channel attention [20] and spatial attention [38] mechanisms to enhance the depth features and produce the feature-level enhanced features D_add^i. Therefore, the features that eventually flow into the next layer of the RGB branch can be expressed as:

f_out^i = D_att^i + D_add^i.    (7)

Again, the enhanced features f_out^i (i ∈ {3, 4, 5}) of the RGB branch will be introduced into the decoder stage to achieve saliency decoding reconstruction.
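The following PyTorch sketch is one possible reading of Eqs. (3)-(7). The reduction ratio of the fully connected layers and the exact composition of the feature-level branch (here we cascade channel and spatial attention on f_d to obtain D_add, following our reading of the text) are assumptions, not the authors' reference implementation.

```python
import torch.nn as nn

class DSE(nn.Module):
    """Depth-induced Semantic Enhancement (sketch of Eqs. (3)-(7)); reduction ratio is assumed."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # spatial attention learned from the depth features (Eqs. (3)-(4))
        self.spatial = nn.Sequential(nn.Conv2d(1, 1, kernel_size=3, padding=1), nn.Sigmoid())
        # channel attention on the spatially enhanced RGB features (Eqs. (5)-(6))
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        # cascaded channel + spatial attention that enhances the depth features (D_add branch)
        self.d_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        self.d_spatial = nn.Sequential(nn.Conv2d(1, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, f_r, f_d):
        b, c, h, w = f_r.shape
        s_weight = self.spatial(f_d.max(dim=1, keepdim=True)[0])       # Eq. (3)
        f_rs = s_weight * f_r                                          # Eq. (4)
        c_weight = self.fc(f_rs.mean(dim=(2, 3))).view(b, c, 1, 1)     # Eq. (5): GAP + two FCs
        d_att = c_weight * f_rs                                        # Eq. (6)
        # feature-level branch: depth features refined by cascaded channel/spatial attention
        d_ca = self.d_fc(f_d.mean(dim=(2, 3))).view(b, c, 1, 1) * f_d
        d_add = self.d_spatial(d_ca.max(dim=1, keepdim=True)[0]) * d_ca
        return d_att + d_add                                           # Eq. (7)
```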
3.4 Dense Decoding Reconstruction
In the feature encoding stage, we learn the multi-level discriminative features through the discrepant guidance and interaction. The decoder is dedicated to learning the saliency-related features and predicting the full-resolution saliency map. During the feature decoding, the skip connection that introduces encoding features into the decoder has been widely used in existing SOD models [5, 24, 28, 30]. However, these methods only establish the relationship between the corresponding encoding and decoding layers, but ignore the different positive effects of different encoding features. For example, the top-layer encoding features can provide semantic guidance for each decoding layer. Consequently, we design a dense decoding reconstruction structure to more fully and comprehensively introduce skip connection guidance. In Figure 2, we show the specific implementation plan in the top right box.

To be specific, the features f_out^i of each layer in the encoding stage constitute a list of skip connection features. For convenience of distinction, we remark them as the skip connection features f_skip^i (i ∈ {1, 2, 3, 4, 5}). Then, before the combination of the decoding features and the skip connection features of each layer, we densely connect the higher-level encoder features to generate a semantic block B, which is used to constrain the introduction of the skip connection information of the current corresponding encoder layer. The semantic block B is defined as follows:

B^i = conv_3(conv_1([up(f_skip^{i+1}), ..., up(f_skip^5)])),    (8)

where up(·) denotes the up-sampling operation via bilinear interpolation, which reshapes f_skip^j (i < j ≤ 5) to the same resolution as f_skip^i. Then, with the semantic block, we adopt element-wise multiplication to eliminate redundant information, and a residual connection to preserve the original information, thereby generating the final skip connection features F_skip^i:

F_skip^i = B^i ⊗ f_skip^i + f_skip^i,    (9)

where f_skip^i denotes the current corresponding skip connection features. In this dense way, the higher-level encoder features work as a semantic filter to achieve more effective information selection of the skip connection features, thereby effectively suppressing redundant information that may cause anomalies in the final saliency prediction. The obtained F_skip^i is combined with the decoding features of the previous layer to gradually restore image details through up-sampling and successive convolution operations. Finally, the decoding features of the last layer are used to generate the predicted saliency map via a sigmoid activation.
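The semantic block of Eqs. (8)-(9) can be sketched for a single decoding level as follows; the class name SemanticBlock, the channel handling, and the treatment of the top level are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBlock(nn.Module):
    """Dense skip-connection guidance (sketch of Eqs. (8)-(9)) for one decoding level i < 5."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # conv_1 compresses the concatenated higher-level features, conv_3 fuses them (Eq. (8))
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.conv3 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, f_skip_i, higher_skips):
        # upsample every higher-level skip feature to the resolution of f_skip_i
        ups = [F.interpolate(f, size=f_skip_i.shape[2:], mode='bilinear', align_corners=False)
               for f in higher_skips]
        b_i = self.conv3(self.conv1(torch.cat(ups, dim=1)))  # Eq. (8)
        return b_i * f_skip_i + f_skip_i                     # Eq. (9)
```

For level i, higher_skips would hold [f_skip^{i+1}, ..., f_skip^5] and in_channels their total channel count; out_channels must match the channel dimension of f_skip^i so that the element-wise multiplication in Eq. (9) is well defined.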
4 EXPERIMENTS
4.1 Datasets and Evaluation Metrics
Benchmark Datasets. Five popular RGB-D SOD benchmark datasets are employed to evaluate the performance of the proposed model. The NLPR dataset [31], obtained by a Microsoft Kinect, contains 1000 pairs of RGB images and depth maps in indoor and outdoor locations. The NJUD dataset [22] consists of 1985 RGB images and corresponding depth maps, which are collected from 3D movies, the Internet, and stereo photographs. The DUT dataset [32] includes 1200 indoor and outdoor complex scenes paired with corresponding depth maps. The STEREO dataset [29] collects 797 stereoscopic images from image galleries on the web, and obtains the corresponding depth maps through left-right view estimation. The LFSD dataset [27] includes 100 RGB-D images captured by a light field camera. Following [26, 33], we adopt 2985 images as our training data, including 1485 samples from NJUD, 700 samples from NLPR, and 800 samples from DUT, and all the remaining images are used for testing.

Evaluation Metrics. We adopt four commonly used metrics in the SOD task to quantitatively evaluate the performance. The P-R curve describes the relationship between precision and recall; the closer it is to the upper right, the better the algorithm performance. F-measure [29] indicates the weighted harmonic average of precision and recall by comparing the binary saliency map with the ground truth:

F_β = ((β^2 + 1) · Precision · Recall) / (β^2 · Precision + Recall),    (10)

where β^2 is set to 0.3 to emphasize the precision. MAE score [6] calculates the pixel-by-pixel difference between the saliency map S and the ground truth G:

MAE = (1 / (H × W)) Σ_{y=1}^{H} Σ_{x=1}^{W} |S(x, y) − G(x, y)|,    (11)

where H and W are the height and width of the original image, respectively. S-measure [14] evaluates the object-aware (S_o) and region-aware (S_r) structural similarity between the predicted saliency map and the ground truth, which is defined as:

S = α ∗ S_o + (1 − α) ∗ S_r,    (12)

where α is set to 0.5 for balancing the contributions of the two terms.
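As a worked example of how Eqs. (10)-(11) can be computed (the S-measure of Eq. (12) relies on the object-aware and region-aware terms defined in [14] and is omitted here), the following sketch assumes saliency maps and ground-truth masks normalized to [0, 1]; it is illustrative and not the evaluation script used in the paper.

```python
import numpy as np

def mae(sal, gt):
    """Eq. (11): mean absolute error between a saliency map and its ground truth, both in [0, 1]."""
    return np.mean(np.abs(sal - gt))

def max_f_measure(sal, gt, beta2=0.3, num_thresholds=255):
    """Eq. (10): sweep binarization thresholds and keep the best F-measure (beta^2 = 0.3)."""
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred = sal >= t
        tp = np.logical_and(pred, gt_bin).sum()
        precision = tp / (pred.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        f = (beta2 + 1) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```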
4.2 Implementation Details
We implement the proposed network using the PyTorch framework accelerated by an NVIDIA GeForce RTX 2080Ti GPU, and we also implement our network using the MindSpore Lite tool (https://www.mindspore.cn/). All the training and testing images are resized to 256 × 256, and the depth map is simply copied to three channels as input. During training, we initialize the parameters of the backbone with the model pre-trained on ImageNet [13], and the other filters are initialized with the PyTorch default settings. Then, to avoid overfitting, we use random flipping and rotating to augment the training samples. Moreover, we apply the usual binary cross-entropy loss function to optimize the proposed network, and the Adam algorithm is used with a batch size of 4 and an initial learning rate of 1e-4, which is divided by 5 every 40 epochs. The model can be trained in an end-to-end manner without any pre-processing (e.g., HHA [19] for the depth map) or post-processing (e.g., CRF [23]) techniques. It takes about 5 hours to obtain the final model after 100 epochs. When testing, the inference time for an image of size 256 × 256 is 0.023 second (42 FPS) on the aforementioned GPU.

Table 1: Quantitative comparison results in terms of S-measure (S_α), max F-measure (F_β) and MAE score on five benchmark datasets. ↑ & ↓ denote that higher and lower is better, respectively. The best performance on each row is achieved by our CDINet, except for the S-measure on the NLPR dataset. Methods (column order): MMCI [3] (PR 2019), TAN [2] (TIP 2019), CPFP [44] (CVPR 2019), DMRA [32] (ICCV 2019), FRDT [41] (ACM MM 2020), SSF [40] (CVPR 2020), S2MA [28] (CVPR 2020), A2dele [33] (CVPR 2020), JL-DCF [18] (CVPR 2020), PGAR [4] (ECCV 2020), DANet [45] (ECCV 2020), cmMS [26] (ECCV 2020), BiANet [43] (TIP 2020), D3Net [15] (TNNLS 2020), ASIFNet [25] (TCyb 2021), CDINet (Ours).

NLPR    F_β ↑  .8149 .8631 .8675 .8749 .8976 .8986 .9017 .8815 .8915 .9153 .9013 .9031 .8764 .8969 .8907 .9162
        S_α ↑  .8557 .8861 .8884 .8892 .9129 .9141 .9155 .8979 .9097 .9297 .9152 .9176 .9000 .9117 .9079 .9273
        MAE ↓  .0591 .0410 .0359 .0339 .0290 .0259 .0298 .0285 .0295 .0245 .0283 .0277 .0325 .0296 .0295 .0240
NJUD    F_β ↑  .8526 .8741 .7661 .8883 .8982 .9000 .8888 .8733 .9042 .9068 .8927 .9034 .9121 .8996 .8886 .9215
        S_α ↑  .8588 .8785 .7984 .8804 .8992 .9002 .8943 .8704 .9022 .9089 .8971 .9051 .9119 .9002 .8902 .9188
        MAE ↓  .0789 .0605 .0794 .0521 .0467 .0422 .0532 .0510 .0413 .0422 .0463 .0432 .0399 .0465 .0472 .0354
DUT     F_β ↑  .7671 .7903 .7180 .8975 .9263 .9242 .8997 .8923 .8612 .9171 .8954 .9090 .8156 .7855 .8245 .9372
        S_α ↑  .7913 .8083 .7490 .8879 .9159 .9157 .9031 .8864 .8758 .9136 .8894 .9070 .8368 .8152 .8396 .9274
        MAE ↓  .1126 .0926 .0955 .0477 .0362 .0340 .0440 .0426 .0556 .0372 .0465 .0405 .0745 .0848 .0724 .0302
STEREO  F_β ↑  .8425 .8705 .8601 .8861 .8987 .8903 .8158 .8864 .8740 .9008 .8199 .8971 .8844 .8495 .8800 .9033
        S_α ↑  .8559 .8775 .8714 .8858 .9004 .8920 .8424 .8868 .8855 .9054 .8410 .8999 .8882 .8687 .8820 .9055
        MAE ↓  .0796 .0591 .0537 .0474 .0428 .0449 .0746 .0431 .0509 .0422 .0712 .0429 .0497 .0578 .0485 .0410
LFSD    F_β ↑  -     -     .8214 .8523 .8555 .8626 .8310 .8280 .8217 .8390 .8417 .8623 .7287 .8062 .8602 .8746
        S_α ↑  -     -     .8199 .8393 .8498 .8495 .8292 .8258 .8171 .8444 .8375 .8491 .7422 .8167 .8520 .8703
        MAE ↓  -     -     .0953 .0830 .0809 .0751 .1018 .0839 .1031 .0818 .1031 .0792 .1340 .1023 .0809 .0631

Figure 3: Visual comparisons with other state-of-the-art RGB-D methods in some representative scenes.

4.3 Comparisons with SOTA Methods
We compare our CDINet with 15 state-of-the-art CNN-based RGB-D SOD methods, including MMCI [3], TAN [2], CPFP [44], DMRA [32], FRDT [41], SSF [40], S2MA [28], A2dele [33], JL-DCF [18], DANet [45], PGAR [4], cmMS [26], BiANet [43], D3Net [15], and ASIFNet [25]. For fair comparisons, we test these methods with the released codes under the default settings to obtain the saliency maps. As for the models without released codes, we directly use the saliency maps provided by the authors for comparison.

Quantitative evaluation. Table 1 reports the quantitative comparison results in terms of three evaluation metrics on five datasets. It can be seen that our network outperforms all compared methods on these five datasets, except for the S-measure on the NLPR dataset. For example, compared with the second best method on the DUT dataset, the minimum percentage gain reaches 1.2% for max F-measure, 1.3% for S-measure, and 11.2% for MAE score. On the NJUD dataset, compared with BiANet [43] (the second best method), our proposed CDINet achieves a 1.0% improvement for max F-measure, a 0.8% improvement for S-measure, and an 11.3% improvement for MAE score. On the LFSD dataset, compared with the second best method, the S-measure is improved from 0.8520 to 0.8703, a percentage gain of 2.1%. Limited by the page space, the P-R curves of different methods on the five datasets are shown in the supplementary materials.

Qualitative comparison. In order to more intuitively demonstrate the excellent performance of the proposed method, we provide some qualitative comparison results in Figure 3.
As can be seen in this figure, our model achieves better visual effects in many challenging scenarios, such as small objects (i.e., the first image), multiple objects (i.e., the second image), and disturbing backgrounds (i.e., the fifth image). Meanwhile, our method not only accurately detects salient objects, but also obtains better internal consistency. For example, in the second image, although other methods can detect multiple windows, they cannot guarantee good structural integrity and internal consistency of the salient objects, while our CDINet does, benefiting from the proposed DSE module. In addition, the dense decoding reconstruction structure also makes the decoding process more refined, and the object boundaries are sharper (e.g., the last image). For the confusing depth map (e.g., the fourth image), most methods produce redundant areas or vague predictions, yet our method can effectively suppress these ambiguous regions.

Table 2: Ablation analyses of different components on the NLPR and DUT datasets.

                NLPR                        DUT
models     F_β ↑   S_α ↑   MAE ↓      F_β ↑   S_α ↑   MAE ↓
CDINet     .9162   .9273   .0240      .9372   .9274   .0302
w/o RDE    .9153   .9261   .0251      .9327   .9226   .0338
w/o DSE    .9062   .9219   .0253      .9222   .9184   .0369
w/o DDR    .9154   .9258   .0248      .9296   .9238   .0334

4.4 Ablation Studies
We conduct thorough ablation studies on the NLPR and DUT datasets to analyze the effect of each individual component of our CDINet. The quantitative results are reported in Table 2, in which the first line (i.e., CDINet) shows the performance of our full model. Moreover, we also explore the role of different interaction modes in Table 3.

Effectiveness of RDE module. First, we remove the RDE module to verify its role, denoted as 'w/o RDE', which means the two modalities do not interact in the first two layers. Comparing it with the full model listed in Table 2, the performance is moderately enhanced after adding the RDE module, with percentage gains of 4.4% and 10.7% in terms of MAE score on the NLPR dataset and the DUT dataset, respectively. This experiment shows the effectiveness of the RGB modality guiding the depth modality through the RDE module.

Effectiveness of DSE module. We also directly delete the DSE module in the 3rd, 4th and 5th encoder layers of the two-stream backbone network, and then add the RGB features and depth features of the top layer for decoding. The results in the third row (i.e., 'w/o DSE') of Table 2 demonstrate the positive effect of the DSE module. On the NLPR dataset and the DUT dataset, without the DSE module, the max F-measure decreases by 1.1% and 1.6%, respectively. In addition, we show the visualization of the RGB features before and after DSE (i.e., f_r^5 and f_out^5) in Figure 4. With the introduction of depth guidance through the DSE module, the internal response of objects in the RGB features is obviously improved, while the abnormal noise in the background area is effectively suppressed. We attribute the performance improvement to the attention-level and feature-level interactive patterns, which assist the RGB modality in capturing clearer and fine-grained semantic attributes.

Figure 4: Feature visualization of the DSE module in the last layer of the backbone.

Effectiveness of DDR structure. We replace the proposed dense decoding reconstruction structure with a corresponding-layer skip connection similar to U-Net [34], and the result is shown in the last line (i.e., 'w/o DDR') of Table 2. Compared with the full network, it can be seen that our decoding strategy improves all three metrics on the two testing datasets, and especially achieves percentage gains of 0.8% for max F-measure and 9.6% for MAE score on the DUT dataset.

Table 3: The effectiveness analyses of the discrepant interaction structure on the NLPR and DUT datasets.

                NLPR                        DUT
Number     F_β ↑   S_α ↑   MAE ↓      F_β ↑   S_α ↑   MAE ↓
No.1       .9162   .9273   .0240      .9372   .9274   .0302
No.2       .9153   .9261   .0251      .9295   .9217   .0345
No.3       .9160   .9298   .0242      .9328   .9246   .0327

Effectiveness of Discrepant Interaction. In this work, we propose a novel interaction architecture; here, we conduct a couple of experiments to analyze the variants of the interaction mode.
First, we reset the guidance direction of the RDE module so that the depth branch points to the RGB branch in the first two layers, forming the unidirectional interaction mode shown in Figure 1(a). As reported in the second line (denoted as 'No.2') of Table 3, compared with the full model (denoted as 'No.1'), the performance drops obviously on the DUT dataset. Furthermore, we also verify the bidirectional interaction mode shown in Figure 1(b) by symmetrically inserting the RDE module and the DSE module into the two branches. The results (denoted as 'No.3') show that almost all indicators of CDINet (i.e., No.1) are optimal, except for the S-measure on the NLPR dataset. However, this performance improvement of the bidirectional interaction mode comes at the cost of increased computation and a larger number of parameters: compared with the final model, the parameter amount is increased by 10M and the inference speed drops from 42 FPS to 32 FPS. In general, our model achieves better results when considering both performance and computation cost.

5 CONCLUSION
In this paper, we explore a novel cross-modality interaction mode and propose a cross-modality discrepant interaction network, which explicitly models the dependence of the two modalities in different convolutional layers. To this end, two components (i.e., the RDE module and the DSE module) are designed to achieve differentiated cross-modality guidance. Furthermore, we also put forward a DDR structure, which generates a semantic block by leveraging multiple high-level features to upgrade the skip connection. Comprehensive experiments demonstrate that our network achieves competitive performance against state-of-the-art methods on five benchmark datasets, and our inference speed reaches the real-time level (i.e., 42 FPS).

ACKNOWLEDGMENTS
This work was supported by the Beijing Nova Program under Grant Z201100006820016, in part by the National Key Research and Development of China under Grant 2018AAA0102100, in part by the National Natural Science Foundation of China under Grant 62002014 and Grant U1936212, in part by the Elite Scientist Sponsorship Program of the China Association for Science and Technology under Grant 2020QNRC001, in part by the General Research Fund-Research Grants Council (GRF-RGC) under Grant 9042816 (CityU 11209819) and Grant 9042958 (CityU 11203820), in part by the Hong Kong Scholars Program under Grant XJ2020040, in part by the CAAI-Huawei MindSpore Open Fund, and in part by the China Postdoctoral Science Foundation under Grant 2020T130050 and Grant 2019M660438.

REFERENCES
[25] Chongyi Li, Runmin Cong, Sam Kwong, Junhui Hou, Huazhu Fu, Guopu Zhu, Dingwen Zhang, and Qingming Huang. 2021.
ASIF-Net: Attention steered interweave fusion network for RGB-D salient object detection. IEEE Transactions on Cybernetics 51, 1 (2021), 88–100. [26] Chongyi Li, Runmin Cong, Yongri Piao, Qianqian Xu, and Chen Change Loy. 2020. RGB-D salient object detection with cross-modality modulation and selection. In Proceedings of the European Conference on Computer Vision. Springer, 225–241. [27] Nianyi Li, Jinwei Ye, Yu Ji, Haibin Ling, and Jingyi Yu. 2014. Saliency detection on light field. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2806–2813. [28] Nian Liu, Ni Zhang, and Junwei Han. 2020. Learning selective self-mutual attention for RGB-D saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 13756–13765. [29] Yuzhen Niu, Yujie Geng, Xueqing Li, and Feng Liu. 2012. Leveraging stereopsis for saliency analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 454–461. [30] Youwei Pang, Lihe Zhang, Xiaoqi Zhao, and Huchuan Lu. 2020. Hierarchical dynamic filtering network for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision. Springer, 235–252. [31] Houwen Peng, Bing Li, Weihua Xiong, Weiming Hu, and Rongrong Ji. 2014. RGBD salient object detection: A benchmark and algorithms. In Proceedings of the European Conference on Computer Vision. Springer, 92–109. [32] Yongri Piao, Wei Ji, Jingjing Li, Miao Zhang, and Huchuan Lu. 2019. Depthinduced multi-scale recurrent attention network for saliency detection. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 7254–7263. [33] Yongri Piao, Zhengkun Rong, Miao Zhang, Weisong Ren, and Huchuan Lu. 2020. A2dele: Adaptive and attentive depth distiller for efficient RGB-D salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 9060–9069. [34] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention. Springer, 234–241. [35] Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations. 1–14. [36] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 7794–7803. [37] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision. Springer, 3–19. [38] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning. PMLR, 2048–2057. [39] Miao Zhang, Sun Xiao Fei, Jie Liu, Shuang Xu, Yongri Piao, and Huchuan Lu. 2020. Asymmetric two-stream architecture for accurate RGB-D saliency detection. In Proceedings of the European Conference on Computer Vision. Springer, 374–390. [40] Miao Zhang, Weisong Ren, Yongri Piao, Zhengkun Rong, and Huchuan Lu. 2020. Select, supplement and focus for RGB-D saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
IEEE, 3472–3481. [41] Miao Zhang, Yu Zhang, Yongri Piao, Beiqi Hu, and Huchuan Lu. 2020. Feature reintegration over differential treatment: A top-down and adaptive fusion network for RGB-D salient object detection. In Proceedings of the 28th ACM International Conference on Multimedia. ACM, 4107–4115. [42] Qijian Zhang, Runmin Cong, Chongyi Li, Ming-Ming Cheng, Yuming Fang, Xiaochun Cao, Yao Zhao, and Sam Kwong. 2021. Dense attention fluid network for salient object detection in optical remote sensing images. IEEE Transactions on Image Processing 30 (2021), 1305–1317. [43] Zhao Zhang, Zheng Lin, Jun Xu, Wen-Da Jin, Shao-Ping Lu, and Deng-Ping Fan. 2021. Bilateral attention network for RGB-D salient object detection. IEEE Transactions on Image Processing 30 (2021), 1949–1961. [44] Jia-Xing Zhao, Yang Cao, Deng-Ping Fan, Ming-Ming Cheng, Xuan-Yi Li, and Le Zhang. 2019. Contrast prior and fluid pyramid integration for RGBD salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3927–3936. [45] Xiaoqi Zhao, Lihe Zhang, Youwei Pang, Huchuan Lu, and Lei Zhang. 2020. A single stream network for robust and real-time RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision. Springer, 646–662. [1] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. 2015. Salient object detection: A benchmark. IEEE Transactions on Image Processing 24, 12 (2015), 5706–5722. [2] Hao Chen and Youfu Li. 2019. Three-stream attention-aware network for RGBD salient object detection. IEEE Transactions on Image Processing 28, 6 (2019), 2825–2835. [3] Hao Chen, Youfu Li, and Dan Su. 2019. Multi-modal fusion network with multiscale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognition 86 (2019), 376–385. [4] Shuhan Chen and Yun Fu. 2020. Progressively guided alternate refinement network for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision. Springer, 520–538. [5] Zuyao Chen, Runmin Cong, Qianqian Xu, and Qingming Huang. 2021. DPANet: Depth potentiality-aware gated attention network for RGB-D salient object detection. IEEE Transactions on Image Processing 30 (2021), 7012–7024. [6] Runmin Cong, Jianjun Lei, Huazhu Fu, Ming-Ming Cheng, Weisi Lin, and Qingming Huang. 2019. Review of visual saliency detection with comprehensive information. IEEE Transactions on Circuits and Systems for Video Technology 29, 10 (2019), 2941–2959. [7] Runmin Cong, Jianjun Lei, Huazhu Fu, Junhui Hou, Qingming Huang, and Sam Kwong. 2019. Going from RGB to RGBD saliency: A depth-guided transformation model. IEEE Transactions on Cybernetics 50, 8 (2019), 3627–3639. [8] Runmin Cong, Jianjun Lei, Huazhu Fu, Qingming Huang, Xiaochun Cao, and Chunping Hou. 2017. Co-saliency detection for RGBD images based on multiconstraint feature matching and cross label propagation. IEEE Transactions on Image Processing 27, 2 (2017), 568–579. [9] Runmin Cong, Jianjun Lei, Huazhu Fu, Qingming Huang, Xiaochun Cao, and Nam Ling. 2019. HSCS: Hierarchical sparsity based co-saliency detection for RGBD images. IEEE Transactions on Multimedia 21, 7 (2019), 1660–1671. [10] Runmin Cong, Jianjun Lei, Huazhu Fu, Weisi Lin, Qingming Huang, Xiaochun Cao, and Chunping Hou. 2019. An iterative co-saliency framework for RGBD images. IEEE Transactions on Cybernetics 49, 1 (2019), 233–246. [11] Runmin Cong, Jianjun Lei, Huazhu Fu, Fatih Porikli, Qingming Huang, and Chunping Hou. 2019. 
Video saliency detection via sparsity-based reconstruction and propagation. IEEE Transactions on Image Processing 28, 10 (2019), 4819–4831. [12] Runmin Cong, Jianjun Lei, Changqing Zhang, Qingming Huang, Xiaochun Cao, and Chunping Hou. 2016. Saliency detection for stereoscopic images based on depth confidence analysis and multiple cues fusion. IEEE Signal Processing Letters 23, 6 (2016), 819–823. [13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255. [14] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. 2017. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 4548–4557. [15] Deng-Ping Fan, Zheng Lin, Zhao Zhang, Menglong Zhu, and Ming-Ming Cheng. 2020. Rethinking RGB-D salient object detection: Models, data sets, and large-scale benchmarks. IEEE Transactions on Neural Networks and Learning Systems 32, 5 (2020), 2075–2089. [16] David Feng, Nick Barnes, Shaodi You, and Chris McCarthy. 2016. Local background enclosure for RGB-D salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2343–2350. [17] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. 2019. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3146–3154. [18] Keren Fu, Deng-Ping Fan, Ge-Peng Ji, and Qijun Zhao. 2020. JL-DCF: Joint learning and densely-cooperative fusion framework for RGB-D salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3052–3062. [19] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. 2014. Learning rich features from RGB-D images for object detection and segmentation. In Proceedings of the European Conference on Computer Vision. Springer, 345–360. [20] Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 7132–7141. [21] Zhou Huang, Huai-Xin Chen, Tao Zhou, Yun-Zhi Yang, and Bi-Yuan Liu. 2021. Multi-level cross-modal interaction network for RGB-D salient object detection. Neurocomputing 452 (2021), 200–211. [22] Ran Ju, Ling Ge, Wenjing Geng, Tongwei Ren, and Gangshan Wu. 2014. Depth saliency based on anisotropic center-surround difference. In IEEE International Conference on Image Processing. IEEE, 1115–1119. [23] Philipp Krähenbühl and Vladlen Koltun. 2011. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems. MIT Press, 109–117. [24] Chongyi Li, Runmin Cong, Junhui Hou, Sanyi Zhang, Yue Qian, and Sam Kwong. 2019. Nested network with two-stream pyramid for salient object detection in optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 57, 11 (2019), 9156–9166.