The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

GradingNet: Towards Providing Reliable Supervisions for Weakly Supervised Object Detection by Grading the Box Candidates

Qifei Jia 1,2, Shikui Wei 1,2,*, Tao Ruan 1,2, Yufeng Zhao 3, Yao Zhao 1,2
1 Institute of Information Science, Beijing Jiaotong University, Beijing, China
2 Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China
3 China Academy of Chinese Medical Sciences, Beijing, China
{18120303, shkwei, 16112064, yzhao}@bjtu.edu.cn, snowmanzhao@163.com
* Corresponding author
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Weakly-Supervised Object Detection (WSOD) aims at training a model with limited and coarse annotations to precisely locate the regions of objects. Existing works solve the WSOD problem with a two-stage framework, i.e., generating candidate bounding boxes with weak supervision information and then refining them by directly employing supervised object detection models. However, most such works focus mainly on boosting the performance of the first stage, while ignoring better usage of the generated candidate bounding boxes. To address this issue, we propose a new two-stage framework for WSOD, named GradingNet, which makes good use of the generated candidate bounding boxes. Specifically, GradingNet consists of two modules: the Boxes Grading Module (BGM) and the Informative Boosting Module (IBM). BGM generates bounding-box proposals with a standard one-stage weakly-supervised method, then utilizes the Inclusion Principle to pick out highly reliable boxes and evaluate the grade of each box. With these boxes and their grade information, an effective anchor generator and a grade-aware loss are carefully designed to train the IBM. Taking advantage of the grade information, GradingNet achieves state-of-the-art performance on the COCO, VOC 2007, and VOC 2012 benchmarks.

Introduction

Object detection aims at locating and recognizing objects of interest in given images of various scenes. Over a long period, many methods were proposed to address the challenges of object detection (Lin et al. 2017; Dai et al. 2016; Wang et al. 2019). Although remarkable progress has been achieved, it is time-consuming and labor-intensive to annotate accurate object bounding boxes for a dataset. Therefore, weakly-supervised object detection (WSOD), which uses only image-level labels for training, is considered a promising solution to this problem.

Traditionally, most previous methods (Bilen, Pedersoli, and Tuytelaars 2015; Song et al. 2014; Li et al. 2016; Hoffman et al. 2015) attempt to address the WSOD problem by employing a Multiple Instance Learning (MIL) network. In particular, they first decompose images into object proposals and then use MIL to iteratively perform proposal selection and classifier estimation. Though many exciting results have been achieved, they are still far from comparable to fully supervised methods (Girshick 2015; Ren et al. 2015; Redmon et al. 2016). This is mainly because fully supervised methods exploit the strong learning ability of CNNs to fit datasets with accurate region-level annotations. Therefore, multiple instance learning networks can only solve part of the WSOD problem, and some methods treat MIL as a sub-task of WSOD and carry out a series of follow-up processing steps to pursue higher performance.

Recently, some works (Tang et al. 2017, 2018; Yang, Li, and Dou 2019) solve the WSOD problem with a two-stage framework, which uses the top-ranked proposals produced by the weakly-supervised method to train a supervised detector. Since the top-ranked proposal only finds one ground-truth for each category, it loses many informative proposals in complex visual scenes. To handle this problem, W2F (Zhang et al. 2018) introduces a PGE algorithm to better find the pseudo ground-truth. However, since some weakly-supervised detectors are unstable, the generated proposals carry some randomness, and a mechanical post-processing method like PGE sometimes makes the results worse.

In addition to the above shortcomings, these methods share a common problem: they focus only on boosting the performance of the first (weakly-supervised) stage and assume in the second stage that the pseudo ground-truth produced by the first stage is accurate. However, due to the inherent shortcomings of weakly-supervised detectors, the detected proposals are generally incomplete and inaccurate. Using these proposals as pseudo ground-truth to train a fully-supervised detector in the second (fully-supervised) stage leads to two problems. Firstly, some of the anchors generated from the pseudo ground-truth will be misclassified, which greatly misleads the model. Secondly, many pseudo ground-truths cover only a small, discriminative part of an object, while others cover the whole object region.

To address the above-mentioned problems, we propose a new solution for the WSOD problem, which divides it into a preliminary WSOD problem and a mixed problem of incomplete and inaccurate supervision (Zhou 2018). The Boxes Grading Module (BGM) and the Informative Boosting Module (IBM) are carefully designed to solve the two sub-problems separately. BGM evaluates the quality of each proposal to make Graded-Labels. Specifically, we use the stability of the model to find stable boxes, and then utilize the Inclusion Principle to find high-quality proposals and turn them into Graded-Labels. IBM trains a detector using these Graded-Labels. Unlike the traditional two-stage method, we do not view the second stage as a fully-supervised process. Instead, we regard it as an incomplete-supervision problem (the bounding boxes in the Graded-Labels are incomplete) combined with an inaccurate-supervision problem (the coordinates of the bounding boxes are inaccurate). To tackle the first problem, IBM generates anchors in a predictive way. To solve the second one, IBM adopts a grade-aware loss. Besides, we propose Area-balanced Sampling to control the proportion of positive and negative samples.

In brief, this paper makes four main contributions:
1) A new two-stage framework is proposed for solving the WSOD problem, which divides it into a preliminary WSOD problem and a mixed problem of incomplete and inaccurate supervision.
2) A Boxes Grading Module is proposed to grade the generated proposals, which provides more reliable bounding boxes for the following supervised stage.
3) An Informative Boosting Module is proposed to train a supervised detector with the Graded-Labels produced by BGM, which further improves detection performance.
4) The proposed method achieves state-of-the-art performance on the COCO, VOC 07, and VOC 12 benchmarks.

Related Work

Weakly Supervised Learning: Common weakly supervised learning methods are Multiple Instance Learning (MIL) and Latent Variable Learning (LVL). MIL splits the image into positive and negative parts, and each image is considered a bag of candidate object instances. Most existing WSOD methods (Gokberk Cinbis, Verbeek, and Schmid 2014; Cinbis, Verbeek, and Schmid 2017; Wang et al. 2015; Hoffman et al. 2015) treat the problem as a MIL problem. However, positive object instances sometimes focus on the most discriminative parts of an object instead of the whole region, which causes inaccurate object localization. In addition, since the underlying MIL optimization is non-convex, it is sensitive to positive-instance initialization and tends to get trapped in local optima. Some LVL algorithms are also used to solve the WSOD problem. Clustering methods (Song et al. 2014; Tang et al. 2018) recognize latent objects by finding the most discriminative clusters. Latent SVMs (Yu and Joachims 2009; Ye et al. 2017) optimize the learned object locations with the Expectation-Maximization algorithm. Entropy-based methods (Miller et al. 2012; Bouchacourt, Nowozin, and Pawan Kumar 2015; Wan et al. 2018) use entropy in LVL to measure the randomness of object localization during the learning process. Unfortunately, these methods often become stuck in a poor local minimum, just like MIL.

Two-stage WSOD approaches: Recently, some WSOD methods (Zhang et al. 2018; Tang et al. 2017; Wei et al. 2018; Tang et al. 2018; Yang, Li, and Dou 2019; Bilen and Vedaldi 2016; Kantorov et al. 2016) follow a two-stage procedure that uses the strong regression ability of a fully-supervised detector to guide weakly-supervised detection. In the first stage, a MIL network is used, which employs a CNN as the detector to activate regions of interest on the feature maps and localizes objects by leveraging spatial distributions and informative patterns captured in the convolutional layers. In the second stage, a fully supervised detector is trained to further refine object locations, using the boxes selected in the first stage as supervision. The main functionality of the second stage is to regress the object locations more precisely. In this paper, we design the Boxes Grading Module (BGM) to process the information produced by the first stage and support better training in the second stage.

Fully-supervised detectors used in WSOD: Object detection has been widely studied in the last few decades and many methods have been proposed, such as Fast R-CNN (Girshick 2015), Faster R-CNN (Ren et al. 2015), and other methods built on them (Lin et al. 2017; Dai et al. 2016; Wang et al. 2019). Specifically, Faster R-CNN improved Fast R-CNN and achieved a breakthrough in speed. Though great progress has been made, fully-supervised methods still require accurate bounding-box annotations, which are expensive and time-consuming. In this paper, we design the Informative Boosting Module (IBM) on top of fully-supervised methods to train the detector, and use the Graded-Labels produced by BGM in place of bounding-box annotations.
Method

The overview of GradingNet is illustrated in Figure 1. It includes two main modules: the Boxes Grading Module (BGM) and the Informative Boosting Module (IBM). The BGM utilizes a novel Inclusion Principle to select high-quality regions from object proposals and categorizes them into Graded-Labels. Then, IBM makes use of those Graded-Labels to train a more accurate detector.

[Figure 1: Overview of the proposed GradingNet; the WSOD module can adopt any existing one-stage weakly-supervised method.]

Inclusion Principle

For a long time, a large number of WSOD works have tried to solve the challenge of locating the complete object. Although remarkable progress has been achieved, most methods are only able to locate the discriminative parts of objects. To tackle this problem, we first propose a simple Inclusion Principle.

Overview: The idea of the Inclusion Principle comes from the relationship between object parts: the whole object contains a part of the object, which in turn contains the most discriminative part. For example, the human body contains the upper body, and the upper body contains the head. When the detector detects many parts within an object, we can use this principle to judge which part is the most complete.

Details: Inclusion is an abstract concept, and we quantify it with Eq. (1). For boxes A and B, we define Intersection over Single (IoS) to measure the degree of inclusion:

    IoS(A, B) = Area(A ∩ B) / Area(B)    (1)

Under this definition, if IoS(A, B) > T_IoS, where T_IoS is the overlap threshold, we determine that A includes B. Based on this principle, once we locate the most discriminative part of an object, the whole object can be found. Therefore, we first propose a method to find the discriminative part.
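To make Eq. (1) concrete, the following is a minimal sketch of the IoS computation and the resulting inclusion test. The function names and the (x1, y1, x2, y2) corner-format convention are our own illustrative assumptions; the default threshold of 0.6 matches the T_IoS setting reported in our implementation details.

```python
def ios(box_a, box_b):
    """Intersection over Single (Eq. 1): intersection area divided by
    the area of box_b. Boxes are (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / area_b if area_b > 0 else 0.0

def includes(box_a, box_b, t_ios=0.6):
    """A is judged to include B when IoS(A, B) exceeds T_IoS."""
    return ios(box_a, box_b) > t_ios
```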
Finding the Stable Boxes

A well-trained detector can localize discriminative parts in an image even when the image is perturbed by a batch of noise. Based on this characteristic, we name such a discriminative part a stable box, and we exploit this stability of the detector to find them. Hence, we first augment the dataset with carefully designed noise, then feed the augmented images into a pre-trained weakly supervised detector and take the predicted boxes as candidates. Specifically, for each image I^i ∈ D, i = 1, 2, ..., |D| in the dataset D, we generate two kinds of additive noise, δ1 and δ2: δ1 is generated by randomly changing 0.5% of the pixels, while δ2 changes 1% of the pixels. We add these noises to the original image to obtain two perturbed images, I^i_δ1 = I^i + δ1^i and I^i_δ2 = I^i + δ2^i. The original images together with the noisy images form the augmented dataset D+:

    D+ = D ∪ D_δ1 ∪ D_δ2
    D_δ1 = {I^i_δ1 = I^i + δ1^i | i = 1, 2, ..., |D|}    (2)
    D_δ2 = {I^i_δ2 = I^i + δ2^i | i = 1, 2, ..., |D|}

A weakly supervised detector takes the images in D+ as input and outputs predicted box proposals. The design of such a detector is beyond the scope of this paper; we therefore apply an existing method (e.g., OICR (Tang et al. 2017)) to execute this step. For each image I in D+, a collection P+ of box candidates with confidence scores is generated:

    P+ = {(b+^i, s+^i) | b+^i ∈ R^4, s+^i ∈ R, i = 1, ..., |P|}    (3)

where b+^i represents a box candidate and s+^i is the corresponding confidence score within [0, 1]. To coarsely purify the predictions, we perform NMS with a loose threshold T_nms1 on each P+. Then, we use the stability of the weakly-supervised method to find stable boxes. Consider the collections P, P_δ1, and P_δ2 on the same image in D, D_δ1, and D_δ2:

    P = {b^i | b^i ∈ R^4, i = 1, ..., |P|}
    P_δ1 = {b_δ1^i | b_δ1^i ∈ R^4, i = 1, ..., |P|}    (4)
    P_δ2 = {b_δ2^i | b_δ2^i ∈ R^4, i = 1, ..., |P|}

For a certain box A in P, we find the boxes B and C closest to it in P_δ1 and P_δ2 according to IoU. The stability (ST) of box A is then calculated as

    ST_A = λ1 · IoU(A, B) + λ2 · IoU(A, C)    (5)

where λ1 and λ2 are parameters that weight the importance of B and C, respectively. The process is repeated until the ST of all boxes in P, P_δ1, and P_δ2 has been calculated, after which NMS is applied; only the boxes whose ST is larger than a pre-defined threshold T_st constitute the stable-box collection P_st. The collection P_ot, consisting of all remaining boxes, is formed at the same time:

    P_st = {b_st^i | b_st^i ∈ R^4, i = 1, ..., |P|}    (6)
    P_ot = {b_ot^i | b_ot^i ∈ R^4, i = 1, ..., |P|}

Boxes Grading Module

Relying on the Inclusion Principle, the BGM uses stable boxes to select high-quality regions from the object proposals. Given a set of proposals P = {p1, ..., pN} in image X, the corresponding confidence scores output by the one-stage weakly-supervised detector are unreliable. Therefore, we first initialize a set of scores S = {s1, ..., sN} and a set of grades G = {g1, ..., gN}. The scores are used as the criterion for identifying a high-quality subset P_g of boxes, and the grades further evaluate the quality of the selected boxes. We initialize S and G according to the Inclusion Principle. Each box is represented as a node in an oriented graph (Figure 2), and an inclusion edge is drawn between two nodes if the IoS (Eq. (1)) of their corresponding boxes exceeds a threshold (solid lines in Figure 2). The BGM provides a way to find the high-quality region (node 1 in Figure 2): to encourage selecting regions that contain the complete object, we update the score of a parent node (nodes 1 and 2 in Figure 2) using its child nodes:

    Score_A = Score_A + Score_B · IoS(A, B)    (7)

[Figure 2: All detected proposals (left) and their oriented-graph representation (right). The goal is to select the green boxes, which is achieved by analyzing the inclusion relationships between proposals.]

Based on the Inclusion Principle, if a region contains more complete object boxes, its node has more child nodes. For this reason, the score S of a more complete object box is updated more times and reaches a larger value, and its grade G is directly assigned a higher value than that of its child nodes. Besides, to suppress the selection of regions containing many objects, the score of a parent node can only be updated by a single child node of the same grade G: if a parent node has many child nodes with G = i, its score is updated only once; however, if it also has a child node with G = j (i ≠ j), its score can be updated again as usual.

The BGM algorithm is summarized in Algorithm 1.

Algorithm 1: Boxes Grading Module
Input: image X in dataset D+; stable-box collection P_st; other-box collection P_ot; T_nms2; T_IoS
Output: boxes P_g
 1: S_st ← {0.2}^N; S_ot ← {0.1}^N; S = S_st ∪ S_ot
 2: G_st ← {1}^N; G_ot ← {0}^N; G = G_st ∪ G_ot
 3: k ← 0; i ← 1
 4: while k ≠ 2 do
 5:   for A ∈ P_ot and B ∈ P_st ∪ P_ot do
 6:     if G_B = i and IoS(A, B) > T_IoS then
 7:       G_A ← i + 1
 8:       S_A ← update S_A using Eq. (7)
 9:       if k > 0 then
10:         k ← k − 1
11:       end if
12:     else
13:       i ← i + 1
14:       k ← k + 1
15:     end if
16:   end for
17: end while
18: P_g = NMS(P_st ∪ P_ot, S, T_nms2)

We thus obtain the boxes P_g required by the Graded-Labels together with their grades G_g. Although G_g can evaluate the quality of P_g, we still require a better parameter to avoid the influence of different grade distributions across categories.
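To make the grading idea concrete, here is a simplified sketch (our own rendering, not the exact control flow of Algorithm 1) that builds the inclusion graph from IoS, propagates grades from children to parents, and applies the score update of Eq. (7) at most once per child grade. It reuses the `ios` helper sketched after Eq. (1); the initial scores and grades follow lines 1-2 of Algorithm 1.

```python
def grade_boxes(stable_boxes, other_boxes, t_ios=0.6):
    """Simplified BGM sketch. Stable boxes start at grade 1 / score 0.2,
    the rest at grade 0 / score 0.1 (Algorithm 1, lines 1-2)."""
    boxes = stable_boxes + other_boxes
    n = len(boxes)
    grades = [1] * len(stable_boxes) + [0] * len(other_boxes)
    scores = [0.2] * len(stable_boxes) + [0.1] * len(other_boxes)

    # Inclusion edges: a -> b means box a includes box b (IoS > T_IoS).
    children = {a: [b for b in range(n)
                    if a != b and ios(boxes[a], boxes[b]) > t_ios]
                for a in range(n)}

    # Propagate grades upward: a parent outranks every box it includes.
    for _ in range(n):  # bounded number of passes until grades stabilize
        changed = False
        for a in range(n):
            for b in children[a]:
                if grades[a] <= grades[b]:
                    grades[a] = grades[b] + 1
                    changed = True
        if not changed:
            break

    # Score update (Eq. 7), at most one child per grade, to suppress
    # regions that happen to cover many separate objects.
    for a in range(n):
        used_child_grades = set()
        for b in children[a]:
            if grades[b] not in used_child_grades:
                scores[a] += scores[b] * ios(boxes[a], boxes[b])
                used_child_grades.add(grades[b])
    return boxes, scores, grades
```

A final NMS over the scored boxes with threshold T_nms2, as in line 18 of Algorithm 1, would then yield P_g.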
Therefore, we count the total number m of boxes in each category (over all images) and use m together with the grade G of each box in category n to calculate the reliability (Rel) via Eq. (8). Rel replaces G_g in the Graded-Labels as the measure of a box's quality:

    Rel_A = G_A / G_avg = (m · G_A) / Σ_{cls=n} G    (8)

Informative Boosting Module

IBM trains a detector by using a CNN to fit the Graded-Labels, further improving detection performance. All components are detailed in the following.

Predicted Anchoring: Since the bounding boxes in the labels are incomplete, generating anchors over the whole feature map would cause some anchors to be misclassified. Inspired by GA-RPN (Wang et al. 2019), we predict the position of each anchor's center point so as to generate anchors around the Graded-Labels. Specifically, we perform a 1 × 1 convolution on the feature map and divide the output into two single-channel maps. An element-wise sigmoid is applied to obtain two probability maps, FC1 and FC2. Each value in FC1 (FC2) indicates how likely that position is to be the center point of a positive (negative) anchor. To train the center-point probability maps of positive and negative anchors separately, we use two binary label maps in which 1 marks a valid location for an anchor center point and 0 marks all other regions. We first map the bounding box (x_g, y_g, w_g, h_g), i.e., a box of width w_g and height h_g centered at (x_g, y_g), to the corresponding feature-map scale, obtaining (x'_g, y'_g, w'_g, h'_g). For positive anchors, we generate a center box (x'_g, y'_g, K·w'_g, K·h'_g), and center points are placed inside it. For negative anchors, we generate outbox1 (x'_g, y'_g, (K+1)·w'_g, (K+1)·h'_g) and outbox2 (x'_g, y'_g, L·w'_g, L·h'_g), and center points are placed in the region included by outbox1 but excluded by outbox2. Once the center points are determined, we generate anchors with 3 scales (box areas of 128², 256², and 512² pixels) and 3 aspect ratios (1:1, 1:2, and 2:1) at each point of the positive/negative center-point regions. For training, at each point in the positive (negative) center-point region, we choose among the 9 proposals the one with the largest (smallest) IoU with the Graded-Labels. For testing, we keep all 9 proposals as anchors. The whole process is shown in Figure 3.

[Figure 3: Overview of predicted anchoring. The center points of positive anchors lie inside the center box; the center points of negative anchors lie inside outbox1 but outside outbox2. We generate anchors of 9 sizes at each center point and choose the proposal with the largest/smallest IoU with the Graded-Labels as the positive/negative sample.]

Area-balanced Sampling: To address the imbalance between positive and negative samples in the Graded-Labels, we propose a novel solution. Because of Predicted Anchoring, the number of positive/negative anchors is directly determined by the areas of the center-point regions, so we can control the sample proportion by dynamically adjusting the parameters K and L introduced above. Since bounding boxes with higher Rel (Eq. (8)) are more accurate, and anchors generated near them have higher quality, we set K = Rel. Following common practice, we set the ratio of positive to negative samples to 1:3, which gives Eq. (9):

    K² / ((K + 1)² − L²) = 1/3    (9)

so the parameter L is determined by Eq. (10):

    L = sqrt((K + 1)² − 3K²)    (10)
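The relation between K and L is easy to sanity-check numerically. Below is a small sketch (our own illustration, with hypothetical function names) that derives L from K = Rel via Eq. (10) and verifies that the center-box area and the ring between outbox2 and outbox1 stay in the 1:3 positive-to-negative ratio of Eq. (9).

```python
import math

def area_balanced_params(rel):
    """Given a box reliability Rel, return (K, L) per Eqs. (9)-(10).
    K scales the positive center box; the negative region is the ring
    between outbox2 (scale L) and outbox1 (scale K + 1). Valid while
    (K+1)^2 >= 3K^2, i.e. K <= 1/(sqrt(3) - 1) ~= 1.37."""
    k = rel
    l = math.sqrt((k + 1) ** 2 - 3 * k ** 2)  # Eq. (10)
    return k, l

k, l = area_balanced_params(0.8)
pos_area = k ** 2                      # center-box area (relative units)
neg_area = (k + 1) ** 2 - l ** 2       # ring between outbox1 and outbox2
print(pos_area / neg_area)             # -> 0.333..., the 1:3 ratio of Eq. (9)
```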
Grade Loss: Our IBM is trained with the following multi-task loss:

    L_multi = λ_c · L_cls + λ_r · L_reg + λ_p · L_pa    (11)

where L_cls is the classification loss commonly used in detection tasks (Ren et al. 2015), L_reg is the regression loss incorporating grade information, and L_pa is an additional loss for predicted-anchor localization. For L_reg, training with bounding boxes of low Rel leads to inaccurate localization, so a proper way is to link the regression loss to Rel and thereby adjust the proportion of loss contributed by labels with different Rel:

    L_reg = Σ_i Rel_i · p*_i · L_r(t_i, t*_i)    (12)

where t_i is a vector representing the 4 coordinates of the predicted bounding box, t*_i is that of the Graded-Label box associated with a positive anchor, and L_r is the smooth L1 loss. This loss makes boxes with lower reliability contribute less to the regression loss. For L_pa, we use the sum of two log losses over two classifications (positive vs. non-positive samples, and negative vs. non-negative samples):

    L_pa = Σ_i L_log(P_i, P*_i) + Σ_j L_log(N_j, N*_j)    (13)
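As an illustration, the grade-aware regression term of Eq. (12) can be written in a few lines of PyTorch. This is a minimal sketch under our own tensor-layout assumptions (per-anchor rows, a positive mask for p*, per-anchor reliabilities Rel), not the authors' released code; the normalization by the number of positives is also our choice, since Eq. (12) is written as a plain sum.

```python
import torch
import torch.nn.functional as F

def grade_aware_reg_loss(pred_deltas, target_deltas, rel, pos_mask):
    """Eq. (12): smooth-L1 regression loss weighted per anchor by the
    reliability Rel of its matched Graded-Label box.

    pred_deltas, target_deltas: (N, 4) predicted / target box offsets
    rel:      (N,) reliability of the matched Graded-Label box
    pos_mask: (N,) 1.0 for positive anchors (p*_i), 0.0 otherwise
    """
    per_anchor = F.smooth_l1_loss(pred_deltas, target_deltas,
                                  reduction="none").sum(dim=1)  # L_r
    weighted = rel * pos_mask * per_anchor   # Rel_i * p*_i * L_r(t_i, t*_i)
    return weighted.sum() / pos_mask.sum().clamp(min=1)

# Toy usage: 3 anchors, 2 positive with different reliabilities.
pred = torch.zeros(3, 4)
target = torch.ones(3, 4)
loss = grade_aware_reg_loss(pred, target,
                            rel=torch.tensor([1.2, 0.5, 0.0]),
                            pos_mask=torch.tensor([1.0, 1.0, 0.0]))
```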
Experiments

Datasets and Evaluation Metrics

The training datasets used in all experiments are three challenging object-detection benchmarks: PASCAL VOC 2007 (5011 images for training, 4952 for testing) and PASCAL VOC 2012 (Everingham et al. 2015) (11540 images for training, 10991 for testing), which are widely used as WSOD benchmarks, and MS COCO 2014 (Lin et al. 2014) (about 80K images for training, 40K for validation), which is popular for supervised object detection but rarely used in WSOD. In all experiments, we use only image-level labels. For evaluation on VOC 2007 and 2012, we use two kinds of measurements: 1) Average Precision (AP) and the mean of AP (mAP) on the test set; 2) CorLoc on the trainval set. All metrics follow the PASCAL criterion. For evaluation on MS COCO, we use the two main metrics AP and AP50, following the standard MS COCO criterion.

Implementation Details

All experiments are implemented in PyTorch on 4 NVIDIA GeForce RTX 2080 Ti GPUs. All settings of our four baselines (OICR, PCL, MELM, and C-MIL) are kept identical to (Tang et al. 2017; Tang et al. 2020; Wan et al. 2018, 2019). For BGM, both λ1 and λ2 are set to 0.5; the thresholds T_nms1 and T_st are set to 0.8 and 0.7, and T_nms2 and T_IoS to 0.7 and 0.6, respectively. For IBM, VGG16 is adopted as the backbone network. We set λ_c and λ_r to 1 and λ_p to 0.5. The batch size is 16; the learning rate is initialized to 1e-3, decreased to 1e-4 and 1e-5 at epochs 3 and 6, and training stops at epoch 7.

Ablation Studies

We conduct ablation experiments on the COCO 2014 and VOC 2007 datasets, covering the influence of each part of GradingNet and the individual influence of IBM.

Influence of GradingNet: We present GradingNet as a combination of a Head, a Neck, and a Body, where each part has several alternative choices. The experimental results are shown in Table 1. We observe that even with the simple Neck and Body (TS and Fast R-CNN), the results improve over using only the original weakly-supervised method: 0.6% AP improvement for OICR and 0.9% for PCL, plus 0.5% AP50 improvement for OICR and 0.3% for PCL. We attribute this to the selective ability of the Neck and the regression ability of the Body. 1) The choice of Neck: with the same Body (Fast R-CNN, Faster R-CNN, or IBM), our BGM performs better than TS and PGE on most evaluation metrics. For example, using OICR as the Head and Fast R-CNN as the Body, the framework with our BGM achieves 8.3% AP, which is 0.8% higher than with TS as the Neck and 0.6% higher than with PGE. Moreover, only our BGM yields a significant improvement across all weakly-supervised methods, while TS and PGE perform poorly with MELM. This shows that the BGM, which exploits the instability of the weakly-supervised detector, is more general. 2) The choice of Body: similarly, with the same Neck (TS, PGE, or BGM), our IBM performs better than Fast R-CNN and Faster R-CNN on most evaluation metrics, mainly because IBM fits the Graded-Labels better.

Head | Neck | Body   | AP   | AP50 | AP75 | APS | APM  | APL
OICR | -    | -      | 6.9  | 16.5 | 6.1  | 3.1 | 7.8  | 12.9
OICR | TS   | Fast   | 7.5  | 17.0 | 6.3  | 3.2 | 8.4  | 13.1
OICR | TS   | Faster | 7.3  | 17.1 | 5.8  | 3.1 | 8.6  | 13.5
OICR | TS   | IBM    | 7.8  | 18.0 | 7.0  | 3.1 | 8.9  | 14.1
OICR | PGE  | Fast   | 7.7  | 18.5 | 7.3  | 3.0 | 9.0  | 14.7
OICR | PGE  | Faster | 7.2  | 17.4 | 7.0  | 3.1 | 8.5  | 13.6
OICR | PGE  | IBM    | 7.9  | 18.6 | 8.5  | 3.0 | 9.2  | 15.3
OICR | BGM  | Fast   | 8.3  | 18.9 | 8.1  | 3.3 | 9.4  | 16.0
OICR | BGM  | Faster | 8.0  | 19.0 | 8.2  | 3.2 | 9.1  | 15.4
OICR | BGM  | IBM    | 8.7  | 20.1 | 8.5  | 3.3 | 9.8  | 16.9
Gain over OICR       | +1.8 | +3.6 | +2.4 | +0.2 | +2.0 | +4.0
PCL  | -    | -      | 8.3  | 19.1 | 7.6  | 3.3 | 9.5  | 15.3
PCL  | TS   | Fast   | 9.2  | 19.4 | 8.1  | 3.3 | 10.2 | 16.2
PCL  | TS   | Faster | 8.9  | 19.1 | 8.0  | 3.4 | 10.5 | 16.0
PCL  | TS   | IBM    | 9.5  | 19.3 | 8.2  | 3.2 | 10.5 | 16.8
PCL  | PGE  | Fast   | 9.4  | 18.6 | 8.3  | 3.1 | 10.7 | 15.8
PCL  | PGE  | Faster | 9.3  | 18.9 | 8.3  | 3.2 | 10.4 | 16.5
PCL  | PGE  | IBM    | 9.5  | 19.0 | 8.1  | 3.4 | 10.9 | 16.5
PCL  | BGM  | Fast   | 10.1 | 22.6 | 8.2  | 3.6 | 10.6 | 17.0
PCL  | BGM  | Faster | 9.9  | 21.4 | 7.8  | 3.5 | 10.4 | 16.3
PCL  | BGM  | IBM    | 10.8 | 22.9 | 9.0  | 3.5 | 11.1 | 17.3
Gain over PCL        | +2.5 | +3.8 | +1.4 | +0.2 | +1.6 | +2.0

Table 1: Ablation study on the COCO 2014 validation set. TS denotes using only top-scoring proposals; PGE is proposed in W2F (Zhang et al. 2018) and reproduced according to the paper; Fast/Faster denote Fast/Faster R-CNN.

Influence of IBM: To study the influence of IBM, we compute the proposal recall at different IoU thresholds with the ground-truth and draw the IoU-recall curve. As shown in Figure 4, our method obtains higher recall than the RPN (Ren et al. 2015). We also report the influence of different values of the parameters K and L in IBM on recall; the values of K and L that strictly follow our Method section are optimal. We attribute this to the Predicted Anchoring and Area-balanced Sampling in IBM.

[Figure 4: IoU-recall curves for different numbers of proposals (300, 1000, and 2000) on the VOC 2007 test set. Both IBM and RPN are trained with Graded-Labels.]
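For reference, the recall-versus-IoU measurement behind Figure 4 can be sketched as follows; this is a minimal illustration under our own assumptions (corner-format boxes, NumPy arrays), not the authors' evaluation code. A ground-truth box counts as recalled at a threshold if any proposal overlaps it at least that much.

```python
import numpy as np

def iou_matrix(props, gts):
    """Pairwise IoU between proposals (N, 4) and ground truths (M, 4),
    boxes given as (x1, y1, x2, y2)."""
    x1 = np.maximum(props[:, None, 0], gts[None, :, 0])
    y1 = np.maximum(props[:, None, 1], gts[None, :, 1])
    x2 = np.minimum(props[:, None, 2], gts[None, :, 2])
    y2 = np.minimum(props[:, None, 3], gts[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_p = (props[:, 2] - props[:, 0]) * (props[:, 3] - props[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_p[:, None] + area_g[None, :] - inter)

def recall_curve(props, gts, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Fraction of ground-truth boxes matched by any proposal at each
    IoU threshold, i.e., the quantity plotted in an IoU-recall curve."""
    best = iou_matrix(props, gts).max(axis=0)  # best proposal per GT box
    return [(best >= t).mean() for t in thresholds]
```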
Comparison with State-of-the-Art

PASCAL VOC: For a fair comparison with most state-of-the-art WSOD works, we evaluate our method on the PASCAL VOC datasets. Table 2 shows the mAP on the VOC 07 test set; we show the AP of only 10 object categories, but the mAP is calculated over all 20 categories. We present the performance of GradingNet on the four baselines: with GradingNet added, OICR, PCL, MELM, and C-MIL achieve 48.1%, 49.4%, 52.5%, and 54.3% mAP, respectively, which is significantly higher than both the baselines and the baselines re-trained with a fully-supervised detector. In addition, our GradingNet-C-MIL result (54.3% mAP) surpasses all previous methods. Table 3 shows CorLoc on the VOC 07 trainval set: with GradingNet added, OICR, PCL, MELM, and C-MIL achieve 68.6%, 69.1%, 63.2%, and 72.1% CorLoc, respectively, again higher than the baselines and the re-trained baselines; GradingNet-C-MIL achieves the highest result (72.1%) and surpasses all previous methods. Table 3 also reports mAP and CorLoc on the VOC 12 test and trainval sets. The results again show a large improvement over the baselines, and GradingNet-C-MIL achieves the highest mAP (50.5%) and the second-highest CorLoc (71.9%).

Method | aero | boat | bottle | car | chair | dog | horse | mbike | person | plant | mAP
OICR (Tang et al. 2017) | 58.5 | 16.9 | 17.4 | 60.8 | 8.2 | 31.3 | 51.9 | 64.8 | 13.6 | 23.1 | 42.0
PCL (Tang et al. 2020) | 57.1 | 16.9 | 18.8 | 63.7 | 17.0 | 33.2 | 54.4 | 68.3 | 16.8 | 25.7 | 45.8
MELM (Wan et al. 2018) | 55.6 | 29.1 | 16.4 | 68.1 | 25.0 | 53.2 | 49.6 | 68.6 | 2.0 | 25.4 | 47.3
C-MIL (Wan et al. 2019) | 62.5 | 32.1 | 19.8 | 66.1 | 20.0 | 53.5 | 57.4 | 68.9 | 8.4 | 24.6 | 50.5
WSOD2 (Zeng et al. 2019) | 65.1 | 39.2 | 24.3 | 66.2 | 29.8 | 60.1 | 71.2 | 70.7 | 21.9 | 28.1 | 53.6
C-MIDN (Gao et al. 2019) | 53.3 | 26.1 | 20.3 | 69.9 | 28.7 | 64.6 | 58.0 | 71.2 | 20.0 | 27.5 | 52.6
OIM (Lin et al. 2020) | 55.6 | 27.9 | 21.1 | 68.3 | 21.3 | 54.5 | 56.5 | 70.1 | 12.5 | 25.0 | 50.1
SLV (Chen et al. 2020) | 65.6 | 37.1 | 24.6 | 70.3 | 30.8 | 61.4 | 65.3 | 68.4 | 12.4 | 29.9 | 53.5
OICR+FRCNN (Tang et al. 2017) | 65.5 | 21.6 | 22.1 | 68.5 | 5.7 | 30.3 | 64.7 | 66.1 | 13.0 | 25.6 | 47.0
PCL+FRCNN (Tang et al. 2020) | 63.2 | 22.6 | 27.3 | 69.1 | 12.0 | 37.3 | 63.3 | 63.9 | 15.8 | 23.6 | 48.8
C-MIL+FRCNN (Wan et al. 2019) | 61.8 | 28.9 | 18.9 | 69.6 | 18.5 | 66.9 | 65.9 | 65.7 | 13.8 | 22.9 | 53.1
C-MIDN+FRCNN (Gao et al. 2019) | 54.1 | 26.4 | 22.2 | 68.9 | 25.2 | 70.3 | 66.3 | 67.5 | 21.6 | 24.4 | 53.6
OIM+FRCNN (Lin et al. 2020) | 53.4 | 26.0 | 27.7 | 69.7 | 21.4 | 63.7 | 63.7 | 67.4 | 10.9 | 25.3 | 52.6
SLV+FRCNN (Chen et al. 2020) | 62.1 | 34.5 | 25.6 | 67.4 | 24.2 | 71.6 | 72.0 | 67.2 | 12.1 | 24.6 | 53.9
GradingNet-OICR (ours) | 63.2 | 23.4 | 23.2 | 68.3 | 10.6 | 41.3 | 60.1 | 68.2 | 20.2 | 22.9 | 48.1
GradingNet-PCL (ours) | 60.3 | 25.2 | 21.1 | 64.6 | 15.0 | 54.0 | 58.8 | 65.4 | 22.7 | 20.1 | 49.4
GradingNet-MELM (ours) | 57.3 | 31.6 | 31.0 | 71.8 | 29.0 | 65.6 | 71.0 | 68.7 | 29.2 | 24.1 | 52.5
GradingNet-C-MIL (ours) | 61.8 | 39.2 | 31.6 | 75.0 | 33.9 | 61.9 | 71.3 | 64.4 | 28.2 | 33.2 | 54.3

Table 2: Average precision (%) on the PASCAL VOC 2007 test set.

Method | VOC 07 CorLoc (%) | VOC 12 mAP (%) | VOC 12 CorLoc (%)
OICR (Tang et al. 2017) | 61.2 | 38.2 | 63.5
PCL (Tang et al. 2020) | 63.0 | 41.6 | 65.0
MELM (Wan et al. 2018) | 61.4 | 42.4 | -
C-MIL (Wan et al. 2019) | 65.0 | 46.7 | 67.4
WSOD2 (Zeng et al. 2019) | 69.5 | 47.2 | 71.9
C-MIDN (Gao et al. 2019) | 68.7 | 50.2 | 71.2
OIM (Lin et al. 2020) | 67.2 | 45.3 | 67.1
SLV (Chen et al. 2020) | 71.0 | 49.2 | 69.2
OICR+FRCNN (Tang et al. 2017) | 64.3 | 42.5 | 65.6
PCL+FRCNN (Tang et al. 2020) | 66.6 | 44.2 | 68.0
C-MIDN+FRCNN (Gao et al. 2019) | 71.9 | 50.3 | 73.3
OIM+FRCNN (Lin et al. 2020) | 68.8 | 46.4 | 69.5
GradingNet-OICR (ours) | 68.6 | 44.3 | 68.9
GradingNet-PCL (ours) | 69.1 | 47.0 | 69.3
GradingNet-MELM (ours) | 63.2 | 48.6 | 62.8
GradingNet-C-MIL (ours) | 72.1 | 50.5 | 71.9

Table 3: CorLoc (%) on the PASCAL VOC 2007 trainval set, and mAP (%) / CorLoc (%) on the PASCAL VOC 2012 test and trainval sets.

MS COCO: We also report results on the COCO 2014 validation set in Table 4. It is worth noting that only a few methods report results on the COCO dataset. Nevertheless, we present the performance of GradingNet on all four baselines: all four achieve high performance, and GradingNet-C-MIL reaches 11.6% AP and 25.0% AP50, setting a new state-of-the-art.

Method | AP | AP50
PCL (Tang et al. 2020) | 8.5 | 19.4
C-MIDN (Gao et al. 2019) | 9.6 | 21.4
WSOD2 (Zeng et al. 2019) | 10.8 | 22.7
OICR+FRCNN (Tang et al. 2017) | 7.7 | 17.4
PCL+FRCNN (Tang et al. 2020) | 9.2 | 19.6
GradingNet-OICR (ours) | 8.7 | 20.1
GradingNet-PCL (ours) | 10.8 | 22.9
GradingNet-MELM (ours) | 11.0 | 22.6
GradingNet-C-MIL (ours) | 11.6 | 25.0

Table 4: AP and AP50 on the COCO 2014 validation set.

Qualitative Results and Discussion

[Figure 5: Example results from GradingNet-C-MIL and C-MIL. Green/purple boxes indicate correct/failure cases of GradingNet-C-MIL; red boxes indicate cases of C-MIL.]

Figure 5 exhibits some cases of GradingNet. We can observe that GradingNet covers whole objects well, while detecting close or overlapping objects remains a challenge.
It is worth mentioning that for the "person" class, our detection performance is better than that of most weakly-supervised object detectors. The reason is that the Inclusion Principle we propose caters to the structure of the human body. In addition, to the best of our knowledge, GradingNet is a rare framework that focuses on the better usage of the candidate bounding boxes generated by standard one-stage weakly-supervised methods; this aspect, ignored by traditional WSOD works, is worth studying.

Conclusion

In this paper, we propose a novel framework, GradingNet, which regards the classical WSOD problem as a preliminary WSOD problem plus a mixed problem of incomplete and inaccurate supervision. We address the former with the Boxes Grading Module (BGM) and the latter with the Informative Boosting Module (IBM). The proposed GradingNet achieves state-of-the-art performance on the COCO, VOC 07, and VOC 12 benchmarks.

Acknowledgments

This work is supported in part by the National Key Research and Development Program of China (2017YFC1703503), in part by the National Natural Science Foundation of China (61972022, 61532005, U1936212), and in part by the Fundamental Research Funds for the Central Universities (2018JBZ001).

References

Bilen, H.; Pedersoli, M.; and Tuytelaars, T. 2015. Weakly supervised object detection with convex clustering. In CVPR, 1081-1089.
Bilen, H.; and Vedaldi, A. 2016. Weakly supervised deep detection networks. In CVPR, 2846-2854.
Bouchacourt, D.; Nowozin, S.; and Pawan Kumar, M. 2015. Entropy-based latent structured output prediction. In ICCV, 2920-2928.
Chen, Z.; Fu, Z.; Jiang, R.; Chen, Y.; and Hua, X.-S. 2020. SLV: Spatial Likelihood Voting for Weakly Supervised Object Detection. In CVPR, 12995-13004.
Cinbis, R. G.; Verbeek, J.; and Schmid, C. 2017. Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning. T-PAMI 39(1): 189-203.
Dai, J.; Li, Y.; He, K.; and Sun, J. 2016. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 379-387.
Everingham, M.; Eslami, S. A.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2015. The PASCAL visual object classes challenge: A retrospective. IJCV 111(1): 98-136.
Gao, Y.; Liu, B.; Guo, N.; Ye, X.; Wan, F.; You, H.; and Fan, D. 2019. C-MIDN: Coupled Multiple Instance Detection Network With Segmentation Guidance for Weakly Supervised Object Detection. In ICCV, 9834-9843.
Girshick, R. 2015. Fast R-CNN. In ICCV, 1440-1448.
Gokberk Cinbis, R.; Verbeek, J.; and Schmid, C. 2014. Multi-fold MIL training for weakly supervised object localization. In CVPR, 2409-2416.
Hoffman, J.; Pathak, D.; Darrell, T.; and Saenko, K. 2015. Detector discovery in the wild: Joint multiple instance and representation learning. In CVPR, 2883-2891.
Kantorov, V.; Oquab, M.; Cho, M.; and Laptev, I. 2016. ContextLocNet: Context-aware deep network models for weakly supervised localization. In ECCV, 350-365.
Li, D.; Huang, J.-B.; Li, Y.; Wang, S.; and Yang, M.-H. 2016. Weakly supervised object localization with progressive domain adaptation. In CVPR, 3512-3520.
Lin, C.; Wang, S.; Xu, D.; Lu, Y.; and Zhang, W. 2020. Object Instance Mining for Weakly Supervised Object Detection. In AAAI, 11482-11489.
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017. Feature pyramid networks for object detection. In CVPR, 2117-2125.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV, 740-755.
Miller, K.; Kumar, M. P.; Packer, B.; Goodman, D.; and Koller, D. 2012. Max-margin min-entropy models. In AISTATS, 779-787.
Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In CVPR, 779-788.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 91-99.
Song, H. O.; Lee, Y. J.; Jegelka, S.; and Darrell, T. 2014. Weakly-supervised discovery of visual pattern configurations. In NIPS, 1637-1645.
Tang, P.; Wang, X.; Bai, S.; Shen, W.; Bai, X.; Liu, W.; and Yuille, A. 2020. PCL: Proposal Cluster Learning for Weakly Supervised Object Detection. T-PAMI 42(1): 176-191.
Tang, P.; Wang, X.; Bai, X.; and Liu, W. 2017. Multiple instance detection network with online instance classifier refinement. In CVPR, 2843-2851.
Tang, P.; Wang, X.; Wang, A.; Yan, Y.; Liu, W.; Huang, J.; and Yuille, A. 2018. Weakly supervised region proposal network and object detection. In ECCV, 352-368.
Wan, F.; Liu, C.; Ke, W.; Ji, X.; Jiao, J.; and Ye, Q. 2019. C-MIL: Continuation multiple instance learning for weakly supervised object detection. In CVPR, 2199-2208.
Wan, F.; Wei, P.; Jiao, J.; Han, Z.; and Ye, Q. 2018. Min-entropy latent model for weakly supervised object detection. In CVPR, 1297-1306.
Wang, J.; Chen, K.; Yang, S.; Loy, C. C.; and Lin, D. 2019. Region proposal by guided anchoring. In CVPR, 2965-2974.
Wang, X.; Zhu, Z.; Yao, C.; and Bai, X. 2015. Relaxed multiple-instance SVM with application to object discovery. In ICCV, 1224-1232.
Wei, Y.; Shen, Z.; Cheng, B.; Shi, H.; Xiong, J.; Feng, J.; and Huang, T. 2018. TS2C: Tight box mining with surrounding segmentation context for weakly supervised object detection. In ECCV, 434-450.
Yang, K.; Li, D.; and Dou, Y. 2019. Towards precise end-to-end weakly supervised object detection network. In ICCV, 8372-8381.
Ye, Q.; Zhang, T.; Ke, W.; Qiu, Q.; Chen, J.; Sapiro, G.; and Zhang, B. 2017. Self-learning scene-specific pedestrian detectors using a progressive latent model. In CVPR, 509-518.
Yu, C.-N. J.; and Joachims, T. 2009. Learning structural SVMs with latent variables. In ICML, 1169-1176.
Zeng, Z.; Liu, B.; Fu, J.; Chao, H.; and Zhang, L. 2019. WSOD2: Learning bottom-up and top-down objectness distillation for weakly-supervised object detection. In ICCV, 8292-8300.
Zhang, Y.; Bai, Y.; Ding, M.; Li, Y.; and Ghanem, B. 2018. W2F: A weakly-supervised to fully-supervised framework for object detection. In CVPR, 928-936.
Zhou, Z.-H. 2018. A brief introduction to weakly supervised learning. National Science Review 5(1): 44-53.