YOLO9000: Better, Faster, Stronger

Joseph Redmon, Ali Farhadi

CVPR 2017, Best Paper Honorable Mention

Assigned_Reviewer_2

Paper Summary

The paper introduces YOLO9000, an improvement on the original YOLO detector.

The detection improvements come from the following: 1. Using anchor boxes instead of the original grid-based approach; the anchor sizes are chosen by k-means clustering instead of by hand. 2. Fusing low-level features with high-level features; the low-level feature map is reshaped (downsampled) to the same spatial size as the high-level feature map. 3. Multi-scale training. Similar to data augmentation such as random cropping, this makes the network robust to different object scales. Compared with random cropping, this approach makes it easier to augment smaller objects.
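As an illustration of the first point, the anchor-size selection can be sketched as k-means over ground-truth box dimensions. This is a minimal sketch, not the authors' code: it assumes boxes are given as (width, height) pairs and uses 1 - IoU as the clustering distance; the function names, the toy data, and the seeding strategy are hypothetical.

```python
import random

def iou_wh(box, centroid):
    """IoU of two boxes given as (width, height), aligned at a common corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs with distance d = 1 - IoU; returns k sorted centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumption: seed centroids from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # The highest IoU is the smallest 1 - IoU distance.
            best = max(range(k), key=lambda j: iou_wh(b, centroids[j]))
            clusters[best].append(b)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_centroids.append((w, h))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)

# Toy data (hypothetical): three small boxes and three large ones.
boxes = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.5, 4.5), (4.8, 5.2)]
anchors = kmeans_anchors(boxes, k=2)
```

Using an IoU-based distance rather than Euclidean distance keeps large boxes from dominating the clustering error, which is why it is a natural choice for picking priors.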

The detector can also detect more than 9000 classes, since it is jointly trained with the ImageNet classification dataset. When training on a classification label, it simply backpropagates the loss through the box with the highest probability. Class labels are organized as a tree, which allows labels to be shared between detection and classification, since detection datasets usually have coarse-grained labels (parent nodes in the tree).
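The tree-structured labels can be illustrated with a small sketch. It assumes a softmax over each group of siblings, so every node carries a probability conditioned on its parent, and the absolute probability of a node is the product of the conditionals along the path to the root (the root's probability is taken as 1 here). The toy hierarchy and the function names are hypothetical and not taken from the paper.

```python
import math

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "object": None,
    "animal": "object", "vehicle": "object",
    "dog": "animal", "cat": "animal",
    "terrier": "dog", "husky": "dog",
}

def sibling_groups(parent):
    """Map each parent to the list of its children."""
    groups = {}
    for node, par in parent.items():
        if par is not None:
            groups.setdefault(par, []).append(node)
    return groups

def conditional_probs(logits, parent):
    """Softmax within each sibling group, giving P(node | parent)."""
    cond = {}
    for children in sibling_groups(parent).values():
        m = max(logits[c] for c in children)  # subtract the max for stability
        exps = {c: math.exp(logits[c] - m) for c in children}
        total = sum(exps.values())
        for c in children:
            cond[c] = exps[c] / total
    return cond

def absolute_prob(node, cond, parent):
    """Multiply conditionals along the path from node up to the root."""
    p = 1.0
    while parent[node] is not None:
        p *= cond[node]
        node = parent[node]
    return p

logits = {name: 0.0 for name in PARENT if PARENT[name] is not None}
cond = conditional_probs(logits, PARENT)
p_terrier = absolute_prob("terrier", cond, PARENT)
```

With this layout, a coarse detection label such as "dog" only supervises the nodes on the path from "dog" to the root, while a fine-grained classification label such as "terrier" supervises one level deeper.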

Paper Strengths.

Novelty includes: 1. Co-training of the detection and classification tasks, using a hierarchical tree representation for class labels. The result is impressive: the network can detect more than 9000 classes in the detection task. 2. New techniques such as automatic anchor sizing, feature fusion, and multi-scale training, which improve detection performance.

Results: The result on VOC 2012 is impressive; the model is faster than others with similar performance.

Paper Weaknesses.

The evaluation result on COCO isn't impressive.
The detection performance could be further improved by using more feature maps from different layers. Using only the last layer (13x13, downsampled by 32) may affect the performance of detecting smaller objects. The solution mentioned in the paper is to 'downsample' the low-level features, which in a sense loses the high-resolution advantage (useful for detecting small objects). An alternative would be to upsample the high-level features and then combine them with the low-level features, which would be more effective.
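The two fusion options contrasted above can be compared on toy feature maps. The sketch below is only an illustration under stated assumptions: feature maps are plain nested lists in height x width x channels layout, the "downsample" option is modelled as a space-to-depth rearrangement, and the alternative is modelled as nearest-neighbour upsampling. All function names are hypothetical.

```python
def space_to_depth(fmap, block=2):
    """Rearrange H x W x C into (H/block) x (W/block) x (C*block*block)."""
    h, w = len(fmap), len(fmap[0])
    assert h % block == 0 and w % block == 0
    out = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            cell = []
            for di in range(block):
                for dj in range(block):
                    cell.extend(fmap[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out

def upsample_nearest(fmap, factor=2):
    """Repeat every pixel factor x factor times (nearest-neighbour upsampling)."""
    out = []
    for row in fmap:
        wide = [list(px) for px in row for _ in range(factor)]
        for _ in range(factor):
            out.append([list(px) for px in wide])
    return out

def concat_channels(a, b):
    """Concatenate two maps of equal spatial size along the channel axis."""
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]

def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))

low = [[[i * 4 + j] for j in range(4)] for i in range(4)]   # 4 x 4 x 1, high resolution
high = [[[0, 0, 0] for _ in range(2)] for _ in range(2)]    # 2 x 2 x 3, low resolution

fused_down = concat_channels(space_to_depth(low), high)     # fuse at low resolution
fused_up = concat_channels(low, upsample_nearest(high))     # fuse at high resolution
```

The first option keeps every low-level value but moves spatial detail into channels, so predictions are still made on the coarse grid; the second keeps the fine grid at the cost of a larger map for the layers that follow.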

Preliminary Rating.

Weak Accept

Preliminary Evaluation.

This paper extends YOLO to detect 9000 classes of objects and proposes some interesting ideas for training detectors. However, the result on COCO isn't impressive.

New exciting ideas.

Yes

Confidence.

Very Confident

Final Recommendation

oral/poster

Assigned_Reviewer_3

Paper Summary

This paper proposes an improved version of YOLO, YOLOv2, through a variety of techniques and a lightweight architecture. YOLOv2 is a state-of-the-art detector and is much faster than other existing detectors. The paper further proposes to construct WordTree, a hierarchical model of visual concepts, for performing multiple softmax operations. This allows datasets from different sources to be jointly trained. With this method, YOLO9000 can detect more than 9000 classes of objects.

Paper Strengths.

There are two main strengths. First, the proposed detector adopts some non-trivial modifications and achieves state-of-the-art performance: YOLOv2 achieves the highest mean average precision while being much faster than Faster R-CNN and SSD. This is important for real-time applications. Second, the proposed hierarchical model of visual concepts is significant. It captures the structure of labels and allows different datasets to be mixed, even though the labels from these datasets are not mutually exclusive.

Paper Weaknesses.

More details about the construction of the hierarchical model WordTree, as well as about combining different datasets using WordTree, would be preferable. More experiments for YOLO9000 would also be preferable.

Preliminary Rating.

Strong Accept

Preliminary Evaluation.

(Q1) Lines 583-589: when the proposed network sees an image for detection, backpropagation is based on the full loss function. When it sees an image for classification, backpropagation is based only on the classification-specific parts of the architecture. How is this achieved? Detailed explanations are needed. (Q2) In "Convolutional With Anchor Boxes" in Section 2, why does predicting offsets simplify the problem of predicting the coordinates of bounding boxes?
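One possible reading of (Q1) can be sketched as follows. This is an assumption for illustration, not the authors' confirmed mechanism: for a classification-only image, the box whose prediction is most confident in the labelled class is selected, and only a classification term (here a negative log-likelihood) is computed for that box, so no localization term contributes gradient. The function name and data layout are hypothetical.

```python
import math

def classification_only_loss(box_class_probs, label):
    """Select the predicted box most confident in `label`.

    Returns (box index, negative log-likelihood of `label` for that box).
    Box coordinates are deliberately ignored: there is no ground-truth box.
    """
    best = max(range(len(box_class_probs)),
               key=lambda i: box_class_probs[i][label])
    return best, -math.log(box_class_probs[best][label])

# Hypothetical per-box class probabilities for one classification image.
probs = [
    {"dog": 0.2, "cat": 0.8},
    {"dog": 0.5, "cat": 0.5},
    {"dog": 0.1, "cat": 0.9},
]
index, loss = classification_only_loss(probs, "dog")
```

For a detection image, the same class term would be added to the box-regression and objectness terms for the boxes matched to ground truth.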

New exciting ideas.

Yes

Reproducibility.

Yes. Implementation details are sufficient and clear. Source code is also provided.

Confidence.

Confident

Final comments:

Overall, this is a good paper. I am not changing my rating. The authors somewhat clarified my questions, although they did not address them in as much detail as I would have liked. I do hope they release the code, but details such as the tree construction should be described in the paper rather than left to the code.

Final Recommendation

oral/poster

Assigned_Reviewer_4

Paper Summary

This paper presents modifications to the original YOLO detection method which make it the fastest method with performance comparable to any of the leading methods (such as SSD or Faster R-CNN with ResNet) on a range of datasets (VOC 2007, VOC 2012, MS COCO). There are several incremental improvements, including using batch normalization, fine-tuning at high resolution on ImageNet, and using fine-grained features passed through from a higher level. The paper also describes several insightful improvements, including intelligently choosing box dimension priors and predicting locations from a well-constrained parameterization. On top of all this (and I've left out several other improvements), the paper also introduces a useful way to combine the massive amount of classification data into the detection framework using a hierarchical classification scheme. The final results can detect 9000 types of objects although, not surprisingly, at this very fine granularity, performance is not stellar.
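The "well-constrained parameterization" can be illustrated with a short decoding sketch. It assumes the commonly cited YOLOv2 form, in which the box center is the grid-cell offset plus a sigmoid of the raw output and the box size is a prior scaled by an exponential; coordinates are in grid-cell units, and the function and argument names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t, cell, prior):
    """Turn raw outputs (tx, ty, tw, th) into a box (bx, by, bw, bh).

    cell  = (cx, cy): top-left corner of the responsible grid cell
    prior = (pw, ph): width and height of the anchor (dimension prior)
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    bx = cx + sigmoid(tx)      # center stays inside the cell: cx <= bx <= cx + 1
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)     # size is a positive rescaling of the prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh

box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(2.0, 3.0))
```

Because the sigmoid bounds the center to its own cell, a poorly initialized network cannot place a box anywhere in the image, which is one way such a parameterization can make early training more stable.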

Paper Strengths.

There are several excellent novel technical improvements to object detection which retain real-time speed and state of the art performance. The paper describes a new and interesting way to combine detection and classification training data for detection of 9000 objects - many of which have no ground truth localization.

Paper Weaknesses.

The paper is very well written, the results are very impressive, particularly the speed. I don't really think the paper has any significant weaknesses. As mentioned in the paper, the mAP for detection of many objects is fairly low but is based on little relevant training data. It might be nice to include more explicit information about the running speed of YOLO9000.

Preliminary Rating.

Strong Accept

Preliminary Evaluation.

Strong, clear paper with excellent results.

New exciting ideas.

Yes

Reproducibility.

Code will be available.

Confidence.

Very Confident

Final comments:

It was basically impossible not to know who the author was.

Final Recommendation

oral

Meta_Reviewer_1

The paper has several strengths and received a unanimous acceptance recommendation from the reviewers. The reviewers appreciated the impressive results, with the network able to detect 9000 classes. Several algorithmic contributions are included, such as feature fusion and multi-scale training. YOLOv2 has the highest mean average precision and is much faster than competing methods, enabling real-time detection. The paper provides a way to combine detection and classification training data, co-training for both detection and classification.