[Object detection] YOLO9000: Better, Faster, Stronger translation and summary (YOLOv2)

Paper review/Object Detection 2021. 11. 22. 15:39

@ Bold text marks important content; red text marks content I added

https://arxiv.org/pdf/1612.08242.pdf

Main contributions:

Proposes YOLOv2, which improves on YOLO in both accuracy and speed
Proposes YOLO9000, which can distinguish more than 9000 categories (COCO, an existing detection dataset, has 80 classes)

0. Abstract

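YOLOv2, which proposes a range of improvements over the original YOLO, achieves SOTA in both speed and accuracy (76.8 mAP at 67 FPS and 78.6 mAP at 40 FPS on VOC 2007).

We propose a method to jointly train on object detection and classification, training on the COCO dataset and the ImageNet dataset at the same time. This lets the model detect object classes that have no labelled detection data.

It can detect more than 9000 categories in real time.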

1. Introduction

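An object detector should be fast, accurate, and able to detect a wide variety of objects. Since the introduction of neural networks, object detection has become faster and more accurate, but it is still limited to a small set of object categories.

Object detection is limited compared with classification (detection datasets have up to a few hundred categories, while classification datasets have tens to hundreds of thousands).
We therefore want to scale detection to the level of classification. However, labelling images for object detection is far more expensive than labelling for classification, so detection datasets at classification scale are unlikely to appear.

We therefore extend detection by exploiting classification datasets through a hierarchical view of their labels.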

We also propose a joint training algorithm that can train an object detector on both detection and classification data. Detection data teaches the model to localize objects, while classification data increases its vocabulary and robustness.

With this we propose YOLO9000, which can detect more than 9000 object categories.

We first improve YOLO to obtain YOLOv2, then use the joint training algorithm to obtain YOLO9000.

Code and pre-trained models are available at http://pjreddie.com/yolo9000/.

2. Better

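YOLO has several shortcomings. Compared with Fast R-CNN, YOLO makes many localization errors (errors in bounding box coordinates). It also has low recall compared with region-based methods (it predicts only 98 boxes per image: 2 boxes for each of the 7 x 7 grid cells). We therefore focus on reducing localization error and improving recall.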

Recall

Recall = the fraction of actual positives that the model predicts as positive = TP/(TP+FN)

Precision = the fraction of the model's positive predictions that are actually positive = TP/(TP+FP)
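As a quick check of the two definitions above, here is a minimal sketch. The function name `precision_recall` and the counts are made up for illustration and do not come from the paper.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    # Guard against division by zero when there are no predictions or no ground truths.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 8 correct detections, 2 spurious boxes, 4 missed objects.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={p:.3f}, recall={r:.3f}")
```

Predicting more boxes per image tends to lower FN (higher recall) at the risk of raising FP (lower precision), which is the trade-off discussed for anchor boxes later in this section.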

YOLOv2 is faster and more accurate; to get there, we simplify the network and make the representation easier to learn. The comparison with YOLO is summarized in Table 2 of the paper.

Batch Normalization.

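Batch normalization significantly improves convergence while removing the need for other forms of regularization. Adding batch normalization to all of YOLO's convolutional layers improves mAP by more than 2%. Batch normalization also helps regularize the model: with it, dropout can be removed from the model without overfitting.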

High Resolution Classifier.

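Most existing models train the classifier at a low resolution, so the detector then has to adjust to a new input resolution. YOLOv2 first fine-tunes the classification network at the full 448 × 448 resolution for 10 epochs on ImageNet, and then fine-tunes the network for detection. This high-resolution classification network gives an increase of almost 4% mAP.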

Convolutional With Anchor Boxes.

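YOLO predicts bounding box coordinates with FC layers on top of the conv feature map. We remove the FC layers from YOLO and predict bounding boxes with anchor boxes. The input image is 416 x 416 and is downsampled by a factor of 32, giving a 13 x 13 feature map.

For each anchor box we predict class and objectness (the objectness prediction is the IoU value, and the class prediction is the conditional probability of the class given that an object exists).

Using anchor boxes slightly decreases accuracy but increases recall (without anchor boxes: 69.5 mAP with 81% recall; with anchor boxes: 69.2 mAP with 88% recall).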

Dimension Clusters.

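Two problems arise when using anchor boxes. The first is that the box dimensions have to be picked by hand.

Instead of picking them by hand, we run k-means clustering on the training set bounding boxes to find good box dimensions. Euclidean distance would penalize large boxes more than small ones, so the distance metric is based on IoU:

d(box, centroid) = 1 - IOU(box, centroid)

As Table 1 of the paper shows, 5 boxes chosen with Cluster IoU perform similarly to 9 anchor boxes, and 9 Cluster IoU boxes give a higher Avg IoU.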
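The clustering step can be sketched in pure Python as below. This is only an illustration under my own assumptions: boxes are reduced to (width, height) pairs aligned at a shared corner, centroids are updated with the mean of their members (the paper does not spell out the update rule), and the toy box sizes and the names `iou_wh`, `kmeans_iou`, and `avg_iou` are hypothetical.

```python
import random


def iou_wh(box, centroid):
    """IoU of two boxes given as (width, height), both anchored at the same corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_iou(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs using d = 1 - IoU as the distance; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # The smallest distance 1 - IoU is the same as the largest IoU.
            best = max(range(k), key=lambda j: iou_wh(b, centroids[j]))
            clusters[best].append(b)
        new_centroids = []
        for j, members in enumerate(clusters):
            if not members:
                # Keep the old centroid if no box was assigned to it.
                new_centroids.append(centroids[j])
                continue
            w = sum(m[0] for m in members) / len(members)
            h = sum(m[1] for m in members) / len(members)
            new_centroids.append((w, h))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def avg_iou(boxes, centroids):
    """Mean IoU between each box and its closest centroid (the Avg IoU column)."""
    return sum(max(iou_wh(b, c) for c in centroids) for b in boxes) / len(boxes)


# Toy (w, h) pairs, not taken from VOC or COCO.
boxes = [(10, 20), (12, 22), (11, 19), (50, 30), (55, 33), (48, 28), (100, 100), (95, 105)]
priors = kmeans_iou(boxes, k=3)
print(sorted(priors), round(avg_iou(boxes, priors), 3))
```

Using 1 - IoU instead of Euclidean distance means a 10-pixel mismatch on a large box costs far less than the same mismatch on a small box, which is the motivation stated above.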

Direct location prediction.

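The second problem with anchor boxes is instability when predicting box locations during early iterations. In a region proposal network the center coordinates (x, y) are computed as follows (e.g. tx = 1 shifts the box to the right by the width of the anchor box):

x = (tx * wa) + xa
y = (ty * ha) + ya

This formulation is unconstrained, so a box can end up anywhere in the image regardless of which location predicted it, and it takes a long time to stabilize (even a prediction made at position (1,1) can land at (400,400) if tx is large).
To solve this, we predict location coordinates relative to the grid cell.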

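The network predicts 5 bounding boxes at each cell of the feature map. For each bounding box it predicts 5 values: tx, ty, tw, th, and to. If the cell is offset from the top-left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, the predictions are:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw * e^tw
bh = ph * e^th
Pr(object) * IOU(b, object) = σ(to)

Because the location prediction is constrained, the parameterization is easier to learn and the network is more stable. Dimension clusters together with direct location prediction improve YOLO by almost 5%.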
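The paper's parameterization is bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw·e^tw, bh = ph·e^th, with σ(to) as the confidence. A minimal decoding sketch follows; the function name `decode_box` and the example prior size are my own assumptions, and all values are in grid-cell units.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Map raw outputs (tx, ty, tw, th, to) to (bx, by, bw, bh, confidence)."""
    tx, ty, tw, th, to = t
    cx, cy = cell      # top-left offset of the grid cell
    pw, ph = prior     # width and height of the bounding box prior
    bx = sigmoid(tx) + cx      # the center stays inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)     # the prior is scaled, never negative
    bh = ph * math.exp(th)
    return bx, by, bw, bh, sigmoid(to)


# All-zero outputs at cell (6, 6) with a hypothetical 3.0 x 2.0 prior.
print(decode_box((0, 0, 0, 0, 0), (6, 6), (3.0, 2.0)))
# -> (6.5, 6.5, 3.0, 2.0, 0.5)
```

Because σ maps any tx into (0, 1), the predicted center cannot leave the cell that predicted it, which is what removes the instability of the unconstrained offsets.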

Fine-Grained Features.

The feature map grows from the original 7x7 to 13x13. In addition, a passthrough layer is added, which improves detection of smaller objects (a 1% performance gain).
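The paper describes the passthrough layer as stacking adjacent features of an earlier 26 × 26 × 512 map into channels, giving 13 × 13 × 2048, which is then concatenated with the 13 × 13 features. A minimal pure-Python sketch of that rearrangement is below. The name `space_to_depth` and the channel ordering are my assumptions; Darknet's own layer may order the channels differently.

```python
def space_to_depth(fmap, block=2):
    """Rearrange an H x W x C nested list into (H/block) x (W/block) x (C*block*block)."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            stacked = []
            # Collect the block x block neighbourhood into one long channel vector.
            for di in range(block):
                for dj in range(block):
                    stacked.extend(fmap[i + di][j + dj])
            row.append(stacked)
        out.append(row)
    return out


# Tiny 4 x 4 x 1 map whose single channel holds the flat pixel index.
fmap = [[[r * 4 + c] for c in range(4)] for r in range(4)]
out = space_to_depth(fmap)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 4
print(out[0][0])                              # [0, 1, 4, 5]
```

No values are discarded: spatial resolution is halved and the channel count is multiplied by 4, so the fine-grained information survives at the coarser grid.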

Multi-Scale Training.

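To make the model robust to different input sizes, instead of fixing the input image size we randomly choose a new image size every 10 batches (since the network downsamples by a factor of 32, sizes are drawn from multiples of 32: 320, 352, ..., 608). This lets the same network run detection at different resolutions.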
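The schedule can be sketched as follows. This is a simplified illustration: the generator name `input_size_schedule` and the fixed seed are my assumptions, and the actual resizing of the network is left out.

```python
import random

# Multiples of 32 from 320 to 608, matching the 32x downsampling factor.
SIZES = list(range(320, 608 + 1, 32))


def input_size_schedule(num_batches, every=10, seed=0):
    """Yield the input resolution for each batch, re-drawing it every `every` batches."""
    rng = random.Random(seed)
    size = rng.choice(SIZES)
    for batch in range(num_batches):
        if batch > 0 and batch % every == 0:
            size = rng.choice(SIZES)
        yield size


schedule = list(input_size_schedule(30))
print(SIZES)
print(schedule[0], schedule[10], schedule[20])
```

Every size divides evenly by 32, so the output grid ranges from 10 x 10 (at 320) to 19 x 19 (at 608) without any change to the convolutional weights.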

At low resolution, 288x288, it reaches 91 FPS and 69.0 mAP, which is close to the accuracy of Fast R-CNN.

At high resolution, 544x544, it reaches 40 FPS and 78.6 mAP, which is SOTA.

Further Experiments.

Table 4 of the paper compares YOLOv2 with other detectors on the VOC2012 dataset. YOLOv2 runs much faster yet still achieves high accuracy (73.4 mAP).

Table 5 of the paper compares YOLOv2 with other detectors on the COCO dataset. At IOU = .5 it achieves 44.0 mAP, comparable to Faster R-CNN.

3. Faster

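YOLOv2 is designed to be fast, for detection applications such as autonomous driving.

Most detection frameworks rely on VGG-16, but its complexity slows them down. The YOLO framework instead uses a custom network based on the Googlenet architecture. It is faster than VGG-16 but slightly less accurate (88.0% single-crop top-5 accuracy at 224 × 224, versus 90.0% for VGG-16).

The text below covers implementation details, so it is left in the paper's original wording.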

Darknet-19.

We propose a new classification model to be used as the base of YOLOv2. Our model builds off of prior work on network design as well as common knowledge in the field. Similar to the VGG models we use mostly 3 × 3 filters and double the number of channels after every pooling step [17]. Following the work on Network in Network (NIN) we use global average pooling to make predictions as well as 1 × 1 filters to compress the feature representation between 3 × 3 convolutions [9]. We use batch normalization to stabilize training, speed up convergence, and regularize the model [7].

Our final model, called Darknet-19, has 19 convolutional layers and 5 maxpooling layers. For a full description see Table 6. Darknet-19 only requires 5.58 billion operations to process an image yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet.

Training for classification.

We train the network on the standard ImageNet 1000 class classification dataset for 160 epochs using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005 and momentum of 0.9 using the Darknet neural network framework. During training we use standard data augmentation tricks including random crops, rotations, and hue, saturation, and exposure shifts.
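To get a feel for "polynomial rate decay with a power of 4", here is a sketch of one common form of the schedule, lr = base_lr · (1 − step / total_steps)^power. The exact formula is my assumption, since the paper only names the policy, and `poly_lr` is a hypothetical helper.

```python
def poly_lr(step, total_steps, base_lr=0.1, power=4):
    """Polynomial decay from base_lr at step 0 down to 0 at total_steps."""
    return base_lr * (1 - step / total_steps) ** power


# Learning rate at a few points of a hypothetical 100-step run.
for step in (0, 25, 50, 75, 100):
    print(step, poly_lr(step, 100))
```

With a power of 4 the rate drops quickly at first and has already fallen to about 6% of its starting value halfway through training.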

As discussed above, after our initial training on images at 224 × 224 we fine tune our network at a larger size, 448. For this fine tuning we train with the above parameters but for only 10 epochs and starting at a learning rate of 10^-3. At this higher resolution our network achieves a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.

Training for detection.

We modify this network for detection by removing the last convolutional layer and instead adding on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box so 125 filters. We also add a passthrough layer from the final 3 × 3 × 512 layer to the second to last convolutional layer so that our model can use fine grain features.
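The filter count in the paragraph above is simple arithmetic: each of the 5 boxes carries 5 coordinates (tx, ty, tw, th, to) plus one score per class. A small sketch follows, with hypothetical helper names `detection_head_filters` and `output_shape`.

```python
def detection_head_filters(num_anchors, num_classes, coords=5):
    """Number of 1 x 1 filters in the final layer: anchors * (coords + classes)."""
    return num_anchors * (coords + num_classes)


def output_shape(input_size, num_anchors, num_classes, stride=32):
    """Shape of the detection output for a square input, assuming 32x downsampling."""
    grid = input_size // stride
    return grid, grid, detection_head_filters(num_anchors, num_classes)


print(detection_head_filters(5, 20))  # 125 for the 20 VOC classes
print(output_shape(416, 5, 20))       # (13, 13, 125)
print(detection_head_filters(5, 80))  # 425 for the 80 COCO classes
```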

We train the network for 160 epochs with a starting learning rate of 10^-3, dividing it by 10 at 60 and 90 epochs. We use a weight decay of 0.0005 and momentum of 0.9. We use a similar data augmentation to YOLO and SSD with random crops, color shifting, etc. We use the same training strategy on COCO and VOC.

4. Stronger

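We propose a mechanism for jointly training on classification and detection data. During training, images from detection and classification datasets are mixed.
Detection datasets (COCO) usually have general labels such as 'dog', whereas classification datasets (ImageNet) contain many breeds of dog such as 'Norfolk terrier', 'Yorkshire terrier', and 'Bedlington terrier'.
Classification usually produces its final output with a softmax layer across all categories. Softmax assumes the categories are mutually exclusive (e.g. "Norfolk terrier" and "dog" are not mutually exclusive, so ImageNet and COCO cannot simply be merged under one softmax).
A multi-label model, which does not assume mutual exclusion, could combine the datasets instead, but it ignores the structure we already know about the data (e.g. all COCO classes are mutually exclusive).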

Hierarchical classification.

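ImageNet labels are taken from WordNet. Rather than using the full graph structure, we simplify the problem by building a hierarchical tree from it. The result is WordTree, a hierarchical model of visual concepts. At every node we predict the conditional probability of each child given that node, for example:

Pr(Norfolk terrier | terrier)
Pr(Yorkshire terrier | terrier)
Pr(Bedlington terrier | terrier)
...

To compute the absolute probability of a particular node, we multiply the conditional probabilities along the path from that node to the root:

Pr(Norfolk terrier) = Pr(Norfolk terrier | terrier) * Pr(terrier | hunting dog) * ... * Pr(mammal | animal) * Pr(animal | physical object)

Prediction with WordTree is illustrated in Figure 5 of the paper.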
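The idea of multiplying conditional probabilities along the path to the root can be sketched on a toy tree. The tree below is made up and much smaller than WordTree; the softmax over each group of siblings follows the paper's description, and Pr(physical object) is taken to be 1, as the paper assumes for classification. The names `PARENT`, `sibling_softmax`, and `absolute_prob` are hypothetical.

```python
import math

# Hypothetical toy tree: child -> parent (the root has parent None).
PARENT = {
    "physical object": None,
    "animal": "physical object",
    "artifact": "physical object",
    "dog": "animal",
    "cat": "animal",
    "terrier": "dog",
    "hound": "dog",
    "Norfolk terrier": "terrier",
    "Yorkshire terrier": "terrier",
}


def sibling_softmax(logits):
    """Apply a softmax within each group of siblings, giving Pr(node | parent)."""
    groups = {}
    for node, parent in PARENT.items():
        if parent is not None:
            groups.setdefault(parent, []).append(node)
    cond = {}
    for children in groups.values():
        m = max(logits[c] for c in children)  # subtract the max for numerical stability
        exps = {c: math.exp(logits[c] - m) for c in children}
        z = sum(exps.values())
        for c in children:
            cond[c] = exps[c] / z
    return cond


def absolute_prob(node, cond):
    """Multiply conditionals along the path to the root; Pr(root) is taken as 1."""
    p = 1.0
    while PARENT[node] is not None:
        p *= cond[node]
        node = PARENT[node]
    return p


# Equal scores everywhere, so every conditional is 0.5 in this binary toy tree.
logits = {n: 0.0 for n in PARENT}
cond = sibling_softmax(logits)
print(absolute_prob("Norfolk terrier", cond))  # 0.0625 = 0.5 ** 4
```

Because each sibling group sums to 1, the probabilities of a node's children always add up to the probability of the node itself, so a prediction can stop at 'dog' when the model is unsure about the breed.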

Dataset combination with WordTree.

We use WordTree to combine datasets. Figure 6 of the paper shows an example of combining ImageNet and COCO labels.

Joint classification and detection.

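Using WordTree we build a dataset with 9418 categories. Since about 9000 of these categories come from ImageNet, which is a much larger dataset, we balance training by oversampling COCO so that ImageNet is larger only by a factor of 4:1. During training, detection images back-propagate the loss as usual, while classification images back-propagate only the classification loss ~~(e.g. if the label is 'dog', the detection dataset computes the loss using the conditional probabilities only down to 'dog').~~

Through this joint training, the model learns to classify more than 9000 categories.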

5. Conclusion

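This paper proposes YOLOv2 and YOLO9000. YOLOv2 achieves SOTA and can run at a variety of image sizes, offering a smooth trade-off between speed and accuracy.

YOLO9000 is a real-time framework that detects more than 9000 categories by jointly training on detection and classification. The joint training is done through WordTree. This greatly narrows the dataset size gap between detection and classification.

We expect that the hierarchical use of data can also be applied to segmentation tasks, and that techniques such as multi-scale training can be useful in other areas of computer vision.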
