[Object Detection] Rich feature hierarchies for accurate object detection and semantic segmentation: translation and summary (R-CNN)

2021. 8. 3. 01:11

https://arxiv.org/pdf/1311.2524.pdf

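Main contributions:

One of the first works to show that a CNN can dramatically improve object detection performance.
More than 30% relative improvement in mAP over the previous best result on PASCAL VOC 2012.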

0. Abstract

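This paper proposes an object detection algorithm that achieves a mAP of 53.3% on PASCAL VOC 2012.

The two key insights are as follows.

Apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects.
When labeled training data is scarce, supervised pre-training on an auxiliary task followed by domain-specific fine-tuning yields a significant performance boost.

Because the method combines region proposals with CNNs, it is called R-CNN: Regions with CNN features.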

1. Introduction

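For years, progress on many visual recognition tasks relied on SIFT and HOG features. Progress was slow, however, with only small gains obtained from ensemble systems and minor variants of existing methods.

This paper is the first to show that a CNN can achieve dramatically higher object detection performance on PASCAL VOC than systems based on HOG-like features. To get there, it focuses on two problems: localizing objects with a deep network, and training a high-capacity model with only a small amount of annotated detection data.

Unlike image classification, object detection requires localizing objects within an image. We generate around 2000 region proposals for the input image, extract a fixed-length feature vector from each region with a CNN, and then classify each region with class-specific linear SVMs. To obtain a fixed-length feature vector from every proposal regardless of its shape, each region is warped to the same size. Since the system combines region proposals with CNNs, we call it R-CNN: Regions with CNN features.

On the ILSVRC2013 detection dataset, R-CNN achieves a mAP of 31.4%, well ahead of the 24.3% of OverFeat.

The paper also shows that when domain-specific data is scarce, pre-training on a large auxiliary dataset and then fine-tuning is effective. Fine-tuning improves mAP by about 8 percentage points.

The system is also efficient: the only class-specific computations are a matrix-vector product and greedy non-maximum suppression, and the CNN features are low-dimensional, which keeps these steps fast.

With minor modifications, the method achieves an average segmentation accuracy of 47.9% on the PASCAL VOC 2011 test set.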

2. Object detection with R-CNN

The object detection system consists of three modules.

Bottom-up region proposal generation with selective search (around 2000 proposals)
A CNN that extracts a fixed-length feature vector from each region
Class-specific linear SVMs

Figure: overview of the R-CNN pipeline.

2.1. Module design

Feature extraction. A CNN extracts a 4096-dimensional feature vector from each region proposal. The features are computed by forward propagating a 227x227 RGB image through five convolutional layers and two fully connected (fc) layers.
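To make the warping step concrete, here is a minimal NumPy sketch that crops a proposal and stretches it to 227x227 regardless of its aspect ratio. The function name warp_region, the (x1, y1, x2, y2) box format and the nearest-neighbour sampling are assumptions made for this example; the actual implementation may interpolate differently and add context around the box.

```python
import numpy as np

def warp_region(image, box, size=227):
    """Crop box = (x1, y1, x2, y2) from an HxWx3 image and stretch it to size x size.

    Uses nearest-neighbour sampling and ignores the aspect ratio of the box
    (a simplification assumed for this sketch).
    """
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # For each output row/column, pick the nearest source row/column.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return crop[rows[:, None], cols[None, :]]

image = np.zeros((300, 400, 3), dtype=np.uint8)   # placeholder image
warped = warp_region(image, (50, 20, 350, 120))   # a wide 300x100 proposal
print(warped.shape)  # (227, 227, 3)
```

Every proposal ends up with the same shape, which is what lets a single CNN produce a fixed-length feature vector for each of them.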

2.2. Test-time detection

Selective search is run on the input image to extract around 2000 region proposals. Each proposal is warped to a fixed size and passed through the CNN to extract a feature vector. Then, for each class, the SVM trained for that class scores every extracted feature vector. Finally, greedy NMS (non-maximum suppression) is applied to each class independently to remove duplicate detections. (A region is rejected if its IoU with a higher-scoring selected region exceeds a threshold, so only the highest-scoring region of each overlapping group survives. The concept of IoU is explained in detail at the following link.)

https://silhyeonha-git.tistory.com/3
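The greedy NMS step can be sketched in plain Python as follows. The (x1, y1, x2, y2) box format, the function names and the threshold value of 0.3 are assumptions for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Return indices of kept boxes for one class, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that was already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]
```

In this toy example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box does not overlap anything and is kept.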


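Run-time analysis. Two properties make detection efficient.

All CNN parameters are shared across all categories.
The feature vectors computed by the CNN are low-dimensional compared to other common approaches. (→ less computation)

As a result, the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) is amortized over all classes.

(The feature matrix of an image is 2000x4096 and the SVM weight matrix is 4096xN, where N is the number of classes.)

Because the class-specific work is a single matrix multiplication, which is fast, the number of classes can be increased dramatically.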
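As a rough illustration of this matrix product, the NumPy sketch below scores all proposals for all classes at once. The random features and weights, the bias term, N = 20 and the function name score_regions are placeholders assumed for the example, not values taken from the paper.

```python
import numpy as np

def score_regions(features, svm_weights, svm_bias):
    """Score every region for every class with one matrix product.

    features:    (num_regions, feat_dim) CNN feature matrix
    svm_weights: (feat_dim, num_classes), one column per class SVM
    svm_bias:    (num_classes,)
    """
    return features @ svm_weights + svm_bias

rng = np.random.default_rng(0)
features = rng.standard_normal((2000, 4096)).astype(np.float32)   # placeholder features
svm_weights = rng.standard_normal((4096, 20)).astype(np.float32)  # N = 20 is assumed
svm_bias = np.zeros(20, dtype=np.float32)

scores = score_regions(features, svm_weights, svm_bias)
print(scores.shape)  # (2000, 20)
```

Adding a class only adds one column to svm_weights; the expensive CNN features are computed once and reused.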

2.3. Training

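Domain-specific fine-tuning. To adapt the CNN, which was pre-trained for classification, to the new task (detection) and the new domain (warped proposal windows), training of the CNN is continued using only warped region proposals.

Object category classifiers. Once features are extracted, one linear SVM is trained per class.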

2.4. Results on PASCAL VOC 2010-12

R-CNN BB uses the same region proposal algorithm (selective search) as UVA, yet improves mAP by about 18.6 percentage points (from 35.1% to 53.7% on VOC 2010 test) and is also much faster.

2.5. Results on ILSVRC2013 detection

R-CNN outperforms the methods that competed in the ILSVRC2013 detection task.

3. Visualization, ablation, and modes of error

3.3. Network architectures

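Using O-Net (the 16-layer network of Simonyan and Zisserman) instead of T-Net (the network of Krizhevsky et al.) raises mAP from 58.5% to 66.0%, at the cost of a forward pass that takes about 7 times longer.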

3.5. Bounding-box regression

Bounding-box regression improves mAP by about 3 to 4 percentage points.

Bounding-box regression

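A simple bounding-box regression stage improves localization performance. After each region proposal is scored with a class-specific detection SVM, a class-specific bounding-box regressor predicts a new bounding box for the detection.

The input is a set of N training pairs {(P^i, G^i)}_{i=1,...,N}, where P^i = (P^i_x, P^i_y, P^i_w, P^i_h) specifies, in order, the x and y coordinates of the center of the region proposal and its width and height. The ground-truth bounding box G = (G_x, G_y, G_w, G_h) is specified in the same way. (The superscript i is dropped unless it is needed.) The goal is to learn a transformation that maps a proposed box P to a ground-truth box G.

The transformation is parameterized by four functions d_x(P), d_y(P), d_w(P), and d_h(P). d_x(P) and d_y(P) specify a scale-invariant translation of the center of the bounding box of P, while d_w(P) and d_h(P) specify log-space translations of its width and height. After learning these functions, we can transform an input proposal P into a predicted ground-truth box \hat{G} by applying the transformation:

\hat{G}_x = P_w d_x(P) + P_x
\hat{G}_y = P_h d_y(P) + P_y
\hat{G}_w = P_w \exp(d_w(P))
\hat{G}_h = P_h \exp(d_h(P))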

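Each function d_*(P) (where * is one of x, y, w, h) is modeled as a linear function of the pool5 features of proposal P, denoted \phi_5(P); the dependence of \phi_5(P) on the image data is implicitly assumed. Thus d_*(P) = w_*^T \phi_5(P), where w_* is a vector of learnable model parameters. w_* is learned by optimizing a regularized least squares objective (ridge regression).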
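For reference, ridge regression has a closed-form solution, sketched below with NumPy for a single coordinate. The toy data, the feature dimension of 8, the regularization values and the name fit_ridge are assumptions for illustration; in R-CNN the features would be the pool5 features of the proposals.

```python
import numpy as np

def fit_ridge(phi, t, lam):
    """Closed-form minimizer of sum_i (t_i - w^T phi_i)^2 + lam * ||w||^2.

    phi: (num_pairs, feat_dim) feature matrix, one row per proposal
    t:   (num_pairs,) regression targets for one of x, y, w, h
    """
    feat_dim = phi.shape[1]
    gram = phi.T @ phi + lam * np.eye(feat_dim)
    return np.linalg.solve(gram, phi.T @ t)

# Toy data (assumed): 200 proposals with 8-dimensional features.
rng = np.random.default_rng(0)
phi = rng.standard_normal((200, 8))
w_true = rng.standard_normal(8)
t = phi @ w_true

w_small = fit_ridge(phi, t, lam=1e-8)  # almost unregularized
w_large = fit_ridge(phi, t, lam=1e3)   # strongly shrunk toward zero
print(np.abs(w_small - w_true).max())
print(np.linalg.norm(w_large), np.linalg.norm(w_true))
```

With almost no regularization the fit recovers the weights that generated the toy targets, while a large lam shrinks the weights toward zero. One such weight vector would be fit for each of the four coordinates.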

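The regression targets t_* for a training pair (P, G) are defined as follows.

t_x = (G_x - P_x) / P_w
t_y = (G_y - P_y) / P_h
t_w = \log(G_w / P_w)
t_h = \log(G_h / P_h)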
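The parameterization can be checked with a small round trip in Python: encode a (P, G) pair into targets, then apply those targets to P as if they were the predicted d_*(P). The function names encode and decode and the example boxes are assumptions for this sketch; boxes are (center x, center y, width, height) as in the text.

```python
import math

def encode(P, G):
    """Regression targets for proposal P and ground truth G.

    t_x = (G_x - P_x) / P_w,  t_y = (G_y - P_y) / P_h,
    t_w = log(G_w / P_w),     t_h = log(G_h / P_h)
    """
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))

def decode(P, d):
    """Apply offsets d = (d_x, d_y, d_w, d_h) to proposal P to get a predicted box."""
    px, py, pw, ph = P
    dx, dy, dw, dh = d
    return (pw * dx + px, ph * dy + py, pw * math.exp(dw), ph * math.exp(dh))

P = (100.0, 100.0, 50.0, 80.0)   # example proposal (assumed values)
G = (110.0, 96.0, 60.0, 72.0)    # example ground truth (assumed values)
t = encode(P, G)
print(t)             # center offsets are 0.2 and -0.05 (up to rounding)
print(decode(P, t))  # recovers G up to floating-point error
```

Dividing the center offsets by the proposal size makes them scale-invariant, and the logarithm keeps the predicted width and height positive after decoding.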

3.6. Qualitative results

6. Conclusion

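Compared with earlier approaches that performed object detection with complex ensembles, the paper proposes a simple object detection algorithm that gives a 30% relative improvement over the best previous results on PASCAL VOC 2012.

I think applying CNNs to bottom-up region proposals, and pre-training followed by fine-tuning for a task where data is scarce, were major contributions to the object detection field.

It is one of the first works to show that deep learning can be applied successfully to object detection.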
