Faster R-CNN

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks^[1]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
arXiv:1506.01497

Official MATLAB
Official Python
TensorFlow version

The paper proposes the RPN (Region Proposal Network), which learns objectness scores.
As seen with Fast R-CNN, region proposal computation is the bottleneck at test time. By training an RPN built from convolutional layers, the system reaches state-of-the-art accuracy with only about 300 proposals per image. The RPN shares parameters with the existing Fast R-CNN.

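Selective Search^[2] is the usual choice, but it is very slow: about 2 seconds per image when only a CPU is used. The more recent EdgeBoxes^[3] takes about 0.2 seconds per image, which is still a considerable cost. The paper proposes an RPN that shares convolutions with the detection network, so its marginal cost is about 10 ms per image. (The whole pipeline runs on a GPU at 5 fps with VGG-16 and 17 fps with ZF net.)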

The RPN is built by adding a few convolutional layers to the existing Fast R-CNN, and it computes objectness at each location of a regular grid. The RPN is therefore a kind of FCN^[4].

Since the preliminary version was released, it has been widely used, including commercially at Pinterest^[5]. Many of the first-place entries in the ILSVRC and COCO 2015 competitions were based on Faster R-CNN and RPN.^[6]

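The RPN is a small network that slides over the output of the last shared convolutional layer of Fast R-CNN. The experiments use a $3 \times 3$ window, which already gives a sufficiently large receptive field. The feature taken at each window position of this convolutional feature map (the output of "the last conv of Fast R-CNN") is sent to a box-regressor (reg) and a box-classifier (cls). The paper describes these as two sibling fully connected layers, implemented as $1 \times 1$ convolutions.

In practice all sliding-window positions are computed at once, and each position has at most $k$ proposals. Each proposal is defined relative to a reference box called an anchor box. An anchor box is centered on its sliding window and carries a scale and an aspect ratio.

The experiments use $k=9$: 3 scales ($128^2$, $256^2$, $512^2$ pixels) × 3 aspect ratios (1:1, 1:2, 2:1). For the later MS COCO experiments, a $64^2$ scale is added to account for small objects. Fixing the aspect ratio at 1:1 and using only the three scales gives comparable results, but three ratios are kept for the flexibility of the system.

The image is resized so that its shorter side is 600 px, and the stride is about 16 px, so a 1000 × 600 image has roughly 20,000 anchors ($\approx 1000/16 \times 600/16 \times 9 \approx 60 \times 40 \times 9$). During training, anchors that cross the image boundary are ignored, which leaves only about 6,000 of them. At test time cross-boundary boxes can still be predicted, because the receptive field of the CNN extends across the image boundary; the paper clips such boxes to the image. Not all of the remaining proposals are used either: NMS (non-maximum suppression) is applied based on the $cls$ scores with an IoU threshold of 0.7, which leaves about 2,000 proposals per image.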
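The anchor layout described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the function names (`make_base_anchors`, `shift_anchors`, `inside_image`) are hypothetical, the anchor centers are assumed to sit in the middle of each 16 px cell, and the feature-map size is approximated by integer division of the image size by the stride.

```python
import numpy as np

def make_base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centered at (0, 0).

    Each box has area scale**2 and height / width equal to ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(base, feat_h, feat_w, stride=16):
    """Copy the base anchors to every sliding-window position of the feature map."""
    # Assumption: each anchor center is the center of its stride x stride cell.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base[None, :, :]).reshape(-1, 4)

def inside_image(anchors, img_h, img_w):
    """Boolean mask of anchors that do not cross the image boundary."""
    return (
        (anchors[:, 0] >= 0) & (anchors[:, 1] >= 0)
        & (anchors[:, 2] <= img_w) & (anchors[:, 3] <= img_h)
    )

base = make_base_anchors()                       # k = 9 anchors
anchors = shift_anchors(base, feat_h=600 // 16, feat_w=1000 // 16)
print(anchors.shape)                             # (20646, 4): 37 * 62 * 9 anchors
keep = inside_image(anchors, img_h=600, img_w=1000)
print(int(keep.sum()), "anchors stay inside the image")
```

With floor division the grid is 62 × 37 rather than the rounded 60 × 40 used in the estimate above, so the total still comes out at roughly 20,000. The number of anchors that survive the boundary filter depends on the centering convention, so it is printed rather than asserted here.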

Unlike MultiBox^[7], the method is translation-invariant. MultiBox uses k-means to generate its 800 anchors. I would expect every sliding-window-based method to be translation-invariant anyway, so this point does not seem very important.

Multi-scale detection is usually handled with ① an image pyramid, ② a filter pyramid (e.g., CNNs of various sizes), or a mix of the two. The method here can be seen as a pyramid of anchors, because it uses anchors of various sizes and aspect ratios. A pyramid of anchors differs from both, since the image has a single size and the filter has a single size.

When training the RPN, each anchor is given a binary class label (objectness, 0/1). An anchor is labeled positive when ① its IoU with a ground-truth box is greater than 0.7, or ② it simply has the highest IoU with a ground-truth box. Rule ② is included because rule ① alone sometimes produces no positive (class = 1) case. The loss is as follows.

$$ L(\{p_i\}, \{t_i\}) = \frac{1}{N_\text{cls}}\sum_i L_\text{cls}(p_i, p_i^*) + \lambda\frac{1}{N_\text{reg}}\sum_i p_i^* L_\text{reg}(t_i, t_i^*) $$

($i$ : index of an anchor in a mini-batch, $p_i$ : predicted probability of objectness of the anchor, $p_i^*$ : ground-truth label (objectness), $0$ or $1$, $t_i$ : 4 parameterized coordinates^[8] of the predicted bbox, $t_i^*$ : the same parameterization of the ground-truth box associated with a positive anchor, $L_\text{cls}$ : log loss over the two classes, $L_\text{reg}$ : smooth $\text{L}_1$ ^[9], $N_\text{cls}$ : mini-batch size, $N_\text{reg}$ : number of anchor locations)
The experiments use $N_\text{cls}=256$, $N_\text{reg}\approx 2400$, and balancing parameter $\lambda=10$. $\lambda=10$ is chosen so that the two terms of the loss are roughly equally weighted, but trying several values shows that it does not matter much.^[10] The normalization (the values of $N_\text{cls}$, $N_\text{reg}$) was not very important either.
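As a concrete reading of the loss above, here is a minimal NumPy sketch. It assumes `p` already holds objectness probabilities, uses binary cross-entropy for $L_\text{cls}$ and the smooth $\text{L}_1$ definition from Fast R-CNN for $L_\text{reg}$; the function names and the `eps` clipping are my own choices for illustration, not part of the paper.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5 (elementwise)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0, eps=1e-12):
    """Multi-task RPN loss for one mini-batch of anchors.

    p      : (N,)   predicted objectness probabilities
    p_star : (N,)   ground-truth labels, 0 or 1
    t      : (N, 4) predicted parameterized box coordinates
    t_star : (N, 4) ground-truth parameterized box coordinates
    """
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0); an implementation detail
    l_cls = -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # The regression term is active only for positive anchors (p_star = 1).
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

# One positive anchor, p = 0.5, and a regression error of 2 in one coordinate.
p = np.array([0.5])
p_star = np.array([1.0])
t = np.array([[2.0, 0.0, 0.0, 0.0]])
t_star = np.zeros((1, 4))
print(rpn_loss(p, p_star, t, t_star, n_cls=1, n_reg=1))  # ln(2) + 10 * 1.5
```

Because $p_i^*$ multiplies the regression term, negative anchors contribute nothing to the box loss regardless of what `t` contains for them.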
Because negative samples ($p_i^* = 0$) are dominant, the mini-batch is sampled so that the positive : negative ratio is at most 1:1. If the positives do not make up half of it (fewer than 128 for a mini-batch of 256), the rest is filled with negatives.
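The labeling rules and the mini-batch sampling can be sketched as below. This is an illustration, not the reference implementation. The negative threshold of IoU < 0.3 is taken from the paper rather than from these notes, and the function names, the `-1` "ignored" label, and the use of a seeded `numpy.random.Generator` are assumptions made for the example.

```python
import numpy as np

def iou_matrix(anchors, gt):
    """Pairwise IoU between anchors (N, 4) and ground-truth boxes (M, 4), as (x1, y1, x2, y2)."""
    x1 = np.maximum(anchors[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    iou = iou_matrix(anchors, gt)
    max_iou = iou.max(axis=1)
    labels = np.full(len(anchors), -1, dtype=np.int64)
    labels[max_iou < neg_thresh] = 0          # negative: low IoU with every box
    labels[max_iou > pos_thresh] = 1          # rule 1: IoU > 0.7 with some box
    best_per_gt = iou.max(axis=0)             # rule 2: best anchor(s) for each box
    is_best = (iou == best_per_gt[None, :]) & (best_per_gt[None, :] > 0)
    labels[is_best.any(axis=1)] = 1
    return labels

def sample_minibatch(labels, batch_size=256, rng=None):
    """Pick at most batch_size // 2 positives, then fill the rest with negatives."""
    rng = np.random.default_rng() if rng is None else rng
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), batch_size // 2)
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.permutation(pos)[:n_pos], rng.permutation(neg)[:n_neg]

gt = np.array([[0.0, 0.0, 10.0, 10.0]])
anchors = np.array([[0.0, 0.0, 10.0, 10.0],    # IoU 1.0 -> positive
                    [0.0, 0.0, 10.0, 5.0],     # IoU 0.5 -> ignored
                    [20.0, 20.0, 30.0, 30.0]]) # IoU 0.0 -> negative
print(label_anchors(anchors, gt))              # [ 1 -1  0]
```

Anchors labeled `-1` fall between the two thresholds and take no part in the loss, which is why the sampler only draws from labels 0 and 1.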

All newly introduced layers are initialized from a zero-mean Gaussian distribution ($\sigma = 0.01$), and all other layers, such as the shared convolutional layers, use pretrained weights. On PASCAL VOC the learning rate is $0.001$ for 60k mini-batches and $0.0001$ for the next 20k, with momentum $0.9$ and weight decay $0.0005$. The implementation uses Caffe.
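For reference, the schedule above can be written as a small step function, together with a standard momentum-SGD update. This is a generic sketch, not the Caffe solver used by the authors; the function names are hypothetical and the update rule is the usual "momentum plus L2 weight decay" form.

```python
def learning_rate(iteration, base_lr=1e-3, drop_at=60_000, total=80_000, factor=0.1):
    """Step schedule: base_lr for the first `drop_at` mini-batches, then base_lr * factor."""
    if not 0 <= iteration < total:
        raise ValueError("iteration outside the training schedule")
    return base_lr if iteration < drop_at else base_lr * factor

def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum; weight decay is added to the gradient."""
    v_new = momentum * v - lr * (grad + weight_decay * w)
    return w + v_new, v_new

# Works on plain floats as well as NumPy arrays.
w, v = 1.0, 0.0
w, v = sgd_momentum_step(w, v, grad=2.0, lr=learning_rate(0))
print(w, v)
```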

Training can alternate between fine-tuning for region proposals and fine-tuning for object detection ("Alternating training"), or it can be done end to end ("Approximate joint training").^[11] The paper proposes a (pragmatic) 4-step alternating training. ① First, train the RPN using "the pretrained net + the conv layers added for the RPN". ② Using this RPN, fine-tune the detection net (here Fast R-CNN), which also starts from a pretrained net. ③ Attach the "conv layers added for the RPN" to this fine-tuned net and train the RPN again (from this step on the conv layers are shared). ④ In this state, train the detection net again.

The pretrained nets are ZF (fast version, 5 conv + 3 fc^[12]) and VGG-16 (13 conv + 3 fc^[13]).

Simply replacing VGG-16 with ResNet also improves performance.

[1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015.

[2] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision (IJCV), 2013.

[3] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision (ECCV), 2014.

[4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[5] D. Kislyuk, Y. Liu, D. Liu, E. Tzeng, and Y. Jing, “Human curation and convnets: Powering item-to-item recommendations on pinterest,” arXiv:1511.04003, 2015.

[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385, 2015.

[7] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, “Scalable, high-quality object detection,” arXiv:1412.1441 (v1), 2015.

[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[9] R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015.

[10]

| $\lambda$ | 0.1 | 1 | 10 | 100 |
|---|---|---|---|---|
| mAP (%) | 67.2 | 68.9 | 69.9 | 69.1 |

Given these numbers, though, shouldn't the $\lambda=0$ case be examined as well?

[11] The paper also briefly mentions "Non-approximate joint training": the gradients with respect to the bbox coordinates should be taken into account as well. For a concrete method it refers to J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” arXiv:1512.04412, 2015.

[12] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2014.

[13] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
