Revision as of 11:41, 7 August 2017 (Mon)

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks^[1]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
arXiv:1506.01497

official MATLAB
official Python
TensorFlow version

The paper proposes a Region Proposal Network (RPN), which learns an objectness score.
As seen with Fast R-CNN, region proposal computation is the test-time bottleneck. By training an RPN built from convolutional layers, the method reaches state-of-the-art accuracy with only about 300 proposals per image. The RPN shares parameters with the existing Fast R-CNN.

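Selective Search^[2] is the usual choice, but it is very slow: about 2 seconds per image on a CPU. The more recent EdgeBoxes^[3] takes about 0.2 seconds per image, which is still a substantial cost. The proposed RPN shares convolutions with the detector, so its marginal cost is about 10 ms per image (the whole pipeline runs at 5 fps on a GPU).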

The RPN is built by adding a few convolutional layers on top of the existing Fast R-CNN, and it computes objectness on a regular grid. The RPN is therefore a kind of FCN^[4].

Training alternates between the two tasks: one round of fine-tuning for region proposal, then one round for object detection, and so on.

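Since the preliminary version was released, the method has been widely adopted, including commercially at Pinterest^[5]. In the ILSVRC and COCO 2015 competitions, many of the 1st-place entries were based on Faster R-CNN and RPN.^[6] ~~Envious.~~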

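The RPN is a small network that scans the output of the last Fast R-CNN conv layer in a sliding-window fashion. The experiments use a $3 \times 3$ window, which already gives a sufficiently large receptive field. The feature taken at each window position from the convolutional feature map (the output of "the last Fast R-CNN conv layer") is sent to a box regressor (reg) and a box classifier (cls), which are two sibling fully connected layers. In practice all sliding windows are computed simultaneously, and each window yields at most k proposals. These proposals are expressed relative to k reference boxes, each of which is called an anchor box. An anchor box is centered on its sliding window and carries a scale and an aspect ratio. The experiments use k=9 (3 scales $\times$ 3 aspect ratios).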
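
The anchor layout described above can be sketched in a few lines. This is only an illustration: the concrete scales (128, 256, 512 pixels), the aspect ratios (0.5, 1, 2, taken here as height / width), the feature stride of 16, and the function names `make_anchors` and `all_anchors` are assumptions, since the notes only state 3 scales $\times$ 3 aspect ratios.

```python
import math


def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors as (x1, y1, x2, y2).

    Every anchor is centered on (center_x, center_y), i.e. on the center
    of one sliding window. The default scales and ratios are assumed values.
    """
    anchors = []
    for scale in scales:
        area = float(scale * scale)  # each scale keeps the box area fixed
        for ratio in ratios:         # ratio = height / width (assumed convention)
            width = math.sqrt(area / ratio)
            height = width * ratio
            anchors.append((center_x - width / 2.0, center_y - height / 2.0,
                            center_x + width / 2.0, center_y + height / 2.0))
    return anchors


def all_anchors(feature_height, feature_width, stride=16):
    """Enumerate the anchors of every sliding-window position on the feature map.

    `stride` (image pixels per feature-map cell) is an assumed value.
    """
    anchors = []
    for row in range(feature_height):
        for col in range(feature_width):
            center_x = col * stride + stride / 2.0
            center_y = row * stride + stride / 2.0
            anchors.extend(make_anchors(center_x, center_y))
    return anchors
```

With these defaults a feature map of height H and width W gives 9 · H · W anchors, all produced by the same rule at every position.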

ㅌ

Unlike MultiBox^[7], the method is translation invariant. MultiBox uses k-means to generate its 800 anchors. (Note: one would expect every sliding-window based method to be translation invariant, so this does not seem very important.)

Multi-scale detection usually relies on ① an image pyramid, ② a filter pyramid (e.g. CNNs of various sizes), or a mix of the two. The proposed method can be seen as a pyramid of anchors, since it uses anchors of various sizes and aspect ratios. A pyramid of anchors differs from both, because the image and the filter are each of a single size.

When training the RPN, each anchor gets a binary class label (objectness, 0/1). An anchor is labeled positive (class=1) when ① its IoU with a ground-truth box is greater than 0.7, or ② it simply has the highest IoU with a ground-truth box. Condition ② is included because ① alone sometimes yields no positive (class=1) anchor. The loss is:
$$ L(\{p_i\}, \{t_i\}) = \frac{1}{N_\text{cls}}\sum_i L_\text{cls}(p_i, p_i^*) + \lambda\frac{1}{N_\text{reg}}\sum_i p_i^* L_\text{reg}(t_i, t_i^*)$$

($i$: index of an anchor in a mini-batch; $p_i$: predicted probability of objectness for anchor $i$; $p_i^*$: ground-truth objectness label, $0$ or $1$; $t_i$: the 4 parameterized coordinates^[8] of the predicted bbox; $t_i^*$: the same coordinates for the ground-truth box matched to a positive anchor; $L_\text{reg}$: smooth $\text{L}_1$^[9]; $N_\text{cls}$: mini-batch size; $N_\text{reg}$: number of anchor locations)
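
The loss above can be written out as a small sketch in plain Python. Several pieces are assumptions rather than something these notes state: $L_\text{cls}$ is taken to be the binary log loss, the box parameterization in `encode_box` is the usual center/size offset form, and all function names are hypothetical. The default values of `n_cls`, `n_reg`, and `lam` are the ones reported for the experiments (256, about 2400, and 10).

```python
import math


def encode_box(box, anchor):
    """Parameterize a box (x1, y1, x2, y2) relative to an anchor (assumed form)."""
    box_w, box_h = box[2] - box[0], box[3] - box[1]
    anc_w, anc_h = anchor[2] - anchor[0], anchor[3] - anchor[1]
    box_cx, box_cy = box[0] + box_w / 2.0, box[1] + box_h / 2.0
    anc_cx, anc_cy = anchor[0] + anc_w / 2.0, anchor[1] + anc_h / 2.0
    return ((box_cx - anc_cx) / anc_w, (box_cy - anc_cy) / anc_h,
            math.log(box_w / anc_w), math.log(box_h / anc_h))


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    abs_x = abs(x)
    return 0.5 * x * x if abs_x < 1.0 else abs_x - 0.5


def log_loss(p, label):
    """Binary log loss; assumed here as L_cls."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def rpn_loss(probs, labels, pred_coords, target_coords,
             n_cls=256, n_reg=2400, lam=10.0):
    """L = (1/N_cls) sum_i L_cls + lam * (1/N_reg) sum_i p_i^* L_reg."""
    cls_term = sum(log_loss(p, y) for p, y in zip(probs, labels))
    reg_term = 0.0
    for y, t, t_star in zip(labels, pred_coords, target_coords):
        if y == 1:  # p_i^* = 1: only positive anchors contribute to regression
            reg_term += sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return cls_term / n_cls + lam * reg_term / n_reg
```

The factor $p_i^*$ shows up as the `if y == 1` test: negative anchors are classified but never regressed.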

The experiments use $N_\text{cls}=256$, $N_\text{reg}\approx 2400$, and balancing parameter $\lambda=10$. $\lambda=10$ was chosen so that the two loss terms are roughly equally weighted, but experiments with other values show that it matters little. The normalization (the values of $N_\text{cls}$, $N_\text{reg}$) did not matter much either.
Because negative samples ($p_i^* = 0$) dominate, the positive : negative ratio is capped at 1:1. If positives make up less than half of the mini-batch (fewer than 128 for a mini-batch of 256), the remainder is filled with negatives.
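
The labeling rule and the mini-batch sampling can be sketched together as follows. The positive threshold 0.7, the "highest IoU" rule, and the batch size of 256 come from the notes. The negative threshold of 0.3, the use of -1 for anchors that are neither positive nor negative, and the function names are assumptions made for this illustration.

```python
import random


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for each anchor."""
    if not anchors or not gt_boxes:
        return [0] * len(anchors)
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = [-1] * len(anchors)
    for i, row in enumerate(overlaps):
        best = max(row)
        if best > pos_thresh:      # rule 1: IoU > 0.7 with some ground truth
            labels[i] = 1
        elif best < neg_thresh:    # assumed negative rule: low IoU with all
            labels[i] = 0
    for j in range(len(gt_boxes)):  # rule 2: highest IoU for each ground truth
        best = max(row[j] for row in overlaps)
        if best > 0.0:
            for i, row in enumerate(overlaps):
                if row[j] == best:
                    labels[i] = 1
    return labels


def sample_minibatch(labels, batch_size=256, rng=None):
    """Pick at most batch_size // 2 positives, then fill up with negatives."""
    rng = rng or random.Random(0)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    num_pos = min(len(positives), batch_size // 2)
    num_neg = min(len(negatives), batch_size - num_pos)
    return rng.sample(positives, num_pos), rng.sample(negatives, num_neg)
```

For example, an image with only 10 positive anchors gives a mini-batch of 10 positives and 246 negatives, while an image with 300 positives is cut down to 128 positives and 128 negatives.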

All newly added layers are initialized from a zero-mean Gaussian distribution ($\sigma = 0.01$); all other layers, such as the shared convolutional layers, use pretrained weights. On PASCAL VOC: lr $=0.001$ for 60k mini-batches and $0.0001$ for the next 20k, momentum $0.9$, weight decay $0.0005$. Implemented in Caffe.
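
The training settings above can be collected in a short sketch. It is not the Caffe configuration itself: the function names are hypothetical, and the update rule shown (weight decay added to the gradient, then a momentum step) is an assumed common form of SGD with momentum.

```python
import random


def learning_rate(iteration):
    """Step schedule: 0.001 for the first 60k mini-batches, then 0.0001.

    The schedule in the notes ends after 80k mini-batches in total.
    """
    return 0.001 if iteration < 60000 else 0.0001


def init_new_layer(num_weights, sigma=0.01, rng=None):
    """Zero-mean Gaussian initialization for a newly added layer."""
    rng = rng or random.Random(0)
    return [rng.gauss(0.0, sigma) for _ in range(num_weights)]


def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One scalar SGD step with momentum and L2 weight decay (assumed form)."""
    velocity = momentum * velocity - lr * (grad + weight_decay * weight)
    return weight + velocity, velocity
```

A single weight would be updated with `w, v = sgd_momentum_step(w, grad, v, learning_rate(step))` at each mini-batch.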

x

[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
[2] Citation error: invalid <ref> tag; no text was provided for the reference named r4.
[3] Citation error: invalid <ref> tag; no text was provided for the reference named r6.
[4] Citation error: invalid <ref> tag; no text was provided for the reference named r7.
[5] Citation error: invalid <ref> tag; no text was provided for the reference named r17.
[6] Citation error: invalid <ref> tag; no text was provided for the reference named r18.
[7] Citation error: invalid <ref> tag; no text was provided for the reference named r27.
[8] Citation error: invalid <ref> tag; no text was provided for the reference named r5.
[9] Citation error: invalid <ref> tag; no text was provided for the reference named r2.

