Faster RCNN - Revision history

Revision by Admin as of 08:28, 5 September 2018 (Wed)

2018-09-05T08:28:54Z

Revision by Admin as of 08:00, 7 August 2017 (Mon)

2017-08-07T08:00:15Z

Revision by Admin as of 07:58, 7 August 2017 (Mon)

2017-08-07T07:58:56Z

Revision by Admin as of 07:57, 7 August 2017 (Mon)

2017-08-07T07:57:01Z

Revision by Admin as of 07:55, 7 August 2017 (Mon)

2017-08-07T07:55:42Z

Revision by Admin as of 07:54, 7 August 2017 (Mon)

2017-08-07T07:54:15Z

Revision by Admin as of 07:52, 7 August 2017 (Mon)

2017-08-07T07:52:51Z

Admin: /* x */

2017-08-07T07:50:24Z

Revision by Admin as of 07:17, 7 August 2017 (Mon)

2017-08-07T07:17:17Z

Revision by Admin as of 07:13, 7 August 2017 (Mon)

2017-08-07T07:13:54Z

@@ Line 23: / Line 23: @@
 ration and convnets: Powering item-to-item recommendations
 on pinterest,” arXiv:1511.04003, 2015.</ref>. Many of the 1st-place entries in the ILSVRC and COCO 2015 competitions are based on Faster R-CNN and RPN.<ref name=r18>K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
-for image recognition,” arXiv:1512.03385, 2015.</ref> <del>Envious.</del>
+for image recognition,” arXiv:1512.03385, 2015.</ref>
 RPN is a small network that scans the last conv feature map of Fast R-CNN in a sliding-window fashion. The experiments use a \(3 \times 3\) window, which already gives a sufficiently large receptive field. Based on the region proposals obtained this way, the convolutional feature map (the output of ‘the last conv of Fast R-CNN’) is selected and sent to a box-regressor (<i>reg</i>) and a box-classifier (<i>cls</i>) (both 2-layer FC). In practice all sliding windows are computed simultaneously, and each yields at most <i>k</i> proposals. Each of these is called an '''anchor box'''. An anchor box is centered on its sliding window and carries a scale and an aspect ratio. The experiments use \(k=9\) (3 scales × 3 aspect ratios, namely \(128^2, 256^2, 512^2, 1:1, 1:2, 2:1\). Later, for the MS COCO experiments, \(64^2\) was added to account for small objects. Fixing the aspect ratio at 1:1 and using only three scales gives comparable results, but three ratios are kept for the flexibility of the system). (The image is resized so that its shorter side is 600 px and the stride is about 16 px, so a 1000 × 600 image has roughly 20k anchors (\(\approx 1000/16 × 600/16 × 9 \) \(\approx 60 × 40 × 9\)). However, anchors that cross the image boundary are ignored during training, so only about 6,000 of them are used. At test time cross-boundary boxes may still be predicted, because the CNN's receptive field extends across the image boundary. The proposals are not all used either: NMS ([[non-maximum suppression]]) is applied on the \(cls\) scores with an IoU threshold of 0.7, leaving about 2,000.)
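
The anchor arithmetic in the paragraph above can be checked with a minimal sketch. It assumes that a ratio means height / width, that every anchor keeps the area scale², and that anchors are centred on the middle of each 16 px stride cell; the function names are illustrative and not taken from the paper or its code.

```python
import math

def anchor_shapes(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return one (width, height) pair per scale/ratio combination.

    Assumption: `ratio` means height / width and every anchor keeps
    the area scale**2, so 3 scales x 3 ratios give k = 9 shapes.
    """
    shapes = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            shapes.append((width, height))
    return shapes

def count_anchors(img_w, img_h, stride=16, k=9):
    """Total anchors: one set of k anchors per feature-map cell."""
    return (img_w // stride) * (img_h // stride) * k

def count_inside(img_w, img_h, stride=16, shapes=None):
    """Count anchors that lie entirely inside the image.

    Assumption: each anchor is centred on the middle of its stride cell.
    """
    shapes = anchor_shapes() if shapes is None else shapes
    inside = 0
    for row in range(img_h // stride):
        for col in range(img_w // stride):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for w, h in shapes:
                if (cx - w / 2 >= 0 and cy - h / 2 >= 0
                        and cx + w / 2 <= img_w and cy + h / 2 <= img_h):
                    inside += 1
    return inside

if __name__ == "__main__":
    print(len(anchor_shapes()))      # shapes per sliding-window position
    print(count_anchors(1000, 600))  # roughly 20k for a 1000 x 600 image
    print(count_inside(1000, 600))   # fewer once cross-boundary anchors are dropped
```

With integer division the 1000 × 600 case gives 62 × 37 × 9 = 20,646 anchors, in line with the rough 20k figure. The count of fully inside anchors depends on the centring assumption, so it need not reproduce the 6,000 quoted above exactly.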

@@ Line 52: / Line 52: @@
 For training, one can alternate fine-tuning, once for region proposal and once for object detection (‘Alternating training’), or train end-to-end (‘Approximate joint training’)<ref>The paper also briefly mentions ‘Non-approximate joint training’, which must additionally account for the gradients w.r.t. the bbox coordinates. For the concrete method it refers to J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” arXiv:1512.04412, 2015.</ref>. The paper adopts a (pragmatic) 4-step alternating training. ① First train the RPN with ‘the pretrained net + the conv layers added for the RPN’. ② Using the proposals of this RPN, fine-tune the detection net (here Fast R-CNN), which also starts from a pretrained net. ③ Attach ‘the conv layers added for the RPN’ to this fine-tuned net and train the RPN again (from this step on the conv layers are shared and kept fixed). ④ In this state, train the detection net again.
-As pretrained nets, ZF (fast version, 5 conv + 3 fc<ref name=r32>M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2014.</ref>) and VGG-16 (13 conv + 3 fc<ref name=r3>K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.</ref>)
+As pretrained nets, ZF (fast version, 5 conv + 3 fc<ref name=r32>M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2014.</ref>) and VGG-16 (13 conv + 3 fc<ref name=r3>K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.</ref>) are used.
 Simply replacing VGG-16 with ResNet already improves performance.
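
The order of the four steps, and the point at which the conv layers become shared, can be traced with a toy sketch. Nothing is actually trained here: `train` is a hypothetical placeholder that only tracks which weights the shared conv layers carry, and freezing those layers in steps ③ and ④ is an assumption based on my reading of the paper.

```python
from dataclasses import dataclass

@dataclass
class Net:
    shared_conv: str  # tag naming the weights held by the shared conv layers
    head: str         # "rpn" or "detector"

def train(head, init_conv, tag, freeze_shared):
    """Placeholder for a real training run (hypothetical helper).

    If the shared conv layers are frozen, their tag is unchanged;
    otherwise fine-tuning yields new conv weights, marked by `tag`.
    """
    conv = init_conv if freeze_shared else f"{init_conv}>{tag}"
    return Net(shared_conv=conv, head=head)

def four_step_alternating(pretrained="imagenet"):
    # Step 1: train the RPN, starting from the pretrained net.
    rpn1 = train("rpn", pretrained, "rpn1", freeze_shared=False)
    # Step 2: train the detector on rpn1's proposals, again from the
    # pretrained net; at this point the two nets do not share conv layers.
    det1 = train("detector", pretrained, "det1", freeze_shared=False)
    # Step 3: re-train the RPN on top of the detector's conv layers,
    # keeping those layers fixed so they become shared.
    rpn2 = train("rpn", det1.shared_conv, "rpn2", freeze_shared=True)
    # Step 4: re-train the detector head on rpn2's proposals,
    # still with the shared conv layers fixed.
    det2 = train("detector", rpn2.shared_conv, "det2", freeze_shared=True)
    return rpn1, det1, rpn2, det2

if __name__ == "__main__":
    for net in four_step_alternating():
        print(net)
```

After steps ① and ② the two nets hold different conv weights. After steps ③ and ④ both carry the same tag, which is the sharing the notes describe.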

@@ Line 50: / Line 50: @@
 All newly introduced layers are initialized from a zero-mean Gaussian distribution (\(\sigma = 0.01\)); all other layers, such as the shared conv layers, are initialized from the pretrained model. lr\(=0.001\) for 60k mini-batches and \(0.0001\) for the next 20k on PASCAL VOC, momentum \(0.9\), weight decay \(0.0005\). Implemented in Caffe.
-For training, one can alternate fine-tuning, once for region proposal and once for object detection (‘Alternating training’), or train end-to-end (‘Approximate joint training’)<ref>The paper also briefly mentions ‘Non-approximate joint training’, which must additionally account for the gradients w.r.t. the bbox coordinates. For the concrete method it refers to J. Dai et al<ref name=r15>J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” arXiv:1512.04412, 2015.</ref>.</ref>. The paper adopts a (pragmatic) 4-step alternating training. ① First train the RPN with ‘the pretrained net + the conv layers added for the RPN’. ② Using the proposals of this RPN, fine-tune the detection net (here Fast R-CNN), which also starts from a pretrained net. ③ Attach ‘the conv layers added for the RPN’ to this fine-tuned net and train the RPN again (from this step on the conv layers are shared and kept fixed). ④ In this state, train the detection net again.
+For training, one can alternate fine-tuning, once for region proposal and once for object detection (‘Alternating training’), or train end-to-end (‘Approximate joint training’)<ref>The paper also briefly mentions ‘Non-approximate joint training’, which must additionally account for the gradients w.r.t. the bbox coordinates. For the concrete method it refers to J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” arXiv:1512.04412, 2015.</ref>. The paper adopts a (pragmatic) 4-step alternating training. ① First train the RPN with ‘the pretrained net + the conv layers added for the RPN’. ② Using the proposals of this RPN, fine-tune the detection net (here Fast R-CNN), which also starts from a pretrained net. ③ Attach ‘the conv layers added for the RPN’ to this fine-tuned net and train the RPN again (from this step on the conv layers are shared and kept fixed). ④ In this state, train the detection net again.
 As pretrained nets, ZF (fast version, 5 conv + 3 fc<ref name=r32>M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2014.</ref>) and VGG-16 (13 conv + 3 fc<ref name=r3>K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.</ref>)
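
The hyperparameters listed in this hunk (Gaussian init with \(\sigma = 0.01\), step learning rate, momentum 0.9, weight decay 0.0005) can be written out as a small standard-library sketch. The update rule below is one common momentum-SGD formulation with L2 weight decay and is an assumption, not a transcription of the Caffe solver; all function names are illustrative.

```python
import random

def learning_rate(iteration):
    """Step schedule from the notes: 0.001 for the first 60k mini-batches,
    then 0.0001 (used for the following 20k on PASCAL VOC)."""
    return 1e-3 if iteration < 60_000 else 1e-4

def init_new_layer(n_weights, sigma=0.01, seed=0):
    """Zero-mean Gaussian initialisation for a newly added layer."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n_weights)]

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One scalar momentum-SGD update with L2 weight decay.

    Assumed form: v <- momentum * v - lr * (grad + weight_decay * w); w <- w + v.
    """
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity

if __name__ == "__main__":
    weights = init_new_layer(4)
    w, v = weights[0], 0.0
    for it in (0, 59_999, 60_000):
        w, v = sgd_momentum_step(w, grad=0.1, velocity=v, lr=learning_rate(it))
        print(it, learning_rate(it), w)
```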

@@ Line 13: / Line 13: @@
-Selective Search<ref name=r4>J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeul- ders, “Selective search for object recognition,” International
+Selective Search<ref name=r4>J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International
 Journal of Computer Vision (IJCV), 2013.</ref> is commonly used, but it is very slow, taking about 2 s per image on a CPU alone. The more recent EdgeBoxes<ref name=r6>C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision (ECCV), 2014.</ref> takes 0.2 s per image, which is still a substantial cost. The paper proposes an RPN that shares convolutions with the detector, reducing the marginal cost of proposals to about 10 ms per image. (The whole pipeline runs at 5 fps on a GPU with VGG-16 and 17 fps with ZF net.)
@@ Line 50: / Line 50: @@
 All newly introduced layers are initialized from a zero-mean Gaussian distribution (\(\sigma = 0.01\)); all other layers, such as the shared conv layers, are initialized from the pretrained model. lr\(=0.001\) for 60k mini-batches and \(0.0001\) for the next 20k on PASCAL VOC, momentum \(0.9\), weight decay \(0.0005\). Implemented in Caffe.
-For training, one can alternate fine-tuning, once for region proposal and once for object detection (‘Alternating training’), or train end-to-end (‘Approximate joint training’)<ref>The paper also briefly mentions ‘Non-approximate joint training’, which must additionally account for the gradients w.r.t. the bbox coordinates. For the concrete method it refers to r15.</ref>. The paper adopts a (pragmatic) 4-step alternating training. ① First train the RPN with ‘the pretrained net + the conv layers added for the RPN’. ② Using the proposals of this RPN, fine-tune the detection net (here Fast R-CNN), which also starts from a pretrained net. ③ Attach ‘the conv layers added for the RPN’ to this fine-tuned net and train the RPN again (from this step on the conv layers are shared and kept fixed). ④ In this state, train the detection net again.
+For training, one can alternate fine-tuning, once for region proposal and once for object detection (‘Alternating training’), or train end-to-end (‘Approximate joint training’)<ref>The paper also briefly mentions ‘Non-approximate joint training’, which must additionally account for the gradients w.r.t. the bbox coordinates. For the concrete method it refers to J. Dai et al<ref name=r15>J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” arXiv:1512.04412, 2015.</ref>.</ref>. The paper adopts a (pragmatic) 4-step alternating training. ① First train the RPN with ‘the pretrained net + the conv layers added for the RPN’. ② Using the proposals of this RPN, fine-tune the detection net (here Fast R-CNN), which also starts from a pretrained net. ③ Attach ‘the conv layers added for the RPN’ to this fine-tuned net and train the RPN again (from this step on the conv layers are shared and kept fixed). ④ In this state, train the detection net again.
 As pretrained nets, ZF (fast version, 5 conv + 3 fc<ref name=r32>M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2014.</ref>) and VGG-16 (13 conv + 3 fc<ref name=r3>K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.</ref>)

@@ Line 36: / Line 36: @@
 + \lambda\frac{1}{N_\text{reg}}\sum_i p_i^* L_\text{reg}(t_i, t_i^*)$$
 ( \(i\) : index of an anchor in a mini-batch, \(p_i\) : prob of objectness (of an anchor), \(p_i^*\) : ground-truth label(objectness), \(0\) or \(1\), \(t_i\) : 4 parameterized coordinates<ref name=r5>R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
-hierarchies for accurate object detection and semantic seg- mentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.</ref> of the predicted bbox, \(L_\text{reg}\) is smooth \(\text{L}_1\) <ref name=r2>R. Girshick, “[[Fast RCNN|Fast R-CNN]],” in IEEE International Conference on Computer Vision (ICCV), 2015.</ref>, \(N_\text{cls}\) : mini-batch size, \(N_\text{reg}\) : number of anchor locations)<br>
+hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.</ref> of the predicted bbox, \(L_\text{reg}\) is smooth \(\text{L}_1\) <ref name=r2>R. Girshick, “[[Fast RCNN|Fast R-CNN]],” in IEEE International Conference on Computer Vision (ICCV), 2015.</ref>, \(N_\text{cls}\) : mini-batch size, \(N_\text{reg}\) : number of anchor locations)<br>
 &nbsp;The experiments use \(N_\text{cls}=256\), \(N_\text{reg}\approx 2400\), and balancing parameter \(\lambda=10\). \(\lambda=10\) is chosen so that the two loss terms are roughly equally weighted, but experiments with several values show that the result is not sensitive to it.<ref>
 {| class=wikitable style="text-align: center; ;"
@@ Line 52: / Line 52: @@
 For training, one can alternate fine-tuning, once for region proposal and once for object detection (‘Alternating training’), or train end-to-end (‘Approximate joint training’)<ref>The paper also briefly mentions ‘Non-approximate joint training’, which must additionally account for the gradients w.r.t. the bbox coordinates. For the concrete method it refers to r15.</ref>. The paper adopts a (pragmatic) 4-step alternating training. ① First train the RPN with ‘the pretrained net + the conv layers added for the RPN’. ② Using the proposals of this RPN, fine-tune the detection net (here Fast R-CNN), which also starts from a pretrained net. ③ Attach ‘the conv layers added for the RPN’ to this fine-tuned net and train the RPN again (from this step on the conv layers are shared and kept fixed). ④ In this state, train the detection net again.
-As pretrained nets, ZF (fast version, 5 conv + 3 fc<ref name=r32></ref>) and VGG-16 (13 conv + 3 fc<ref name=r3>K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.</ref>)
+As pretrained nets, ZF (fast version, 5 conv + 3 fc<ref name=r32>M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2014.</ref>) and VGG-16 (13 conv + 3 fc<ref name=r3>K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.</ref>)
 Simply replacing VGG-16 with ResNet already improves performance.

@@ Line 13: / Line 13: @@
-Selective Search<ref name=r4></ref> is commonly used, but it is very slow, taking about 2 s per image on a CPU alone. The more recent EdgeBoxes<ref name=r6></ref> takes 0.2 s per image, which is still a substantial cost. The paper proposes an RPN that shares convolutions with the detector, reducing the marginal cost of proposals to about 10 ms per image. (The whole pipeline runs at 5 fps on a GPU with VGG-16 and 17 fps with ZF net.)
+Selective Search<ref name=r4>J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeul- ders, “Selective search for object recognition,” International
 <center><div style='width:50%; padding:1em'>[[file:fasterrcnn.png]]</div></center>
-The RPN is built by adding a few conv layers on top of the existing [[Fast RCNN]], and it computes objectness over a regular grid. The RPN is therefore also a kind of FCN<ref name=r7></ref>.
+The RPN is built by adding a few conv layers on top of the existing [[Fast RCNN]], and it computes objectness over a regular grid. The RPN is therefore also a kind of FCN<ref name=r7>J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.</ref>.
-Since the preliminary version was released it has been widely adopted, including commercially at Pinterest<ref name=r17></ref>. Many of the 1st-place entries in the ILSVRC and COCO 2015 competitions are based on Faster R-CNN and RPN.<ref name=r18></ref> <del>Envious.</del>
+Since the preliminary version was released it has been widely adopted, including commercially at Pinterest<ref name=r17>D. Kislyuk, Y. Liu, D. Liu, E. Tzeng, and Y. Jing, “Human cu-
 RPN is a small network that scans the last conv feature map of Fast R-CNN in a sliding-window fashion. The experiments use a \(3 \times 3\) window, which already gives a sufficiently large receptive field. Based on the region proposals obtained this way, the convolutional feature map (the output of ‘the last conv of Fast R-CNN’) is selected and sent to a box-regressor (<i>reg</i>) and a box-classifier (<i>cls</i>) (both 2-layer FC). In practice all sliding windows are computed simultaneously, and each yields at most <i>k</i> proposals. Each of these is called an '''anchor box'''. An anchor box is centered on its sliding window and carries a scale and an aspect ratio. The experiments use \(k=9\) (3 scales × 3 aspect ratios, namely \(128^2, 256^2, 512^2, 1:1, 1:2, 2:1\). Later, for the MS COCO experiments, \(64^2\) was added to account for small objects. Fixing the aspect ratio at 1:1 and using only three scales gives comparable results, but three ratios are kept for the flexibility of the system). (The image is resized so that its shorter side is 600 px and the stride is about 16 px, so a 1000 × 600 image has roughly 20k anchors (\(\approx 1000/16 × 600/16 × 9 \) \(\approx 60 × 40 × 9\)). However, anchors that cross the image boundary are ignored during training, so only about 6,000 of them are used. At test time cross-boundary boxes may still be predicted, because the CNN's receptive field extends across the image boundary. The proposals are not all used either: NMS ([[non-maximum suppression]]) is applied on the \(cls\) scores with an IoU threshold of 0.7, leaving about 2,000.)
-Unlike MultiBox<ref name=r27/>, the approach is translation-invariant. MultiBox uses k-means to generate 800 anchors. <small class=gray>Every sliding-window-based method would seem to be translation invariant, so this does not look very important.</small>
+Unlike MultiBox<ref name=r27>C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, “Scalable,
 Multi-scale detection usually relies on ① an image pyramid, ② a filter pyramid (e.g. CNN filters of various sizes), or a combination of the two. The proposed method can be viewed as a pyramid of anchors (since it uses anchors of various sizes and aspect ratios). A pyramid of anchors differs from both because it uses a single-size image and a single-size filter.
@@ Line 46: / Line 51: @@
 For training, one can alternate fine-tuning, once for region proposal and once for object detection (‘Alternating training’), or train end-to-end (‘Approximate joint training’)<ref>The paper also briefly mentions ‘Non-approximate joint training’, which must additionally account for the gradients w.r.t. the bbox coordinates. For the concrete method it refers to r15.</ref>. The paper adopts a (pragmatic) 4-step alternating training. ① First train the RPN with ‘the pretrained net + the conv layers added for the RPN’. ② Using the proposals of this RPN, fine-tune the detection net (here Fast R-CNN), which also starts from a pretrained net. ③ Attach ‘the conv layers added for the RPN’ to this fine-tuned net and train the RPN again (from this step on the conv layers are shared and kept fixed). ④ In this state, train the detection net again.
-As pretrained nets, ZF (fast version, 5 conv + 3 fc<ref name=r32></ref>) and VGG-16 (13 conv + 3 fc<ref name=r3></ref>)
+As pretrained nets, ZF (fast version, 5 conv + 3 fc<ref name=r32></ref>) and VGG-16 (13 conv + 3 fc<ref name=r3>K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.</ref>)
 Simply replacing VGG-16 with ResNet already improves performance.
 ----
 <references/>

@@ Line 12: / Line 12: @@
 </poem>
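 Selective Search<ref name=r4></ref> is commonly used, but it is very slow, taking about 2 s per image on a CPU alone. The more recent EdgeBoxes<ref name=r6></ref> takes 0.2 s per image, which is still a substantial cost. The paper proposes an RPN that shares convolutions with the detector, reducing the marginal cost of proposals to about 10 ms per image. (The whole pipeline runs at 5 fps on a GPU.)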
 <center><div style='width:50%; padding:1em'>[[file:fasterrcnn.png]]</div></center>
@@ Line 20: / Line 21: @@
 Since the preliminary version was released it has been widely adopted, including commercially at Pinterest<ref name=r17></ref>. Many of the 1st-place entries in the ILSVRC and COCO 2015 competitions are based on Faster R-CNN and RPN.<ref name=r18></ref> <del>Envious.</del>
-RPN is a small network that scans the last conv feature map of Fast R-CNN in a sliding-window fashion. The experiments use a \(3 \times 3\) window, which already gives a sufficiently large receptive field. Based on the region proposals obtained this way, the convolutional feature map (the output of ‘the last conv of Fast R-CNN’) is selected and sent to a box-regressor (<i>reg</i>) and a box-classifier (<i>cls</i>) (both 2-layer FC). In practice all sliding windows are computed simultaneously, and each yields at most <i>k</i> proposals. Each of these is called an '''anchor box'''. An anchor box is centered on its sliding window and carries a scale and an aspect ratio. The experiments use \(k=9\) (aspect 3 \(\times\) ratio 3, \(128^2, 256^2, 512^2, 1:1, 1:2, 2:1\)). (The image is resized so that its shorter side is 600 px and the stride is about 16 px, so a 1000 by 600 image has roughly 20k anchors (\(\approx 1000/16 \times 600/16 \approx 60 \times 40 \times 9\). However, anchors that cross the image boundary are ignored during training, so only about 6,000 of them are used. At test time cross-boundary boxes may still be predicted, because the CNN's receptive field extends across the image boundary. The proposals are not all used either: NMS ([[non-maximum suppression]]) is applied on the \(cls\) scores with an IoU threshold of 0.7, leaving about 2,000.)
+RPN is a small network that scans the last conv feature map of Fast R-CNN in a sliding-window fashion. The experiments use a \(3 \times 3\) window, which already gives a sufficiently large receptive field. Based on the region proposals obtained this way, the convolutional feature map (the output of ‘the last conv of Fast R-CNN’) is selected and sent to a box-regressor (<i>reg</i>) and a box-classifier (<i>cls</i>) (both 2-layer FC). In practice all sliding windows are computed simultaneously, and each yields at most <i>k</i> proposals. Each of these is called an '''anchor box'''. An anchor box is centered on its sliding window and carries a scale and an aspect ratio. The experiments use \(k=9\) (3 scales × 3 aspect ratios, namely \(128^2, 256^2, 512^2, 1:1, 1:2, 2:1\). Later, for the MS COCO experiments, \(64^2\) was added to account for small objects. Fixing the aspect ratio at 1:1 and using only three scales gives comparable results, but three ratios are kept for the flexibility of the system). (The image is resized so that its shorter side is 600 px and the stride is about 16 px, so a 1000 × 600 image has roughly 20k anchors (\(\approx 1000/16 × 600/16 × 9 \) \(\approx 60 × 40 × 9\). However, anchors that cross the image boundary are ignored during training, so only about 6,000 of them are used. At test time cross-boundary boxes may still be predicted, because the CNN's receptive field extends across the image boundary. The proposals are not all used either: NMS ([[non-maximum suppression]]) is applied on the \(cls\) scores with an IoU threshold of 0.7, leaving about 2,000.)
 Unlike MultiBox<ref name=r27/>, the approach is translation-invariant. MultiBox uses k-means to generate 800 anchors. <small class=gray>Every sliding-window-based method would seem to be translation invariant, so this does not look very important.</small>
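
The proposal-filtering step at the end of the RPN paragraph in this hunk can be illustrated with a plain greedy NMS. This is a minimal sketch, not the paper's implementation: it treats 0.7 as the IoU threshold (my understanding of the paper, rather than a cutoff on the \(cls\) score), boxes are `(x1, y1, x2, y2)` tuples, and `top_n` is an illustrative parameter for capping the output at about 2,000 proposals.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.7, top_n=None):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it overlaps every already-kept box by at most `iou_threshold`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if top_n is not None and len(keep) == top_n:
                break
    return keep

if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    # The second box overlaps the first with IoU 0.9, so it is suppressed.
    print(nms(boxes, scores))  # [0, 2]
```

This version is quadratic in the number of boxes, which is fine for illustration; practical implementations vectorise the IoU computation.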
@@ Line 37: / Line 38: @@
 | mAP(%) || 67.2 || 68.9 || 69.9 || 69.1
 |}
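 </ref> The normalization (the values of \(N_\text{cls}\), \(N_\text{reg}\), etc.) did not matter much either.<br>
 &nbsp;Because negative samples (\(p_i^* = 0\)) dominate, the mini-batch is sampled so that the positive : negative ratio is ''at most'' 1:1. If there are fewer positives than half the mini-batch (i.e. fewer than 128 for a mini-batch of 256), the remainder is padded with negatives.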
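
The sampling rule above and the regression term of the loss from this section can be combined in a short sketch. Several things here are assumptions: the classification term does not appear in this excerpt and is taken to be the binary log loss, smooth \(\text{L}_1\) uses the usual breakpoint at 1, and the helper names are illustrative. The defaults \(N_\text{cls}=256\), \(N_\text{reg}=2400\), \(\lambda=10\) are the values quoted in the notes.

```python
import math
import random

def sample_minibatch(labels, batch_size=256, seed=0):
    """Pick anchor indices with at most batch_size // 2 positives;
    the remaining slots are filled with negatives."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    n_pos = min(len(pos), batch_size // 2)
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """(1/N_cls) * sum L_cls + lam * (1/N_reg) * sum p_i* L_reg.

    p: predicted objectness per anchor, p_star: 0/1 labels,
    t, t_star: predicted and target 4-tuples of box parameters.
    """
    eps = 1e-12
    l_cls = 0.0
    l_reg = 0.0
    for pi, ps, ti, ts in zip(p, p_star, t, t_star):
        pi = min(max(pi, eps), 1.0 - eps)  # keep log() finite
        l_cls -= ps * math.log(pi) + (1 - ps) * math.log(1.0 - pi)
        if ps == 1:  # the regression term only counts positive anchors
            l_reg += sum(smooth_l1(a - b) for a, b in zip(ti, ts))
    return l_cls / n_cls + lam * l_reg / n_reg

if __name__ == "__main__":
    labels = [1] * 10 + [0] * 1000
    batch = sample_minibatch(labels)
    print(len(batch), sum(labels[i] for i in batch))  # batch size, positives in batch
```

With only 10 positives available the batch still has 256 anchors, 246 of them negatives, which is the padding behaviour described above.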