Ross Girshick

Microsoft Research

Plain R-CNN^[1] apparently works like this: R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned. … Detection with VGG16 takes 47s / image (on an Nvidia K40 GPU overclocked to 875 MHz). Whoa.

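Plain R-CNN runs a CNN forward pass for every object proposal. SPPnet^[2] instead runs the CNN once over the whole image and extracts each proposal's features from the shared feature map, which speeds up test time by 10 to 100× and training time by about 3×. Unlike R-CNN, however, SPPnet cannot update the convolutional layers that precede the spatial pyramid pooling.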

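The network takes an image and a set of object proposals as input. The image first passes through the ConvNet to produce a feature map, and the RoI pooling layer then uses that feature map together with each object proposal to extract a fixed-length feature vector per proposal. Each feature vector goes through fully connected (fc) layers that branch into two outputs: class information (a softmax over K object classes plus a 'background' class) and location (a bounding box refined by category-specific bounding-box regressors).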
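To make the "fixed-length feature vector" concrete, here is a minimal pure-Python sketch of RoI max pooling on a single-channel feature map. The function name `roi_max_pool`, the `(top, left, height, width)` RoI layout in feature-map cells, and the floor/ceil choice of bin boundaries are assumptions made for illustration, not the paper's implementation.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows (one channel only, for brevity).
    roi: (top, left, height, width) in feature-map cells (assumed layout).
    """
    top, left, h, w = roi
    pooled = []
    for ph in range(out_h):
        # Each output row covers roughly h / out_h input rows.
        r0 = top + (ph * h) // out_h
        r1 = top + ceil_div((ph + 1) * h, out_h)
        row = []
        for pw in range(out_w):
            # Each output column covers roughly w / out_w input columns.
            c0 = left + (pw * w) // out_w
            c1 = left + ceil_div((pw + 1) * w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# A 4x4 feature map holding the values 0..15.
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))  # RoI covering the whole map
print(roi_max_pool(fmap, (1, 1, 3, 2), 2, 2))  # a smaller, non-square RoI
```

Both calls return a 2×2 grid even though the RoIs differ in size, which is what lets the fc layers consume proposals of any shape.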

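Rather than sampling one RoI per image (which is what SPPnet and R-CNN do), training is faster when the same number of RoIs is drawn from fewer images. Each image passes through the CNN only once, and every RoI taken from it reuses that result (the CNN output is shared). An RoI's receptive field often spans the entire input image, so the savings in computation are large. RoIs drawn from the same image are correlated, which could be a problem, but in the actual experiments (a mini-batch of 2 images with 64 RoIs each) it turned out fine.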
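The image-first sampling described above can be sketched as follows. This is only an illustration: `sample_minibatch` and the `rois_by_image` dictionary are hypothetical names, and the defaults (2 images, 128 RoIs in total) simply mirror the setting mentioned in these notes.

```python
import random


def sample_minibatch(rois_by_image, n_images=2, rois_per_batch=128, rng=random):
    """Pick n_images first, then an equal share of RoIs from each image.

    rois_by_image: dict mapping an image id to its list of candidate RoIs.
    Returns a list of (image_id, roi) pairs.
    """
    per_image = rois_per_batch // n_images
    images = rng.sample(sorted(rois_by_image), n_images)
    batch = []
    for img in images:
        candidates = rois_by_image[img]
        picks = rng.sample(candidates, min(per_image, len(candidates)))
        batch.extend((img, roi) for roi in picks)
    return batch


# Toy data: 10 images with 200 candidate RoIs each (RoIs are just ids here).
toy = {f"img{i}": list(range(200)) for i in range(10)}
batch = sample_minibatch(toy, rng=random.Random(0))
n_conv_passes = len({img for img, _ in batch})
print(len(batch), "RoIs from", n_conv_passes, "images")
```

With these defaults the ConvNet runs on 2 images instead of 128, which is where the speed-up comes from.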

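The loss is $$ L = L_\text{cls}(p,u) + \lambda[u \ge 1] L_\text{loc}(t^u, v) $$ Here $ L_\text{cls}(p,u) = -\log p_u $ is the log loss for the true class $u$, $t^u$ is the predicted bounding-box offset for class $u$, and $v$ is the ground-truth regression target (not the detected class). $L_\text{loc}$ is the sum over the four box coordinates of $\text{smooth}_{L_1}(t^u_i - v_i)$, where, writing $x$ for a per-coordinate difference, $\text{smooth}_{L_1}(x)$ is $0.5x^2$ if $|x| \lt 1 $ and $|x| - 0.5$ otherwise. The Iverson bracket is $1$ when its condition holds and $0$ otherwise. The background class has no bounding box, so labeling it $u=0$ makes the localization term drop out of the formula above. $\lambda$ is set to $1$ in all experiments. The smooth L$_1$ loss is said to be less sensitive to outliers than the L$_2$ loss. ~~Really?~~ For large $|x|$ it grows linearly rather than quadratically, so its gradient stays bounded. The regression targets are all offsets, so they are normalized to zero mean and unit variance. One paper^[3] uses a similar loss but trains entirely separate networks for localization and classification, while OverFeat^[4], R-CNN^[1], and SPPnet^[2] use stage-wise training.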
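As a sanity check on the formula, here is a small pure-Python sketch of the multi-task loss for a single RoI. The function names and argument layout are assumptions chosen for readability; `p` is a list of class probabilities, while `t_u` and `v` are 4-element offset tuples.

```python
import math


def smooth_l1(x):
    """0.5 * x^2 when |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(p, u, t_u, v, lam=1.0):
    """L = L_cls(p, u) + lam * [u >= 1] * L_loc(t_u, v) for one RoI."""
    l_cls = -math.log(p[u])  # log loss for the true class u
    if u >= 1:
        # Localization loss: smooth L1 summed over the box coordinates.
        l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v))
    else:
        # Background (u = 0) has no box, so the term is switched off.
        l_loc = 0.0
    return l_cls + lam * l_loc


probs = [0.5, 0.5]
pred, target = (0.0, 0.0, 0.0, 0.0), (0.5, 2.0, 0.0, 0.0)
print(multitask_loss(probs, 1, pred, target))  # classification + localization
print(multitask_loss(probs, 0, pred, target))  # classification only
```

The offset of 2.0 contributes 1.5 to the loss, where a plain squared loss $0.5x^2$ would contribute 2.0, which is the outlier argument in miniature.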

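During training, 25% of the RoIs are taken from proposals whose IoU (intersection over union) with a ground-truth box is at least $0.5$; these are the foreground examples ($u \ge 1$). The rest are taken from proposals whose maximum IoU with the ground truth lies in $[0.1, 0.5)$; these are the background examples and are labeled $u=0$. The lower bound $0.1$ of the interval was chosen heuristically. Each image is flipped horizontally with probability $0.5$, and no other augmentation is used.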
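The thresholds above can be turned into a small labeling routine. This is a sketch under assumptions: boxes are `(x1, y1, x2, y2)` tuples in continuous coordinates, and `iou` and `label_roi` are hypothetical helper names rather than functions from the released code.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_roi(roi, gt_boxes, fg_thresh=0.5, bg_low=0.1):
    """Classify a proposal by its best overlap with any ground-truth box."""
    best = max((iou(roi, g) for g in gt_boxes), default=0.0)
    if best >= fg_thresh:
        return "foreground"   # pool for the 25% foreground share
    if best >= bg_low:
        return "background"   # labeled u = 0
    return "ignored"          # below the 0.1 lower bound


gt = [(1, 0, 3, 2)]
for roi in [(1, 0, 3, 2), (0, 0, 2, 2), (10, 10, 12, 12)]:
    print(roi, label_roi(roi, gt))
```

A sampler would then draw a quarter of each mini-batch from the foreground pool and the remainder from the background pool.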

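Backpropagation through the RoI pooling layer is $$\frac{\partial L}{\partial x_i} = \sum_r \sum_j [ i = i^* (r,j)] \frac{\partial L}{\partial y_{rj}} $$ As far as I can tell, within one mini-batch the gradient reaches $x_i$ only through the outputs for which $x_i$ was the maximum, summed over every RoI $r$ and every output unit $j$. '$x_i$ being the maximum' is not a norm: it means $x_i$ is the largest activation inside the pooling sub-window $\mathcal{R}(r,j)$, i.e. $i = i^*(r,j)$.

The original text reads as follows.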

Let $x_i ∈ \R$ be the $i$-th activation input into the RoI pooling layer and let $y_{rj}$ be the layer’s $j$-th output from the $r$-th RoI. The RoI pooling layer computes $y_{rj} = x_{i^{∗}(r,j)}$, in which $i^{∗}(r, j) = \text{argmax}_{ i'∈\mathcal{R}(r,j)} x_{i'}$ . $\mathcal{R}(r, j) $ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. A single $x_i$ may be assigned to several different outputs $y_{rj}$ .
(omitted)
In words, for each mini-batch RoI $r$ and for each pooling output unit $y_{rj}$, the partial derivative $∂L/∂y_{rj}$ is accumulated if $i$ is the argmax selected for $y_{rj}$ by max pooling.
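The quoted rule can be checked with a tiny example. The sketch below treats the input as a flat list `x` and describes each sub-window $\mathcal{R}(r, j)$ as a list of indices into it; this data layout and the function names are assumptions made for illustration.

```python
def roi_pool_forward(x, windows):
    """windows[r][j] lists the input indices in sub-window R(r, j).

    Returns the outputs y[r][j] and the argmax indices i*(r, j).
    """
    y, argmax = [], []
    for roi_windows in windows:
        y_r, a_r = [], []
        for idx in roi_windows:
            i_star = max(idx, key=lambda i: x[i])  # position of the max
            a_r.append(i_star)
            y_r.append(x[i_star])
        y.append(y_r)
        argmax.append(a_r)
    return y, argmax


def roi_pool_backward(n_inputs, argmax, grad_y):
    """dL/dx_i = sum over r, j of [i == i*(r, j)] * dL/dy_rj."""
    grad_x = [0.0] * n_inputs
    for r, a_r in enumerate(argmax):
        for j, i_star in enumerate(a_r):
            grad_x[i_star] += grad_y[r][j]  # accumulate, do not overwrite
    return grad_x


x = [1.0, 5.0, 2.0, 4.0]
# RoI 0 has two sub-windows; RoI 1 has one that overlaps both of them.
windows = [[[0, 1], [2, 3]], [[1, 2, 3]]]
y, argmax = roi_pool_forward(x, windows)
grad_x = roi_pool_backward(len(x), argmax, [[1.0, 2.0], [3.0]])
print(y, argmax, grad_x)
```

Here `x[1]` is the maximum for one output in each RoI, so it receives the sum of both incoming gradients, while inputs that never win a max receive zero.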


[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[2] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[3] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
[4] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In ICLR, 2014.


Fast RCNN
