Fast RCNN - Revision history

Edit by Admin on 6 August 2017 (Sun) 15:34

2017-08-06T15:34:04Z

Edit by Admin on 6 August 2017 (Sun) 15:32

2017-08-06T15:32:04Z

Edit by Admin on 6 August 2017 (Sun) 15:29

2017-08-06T15:29:35Z

Edit by Admin on 6 August 2017 (Sun) 15:29

2017-08-06T15:29:18Z

Edit by Admin on 6 August 2017 (Sun) 14:58

2017-08-06T14:58:14Z

Edit by Admin on 4 August 2017 (Fri) 09:31

2017-08-04T09:31:30Z

Edit by Admin on 4 August 2017 (Fri) 09:04

2017-08-04T09:04:39Z

Edit by Admin on 4 August 2017 (Fri) 09:02

2017-08-04T09:02:59Z

Edit by Admin on 4 August 2017 (Fri) 09:00

2017-08-04T09:00:03Z

Edit by Admin on 4 August 2017 (Fri) 08:58

2017-08-04T08:58:41Z

@@ Line 1: / Line 1: @@
 Ross Girshick  <ref>He wrote it alone, yet every subject in the paper is “We”. Seems to be the convention. <br><span class=gray>Then what about [https://ko.wikipedia.org/wiki/체스터_윌러드 Chester Willard]?</span></ref>
-Microsoft Research<ref>Apparently at Facebook now.</ref>
+Microsoft Research
 [https://github.com/rbgirshick/fast-rcnn github]

@@ Line 35: / Line 35: @@
 <pdf height=610>file:frcnn12.pdf</pdf>
-I am not sure I understood this correctly, but<ref>The original text reads as follows.
+① Max pooling passes on only the pixel that holds the max value in each pooling window, so the gradient is computed only for that pixel (using the [[unpooling switch]]); ② computation is shared when the RoIs are extracted, so one pixel can feed several outputs, and all the values back-propagated to it are summed. <ref><del>Not confident I understood this correctly, so here is the original…</del>
 <poem><blockquote>Let \(x_i ∈ \R\) be the \(i\)-th activation input into the RoI pooling layer and let \(y_{rj}\) be the layer’s \(j\)-th output from the \(r\)-th RoI. The RoI pooling layer computes \(y_{rj} = x_{i^{∗}(r,j)}\), in which \(i^{∗}(r, j) = \text{argmax}_{ i'∈\mathcal{R}(r,j)} x_{i'}\). \(\mathcal{R}(r, j) \) is the index set of inputs in the sub-window over which the output unit \(y_{rj}\) max pools. A single \(x_i\) may be assigned to several different outputs \(y_{rj}\).<br>(omitted)<br>
-In words, for each mini-batch RoI \(r\) and for each pooling output unit \(y_{rj}\), the partial derivative \(∂L/∂y_{rj}\) is accumulated if \(i\) is the argmax selected for \(y_{rj}\) by max pooling.</blockquote></poem></ref>, ① max pooling passes on only the pixel that holds the max value in each pooling window, so the gradient is computed only for that pixel (using the [[unpooling switch]]); ② computation is shared when the RoIs are extracted, so one pixel can feed several outputs, and all the values back-propagated to it are summed.
+In words, for each mini-batch RoI \(r\) and for each pooling output unit \(y_{rj}\), the partial derivative \(∂L/∂y_{rj}\) is accumulated if \(i\) is the argmax selected for \(y_{rj}\) by max pooling.</blockquote></poem></ref>
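 Two approaches to scale invariance were tried. One fixes the image size and feeds in RoIs by brute force; the other takes them from an image pyramid. With the first, the net has to learn every object size itself (most experiments are run this way); with the second, the RoI size stays roughly constant. In the experiments each RoI was kept as close to \(224^2\) pixels as possible. The results agree with SPPnet: multi-scale is very slightly better, but the two perform almost identically. All other experiments were therefore run single-scale.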

@@ Line 9: / Line 9: @@
 The slide fragments that appear throughout the paper notes below were written and published<ref>http://www.robots.ox.ac.uk/~tvg/publications/talks/fast-rcnn-slides.pdf</ref> by the author himself.
-Plain R-CNN<ref name=r9>R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.</ref> : R-CNN first finetunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned. … Detection with VGG16 takes 47s / image (on a Nvidia K40 GPU overclocked to 875 MHz.). <del>Extremely slow</del>
+Slow R-CNN<ref name=r9>R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.</ref> : R-CNN first finetunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned. … Detection with VGG16 takes 47s / image (on a Nvidia K40 GPU overclocked to 875 MHz.). <del>Extremely slow</del>
 <pdf height=610>file:Frcnn2.pdf</pdf>

@@ Line 9: / Line 9: @@
 The slide fragments that appear throughout the paper notes below were written and published<ref>http://www.robots.ox.ac.uk/~tvg/publications/talks/fast-rcnn-slides.pdf</ref> by the author himself.
-Plain R-CNN<ref name=r9>R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.</ref> apparently works like this: R-CNN first finetunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned. … Detection with VGG16 takes 47s / image (on a Nvidia K40 GPU overclocked to 875 MHz.). Wow ~
+Plain R-CNN<ref name=r9>R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.</ref> : R-CNN first finetunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned. … Detection with VGG16 takes 47s / image (on a Nvidia K40 GPU overclocked to 875 MHz.). <del>Extremely slow</del>
-Plain R-CNN runs a CNN forward pass for every object proposal, whereas SPPnets<ref name=r11>K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.</ref> run the CNN once up front and extract features from its output, improving test time by 10 to 100 times and training time by about 3 times. Unlike R-CNN, however, SPPnets cannot update the convolutional layers that precede the spatial pyramid pooling.
+<pdf height=610>file:Frcnn2.pdf</pdf>
-The input is an image and a set of object proposals. The image first passes through the convnet to produce a feature map; from this feature map and the object proposals, the RoI pooling layer extracts fixed-length feature vectors. These feature vectors pass through the fc layers and produce two outputs: class information (a softmax over K object classes + ‘background’) and location (refined bounding boxes from <i>category-specific</i> bounding-box regressors; four coordinates per class).
+Plain R-CNN runs a CNN forward pass for every object proposal, whereas SPPnets<ref name=r11>K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.<br><pdf height=610>file:frcnn4.pdf</pdf></ref> run the CNN once up front and extract features from its output, improving test time by 10 to 100 times and training time by about 3 times. Unlike R-CNN, however, SPPnets cannot update the convolutional layers that precede the spatial pyramid pooling.<ref>Because of the pyramid pooling. <del>Though it could probably be forced to work…</del><br><pdf height=610>file:frcnn7.pdf</pdf></ref>
-Training is faster when the same number of RoIs is drawn from fewer images than when each RoI comes from a different image (as SPPnet and R-CNN do), because a single CNN pass per image yields the features for every RoI in it (the CNN output is shared). An RoI often covers nearly the whole image, so the computational saving is large. Drawing several RoIs from the same image could be a problem because they are correlated, but in practice (64 RoIs per image, 2 images per mini-batch) it worked fine.
+The input is an image and a set of object proposals. The image first passes through the convnet to produce a feature map; from this feature map and the object proposals, the RoI pooling layer extracts fixed-length feature vectors. These feature vectors pass through the fc layers and produce two outputs: class information (a softmax over K object classes + ‘background’) and location (refined bounding boxes from <i>category-specific</i> bounding-box regressors; four coordinates per class). <ref><pdf height=610>file:frcnn9.pdf</pdf></ref>
 The loss is as follows.
@@ Line 27: / Line 29: @@
 During training, 25% of the RoIs are taken from proposals with IoU (intersection over union) \(\ge 0.5\) against a ground-truth box, and the rest from proposals with IoU \(\in [0.1, 0.5)\). The lower bound \(0.1\) was chosen heuristically. The second group is treated as background (class \(u=0\) in the paper). Images are flipped horizontally with probability \(0.5\); no other augmentation is used.
-Back-propagation through the RoI pooling layer is as follows,
+Back-propagation through the RoI pooling layer is as follows.
 $$\frac{\partial L}{\partial x_i} = \sum_r \sum_j [ i = i^* (r,j)] \frac{\partial L}{\partial y_{rj}} $$
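-As far as I can tell, this says that within one batch (per image) the loss is computed and back-propagated only when \(x_i\) is the maximum, but I do not understand what ‘\(x_i\) is the maximum’ means. Is it just a norm?
+The figure makes this easier to grasp.
-The original text reads as follows.
+I am not sure I understood this correctly, but<ref>The original text reads as follows.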
-<blockquote>
+<poem><blockquote>Let \(x_i ∈ \R\) be the \(i\)-th activation input into the RoI pooling layer and let \(y_{rj}\) be the layer’s \(j\)-th output from the \(r\)-th RoI. The RoI pooling layer computes \(y_{rj} = x_{i^{∗}(r,j)}\), in which \(i^{∗}(r, j) = \text{argmax}_{ i'∈\mathcal{R}(r,j)} x_{i'}\). \(\mathcal{R}(r, j) \) is the index set of inputs in the sub-window over which the output unit \(y_{rj}\) max pools. A single \(x_i\) may be assigned to several different outputs \(y_{rj}\).<br>(omitted)<br>
-Let \(x_i ∈ \R\) be the \(i\)-th activation input into the RoI pooling layer and let \(y_{rj}\) be the layer’s \(j\)-th output from the \(r\)-th RoI. The RoI pooling layer computes \(y_{rj} = x_{i^{∗}(r,j)}\), in which \(i^{∗}(r, j) = \text{argmax}_{ i'∈\mathcal{R}(r,j)} x_{i'}\). \(\mathcal{R}(r, j) \) is the index set of inputs in the sub-window over which the output unit \(y_{rj}\) max pools. A single \(x_i\) may be assigned to several different outputs \(y_{rj}\).<br>(omitted)<br>
+In words, for each mini-batch RoI \(r\) and for each pooling output unit \(y_{rj}\), the partial derivative \(∂L/∂y_{rj}\) is accumulated if \(i\) is the argmax selected for \(y_{rj}\) by max pooling.</blockquote></poem></ref>, ① max pooling passes on only the pixel that holds the max value in each pooling window, so the gradient is computed only for that pixel (using the [[unpooling switch]]); ② computation is shared when the RoIs are extracted, so one pixel can feed several outputs, and all the values back-propagated to it are summed.
 Two approaches to scale invariance were tried. One fixes the image size and feeds in RoIs by brute force; the other takes them from an image pyramid. With the first, the net has to learn every object size itself (most experiments are run this way); with the second, the RoI size stays roughly constant. In the experiments each RoI was kept as close to \(224^2\) pixels as possible. The results agree with SPPnet: multi-scale is very slightly better, but the two perform almost identically. All other experiments were therefore run single-scale.
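
The RoI pooling backward rule discussed in this hunk can be sketched in plain Python. This is only an illustration under assumptions: the function names, the `(row0, col0, row1, col1)` RoI format (end-exclusive, in feature-map coordinates) and the way each RoI is split into bins are hypothetical and are not taken from the paper's code. The stored argmax positions play the role of the "switches", and the `+=` in the backward pass is the sum over \(r\) and \(j\).

```python
def ceil_div(n, d):
    """Integer ceiling division for positive d."""
    return -(-n // d)


def roi_pool_forward(x, rois, out_h, out_w):
    """Max-pool each RoI of the 2-D map x into out_h * out_w outputs.

    Returns (ys, switches): ys[r][j] is y_rj and switches[r][j] is the
    (row, col) position i*(r, j) of the input that was selected.
    """
    ys, switches = [], []
    for (r0, c0, r1, c1) in rois:
        h, w = r1 - r0, c1 - c0
        y, sw = [], []
        for a in range(out_h):
            rs, re = r0 + (a * h) // out_h, r0 + ceil_div((a + 1) * h, out_h)
            for b in range(out_w):
                cs, ce = c0 + (b * w) // out_w, c0 + ceil_div((b + 1) * w, out_w)
                # argmax over the sub-window R(r, j)
                best = max(
                    ((i, j) for i in range(rs, re) for j in range(cs, ce)),
                    key=lambda p: x[p[0]][p[1]],
                )
                y.append(x[best[0]][best[1]])
                sw.append(best)
        ys.append(y)
        switches.append(sw)
    return ys, switches


def roi_pool_backward(shape, switches, grad_y):
    """dL/dx_i = sum over r, j of [i == i*(r, j)] * dL/dy_rj."""
    height, width = shape
    grad_x = [[0.0] * width for _ in range(height)]
    for sw_r, g_r in zip(switches, grad_y):
        for (i, j), g in zip(sw_r, g_r):
            grad_x[i][j] += g  # one input may be the argmax for several outputs
    return grad_x


# Two overlapping RoIs that both select the same input position (1, 1).
x = [[1, 2, 0, 0],
     [3, 9, 1, 0],
     [0, 1, 2, 1],
     [0, 0, 1, 4]]
ys, switches = roi_pool_forward(x, [(0, 0, 2, 2), (1, 1, 4, 4)], 1, 1)
grad_x = roi_pool_backward((4, 4), switches, [[1.0], [2.0]])
print(ys, switches, grad_x[1][1])
```

In this toy case both RoIs pick the input at position (1, 1), so the two incoming gradients are added at that position and every other entry of `grad_x` stays zero.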

@@ Line 1: / Line 1: @@
 Ross Girshick  <ref>He wrote it alone, yet every subject in the paper is “We”. Seems to be the convention. <br><span class=gray>Then what about [https://ko.wikipedia.org/wiki/체스터_윌러드 Chester Willard]?</span></ref>
-Microsoft Research
+Microsoft Research<ref>Apparently at Facebook now.</ref>
 [https://github.com/rbgirshick/fast-rcnn github]
 [https://arxiv.org/abs/1504.08083 arXiv:1504.08083]
 Plain R-CNN<ref name=r9>R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.</ref> apparently works like this: R-CNN first finetunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned. … Detection with VGG16 takes 47s / image (on a Nvidia K40 GPU overclocked to 875 MHz.). Wow ~

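← Older revision		Revision as of 09:31, 4 August 2017 (Fri)
Line 45:		Line 45:
	The final softmax was also swapped for SVMs, but softmax did slightly better. One more piece of evidence that one-shot (single-stage) training beats stagewise training.		The final softmax was also swapped for SVMs, but softmax did slightly better. One more piece of evidence that one-shot (single-stage) training beats stagewise training.

−	Greatly increasing the number of proposals was also tried, but it helped little. Increasing them to excess actually lowers mAP. “sparse object proposals are better”	+	Greatly increasing the number of proposals was also tried, but it helped little. Increasing them to excess actually lowers mAP. “sparse object proposals are better” <ref>The paper appears to rely mainly on selective search, a method that greedily merges superpixels using low-level features.</ref>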

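← Older revision		Revision as of 09:04, 4 August 2017 (Fri)
Line 1:		Line 1:
−	Ross Girshick <ref>He wrote it alone, yet every subject in the paper is “We”. Seems to be the convention. <br>Then what about [https://ko.wikipedia.org/wiki/체스터_윌러드 Chester Willard]?</ref>	+	Ross Girshick <ref>He wrote it alone, yet every subject in the paper is “We”. Seems to be the convention. <br><span class=gray>Then what about [https://ko.wikipedia.org/wiki/체스터_윌러드 Chester Willard]?</span></ref>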

	Microsoft Research		Microsoft Research

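@@ Line 37: / Line 37: @@
 Two approaches to scale invariance were tried. One fixes the image size and feeds in RoIs by brute force; the other takes them from an image pyramid. With the first, the net has to learn every object size itself (most experiments are run this way); with the second, the RoI size stays roughly constant. In the experiments each RoI was kept as close to \(224^2\) pixels as possible. The results agree with SPPnet: multi-scale is very slightly better, but the two perform almost identically. All other experiments were therefore run single-scale.
-At detection time, truncated SVD<ref name=r5></ref><ref name=r23></ref> can be used for a further speed-up. The weight matrix is approximated as \(W \approx U\Sigma_t V^T\), where \(W\) is \(u \times v\) and \(U\) is the \(u \times t\) matrix of the first \(t\) left-singular vectors of \(W\). This cuts the parameter count from \(uv\) to \(t(u+v)\), which pays off when \(t\) is much smaller than \(\min (u, v)\). In the VOC07 experiments with VGG16, FRCN is 146 times faster than R-CNN, and 213 times faster once truncated SVD is added. Against SPPnet the figures are 7 and 10 times, respectively. mAP drops by about 0.3 percentage points.
+At detection time, truncated SVD<ref name=r5>E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.</ref><ref name=r23>J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, 2013.</ref> can be used for a further speed-up. The weight matrix is approximated as \(W \approx U\Sigma_t V^T\), where \(W\) is \(u \times v\) and \(U\) is the \(u \times t\) matrix of the first \(t\) left-singular vectors of \(W\). This cuts the parameter count from \(uv\) to \(t(u+v)\), which pays off when \(t\) is much smaller than \(\min (u, v)\). In the VOC07 experiments with VGG16, FRCN is 146 times faster than R-CNN, and 213 times faster once truncated SVD is added. Against SPPnet the figures are 7 and 10 times, respectively. mAP drops by about 0.3 percentage points.
 For nets that are not very deep, fine-tuning only the final fc layers is known to be enough<ref name=r11 />; experiments confirmed that this does not hold for deep nets. In other words, training through the RoI pooling layer matters. That does not mean every conv layer has to be fine-tuned, though. In the experiments conv1 turned out to be generic, so fine-tuning it had little effect. The choice can be made per task. In this paper, all VGG16 results train only the layers from conv3_1 onward (9 of the 13 conv layers). (Updating fewer layers also avoids GPU memory problems.)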
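
The parameter-count claim in the truncated SVD note of this hunk can be checked with a few lines of Python. This is a rough sketch: the function name is hypothetical, and the layer size and rank used in the example (a \(4096 \times 25088\) layer kept at \(t = 1024\)) are illustrative assumptions, not values stated on this page.

```python
def fc_weight_counts(u, v, t):
    """Weight counts for a u x v fc layer before and after rank-t truncation.

    Before: one u x v matrix W                      -> u * v weights.
    After:  Sigma_t V^T (t x v) followed by U (u x t) -> t * (u + v) weights.
    """
    if not 0 < t <= min(u, v):
        raise ValueError("t must satisfy 0 < t <= min(u, v)")
    return u * v, t * (u + v)


# Illustrative sizes (assumed, not from this page).
full, truncated = fc_weight_counts(4096, 25088, 1024)
print(full, truncated, round(full / truncated, 2))
```

The count only shrinks when \(t(u+v) < uv\), that is \(t < uv/(u+v)\), which is a stricter condition than \(t < \min(u, v)\); for a square \(4096 \times 4096\) layer the break-even rank is \(2048\).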
@@ Line 46: / Line 46: @@
 Greatly increasing the number of proposals was also tried, but it helped little. Increasing them to excess actually lowers mAP. “sparse object proposals are better”
 ----
 <references/>
 <disqus/>