Difference between revisions of "0926 information bottleneck"

Latest revision as of 10:14, 27 September 2017 (Wed)

New Theory Cracks Open the Black Box of Deep Learning

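The information bottleneck method (Naftali Tishby, Fernando C. Pereira, William Bialek; the authors' earliest paper on this topic)

(Probably) the paper that led to the Quanta Magazine coverage: Opening the Black Box of Deep Neural Networks via Information

I could not fully parse the structure of the sentence below, but I got the gist from its final clause alone. I strongly agree with the idea.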
According to Tishby, who views the information bottleneck as a fundamental principle behind learning, whether you’re an algorithm, a housefly, a conscious being, or a physics calculation of emergent behavior, that long-awaited answer “is that the most important part of learning is actually forgetting.”

According to Tishby, Shannon held the view that ‘information is not about semantics’, but I don't understand what this means. (Did Shannon really hold this view?)
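
One possible reading of that remark, offered as a toy check rather than as Tishby's own argument: Shannon's 1948 paper sets aside the "semantic aspects of communication" as irrelevant to the engineering problem, and Shannon entropy does depend only on how often each symbol occurs. The function name `entropy_bits` and the example strings are illustrative assumptions, not taken from the article.

```python
import math
from collections import Counter

def entropy_bits(message):
    """Empirical Shannon entropy (bits per symbol) of a sequence of symbols."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

meaningful = "to be or not to be"
# Sorting the characters destroys the meaning but keeps every symbol count.
scrambled = "".join(sorted(meaningful))

h_meaningful = entropy_bits(meaningful)
h_scrambled = entropy_bits(scrambled)
print(h_meaningful, h_scrambled)
```

Both strings get the same entropy (up to floating-point rounding) because the measure never looks at what the symbols mean, only at their frequencies.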

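Tishby claims that relevance can be defined precisely.^[1]
He had been thinking about this for a very long time, and says that seeing An exact mapping between the Variational Renormalization Group and Deep Learning gave him the idea that his thinking and deep learning (DL hereafter) were related. That paper showed that the way a network operates is exactly identical to ‘renormalization’ (a coarse-graining procedure), which was already well known in physics.~~Or so they say, but what is renormalization? Heh. I think I read about this in Quanta a while back too, but I remember none of it. lol~~ The problem Tishby noticed is that the self-similar (fractal) character this procedure presupposes does not show up in the real world. ~~Somehow reality looks fractal to me, though?~~ He then arrived at the idea that DL and renormalization could be unified under a broader view.
In 2015, Tishby and his student Noga Zaslavsky hypothesized (Deep Learning and the Information Bottleneck Principle) that DL is a process of maximal compression that keeps only the useful data. This time, Shwartz-Ziv's experiments tracked how much information each layer retains about the input and about the output, and confirmed that each layer of the network converges to the theoretical information bottleneck bound. (This theoretical limit was presented in The information bottleneck method.)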
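
For reference, the bound mentioned above comes from the trade-off that The information bottleneck method optimizes. A sketch in standard notation, where T stands for the compressed representation (for DL, a hidden layer) and β for the trade-off parameter; these symbol names are a common convention assumed here, not taken from the article:

```latex
\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y),
\qquad
I(X;T) = \sum_{x,t} p(x,t) \log \frac{p(x,t)}{p(x)\,p(t)}
```

Because T is computed from X alone, the data processing inequality gives I(T;Y) ≤ I(X;Y): a representation can at best keep all of the label information while discarding the rest of the input.
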
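Another thing the two found is that DL goes through a short ‘fitting’ phase and a long ‘compression’ phase. Early in training, the amount of information stored about the input stays roughly constant or rises slightly while the network fits the targets better and better. Then, in the compression phase, the net starts to keep only the strongest information (the information most relevant to the label). When the training data contain accidental correlations, SGD has the effect of randomizing the net, which makes it ‘forget’ them. ^[2] For example, dog photos are relatively likely to appear together with a doghouse, and this is the process of forgetting the features correlated with that ‘house’.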
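
The two phases are read off from two quantities per layer: how much information the layer's activations T keep about the input X, and how much about the label Y. Below is a minimal sketch of a plug-in estimate from discrete samples, assuming activations are binned first. The function names, the 30-bin default, and the [-1, 1] range (suited to tanh units) are illustrative assumptions, not details given in the article.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over the observed pairs.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

def discretize(activations, n_bins=30, low=-1.0, high=1.0):
    """Map each activation vector to a hashable tuple of bin indices."""
    width = (high - low) / n_bins
    return [
        tuple(min(n_bins - 1, max(0, int((v - low) / width))) for v in vec)
        for vec in activations
    ]

# Toy data: 8 equally likely inputs; the label says whether the input is >= 4.
xs = list(range(8))
ys = [x // 4 for x in xs]

t_fit = xs                           # representation that still stores the whole input
t_compressed = [x // 4 for x in xs]  # representation that keeps only the label-relevant bit

for name, t in [("fit", t_fit), ("compressed", t_compressed)]:
    print(name, mutual_information(xs, t), mutual_information(t, ys))
```

In this toy case the first representation keeps 3 bits about the input and the second only 1 bit, while both keep the full 1 bit about the label, which is the kind of forgetting the compression phase describes. For a real layer one would pass hashable input ids together with `discretize(layer_outputs)`; plug-in estimates of this sort are sensitive to the bin count and the sample size.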

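It is not yet known whether this process governs all of DL, or whether other ingredients are needed to understand DL. Andrew Saxe says the compression phase does not seem necessary for very deep nets (because early stopping is enough). Tishby counters that this is because what Saxe's experiments used differs in structure from a typical DNN. Tishby and Shwartz-Ziv recently ran experiments with a large net (330k connections) on MNIST data, observed both phases, and found that the two phases were in fact more clearly separated than with a small net^[3]^[4]. So Tishby says, “I’m completely convinced now that this is a general phenomenon.”

These days researchers are preoccupied with improving performance rather than understanding principles,^[5] but they too hope that this research by Tishby will lead to a better understanding of the fundamental aspects of learning.

Brenden Lake acknowledges that this finding is important for understanding DL, but says it is very different from human learning. The human brain is far more complex (hundreds of trillions of connections, 86 billion neurons) and can learn from even a single example. When a young child learns letters, the child does not learn many patterns for each letter but constructs larger wholes out of individual strokes. This observation also has implications for the AI community: from it, one can see more clearly which problems are ‘solvable with DL’. Image and speech recognition are typical examples of solvable problems. Cryptographic codes (as Tishby suggests) would be an ‘unsolvable’ problem.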

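↑ The article never says how exactly it is ‘precisely’ defined. I'll probably have to read the paper to find out -_-
↑ This seems a bit too obvious to me, but for now I am copying the original text as is.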
Then learning switches to the compression phase. The network starts to shed information about the input data, keeping track of only the strongest features — those correlations that are most relevant to the output label. This happens because, in each iteration of stochastic gradient descent, more or less accidental correlations in the training data tell the network to do different things, dialing the strengths of its neural connections up and down in a random walk. This randomization is effectively the same as compressing the system’s representation of the input data.
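↑ Why is no source given for this? It seems important.
Also, this looks consistent with the experimental results of In Defense of the Triplet Loss for Person Re-Identification, but whether the two describe the same phenomenon needs checking.
↑ Alex Alemi of Google Research has written a paper on how to approximate the information bottleneck in deep nets: Deep Variational Information Bottleneck, ICLR 2017
In the original article it appears briefly near the beginning, as an example of “the community is buzzing about Tishby's findings”.
↑ This is my own loose translation; the original reads as follows.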
AI practitioners have since largely abandoned that path in the mad dash for technological progress, instead slapping on bells and whistles that boost performance with little regard for biological plausibility.


@@ Line 13: / Line 13: @@
   According to Tishby, who views the information bottleneck as a fundamental principle behind learning, whether you’re an algorithm, a housefly, a conscious being, or a physics calculation of emergent behavior, that long-awaited answer “is that the most important part of learning is actually forgetting.”
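-  According to Tishby, Shannon held the view that ‘information is not about semantics’, but I don't understand what this means.
+  According to Tishby, Shannon held the view that ‘information is not about semantics’, but I don't understand what this means. (Did Shannon really hold this view?)
   Tishby claims that relevance can be defined precisely.<ref>The article never says how exactly it is ‘precisely’ defined. I'll probably have to read the paper to find out -_-</ref>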
@@ Line 21: / Line 21: @@
   Then learning switches to the compression phase. The network starts to shed information about the input data, keeping track of only the strongest features — those correlations that are most relevant to the output label. This happens because, in each iteration of stochastic gradient descent, more or less accidental correlations in the training data tell the network to do different things, dialing the strengths of its neural connections up and down in a random walk. This randomization is effectively the same as compressing the system’s representation of the input data. </ref> For example, dog photos are relatively likely to appear together with a doghouse, and this is the process of forgetting the features correlated with that ‘house’.
-  It is not yet known whether this process governs all of DL, or whether other ingredients are needed to understand DL. [http://www.people.fas.harvard.edu/~asaxe/ Andrew Saxe] says the compression phase does not seem necessary for very deep nets (because early stopping is enough). Tishby counters that this is because Saxe's experiments differ in structure from a typical DNN. Tishby and Shwartz-Ziv recently ran experiments with a large net (330k connections) on MNIST data, observed both phases, and found that the two phases were in fact more clearly separated than with a small net<ref>Why is no source given for this? It seems important.
+  It is not yet known whether this process governs all of DL, or whether other ingredients are needed to understand DL. [http://www.people.fas.harvard.edu/~asaxe/ Andrew Saxe] says the compression phase does not seem necessary for very deep nets (because early stopping is enough). Tishby counters that this is because what Saxe's experiments used differs in structure from a typical DNN. Tishby and Shwartz-Ziv recently ran experiments with a large net (330k connections) on MNIST data, observed both phases, and found that the two phases were in fact more clearly separated than with a small net<ref>Why is no source given for this? It seems important.
   Also, this looks consistent with the experimental results of [[In Defense of the Triplet Loss for Person Re-Identification]], but whether the two describe the same phenomenon needs checking.</ref><ref>[https://research.google.com/pubs/104980.html  Alex Alemi] of Google Research has written a paper on how to approximate the information bottleneck in deep nets: [https://arxiv.org/abs/1612.00410 Deep Variational Information Bottleneck], ICLR 2017
   In the original article it appears briefly near the beginning, as an example of “the community is buzzing about Tishby's findings”.</ref>. So Tishby says, “I’m completely convinced now that this is a general phenomenon.”
   These days researchers are preoccupied with improving performance rather than understanding principles<ref>This is my own loose translation; the original reads as follows.
-  AI practitioners have since largely abandoned that path in the mad dash for technological progress, instead slapping on bells and whistles that boost performance with little regard for biological plausibility. </ref>, but they too hope that research like this will lead to a better understanding of the fundamental aspects of learning.
+  AI practitioners have since largely abandoned that path in the mad dash for technological progress, instead slapping on bells and whistles that boost performance with little regard for biological plausibility. </ref>, but they too hope that this research by Tishby will lead to a better understanding of the fundamental aspects of learning.
-  [http://cims.nyu.edu/~brenden/ Brenden Lake] acknowledges that this finding is important for understanding DL, but says it is very different from human learning. The human brain is far more complex (hundreds of trillions of connections, 86 billion neurons) and can learn from even a single example. When a young child learns letters, the child does not learn many patterns for each letter but constructs larger wholes out of individual strokes. This observation also has implications for the AI community, because it lets one work out more clearly which problems are ‘solvable with DL’. Image and speech recognition are typical examples of solvable problems. Cryptographic codes (as Tishby suggests) would be an ‘unsolvable’ problem.
+  [http://cims.nyu.edu/~brenden/ Brenden Lake] acknowledges that this finding is important for understanding DL, but says it is very different from human learning. The human brain is far more complex (hundreds of trillions of connections, 86 billion neurons) and can learn from even a single example. When a young child learns letters, the child does not learn many patterns for each letter but constructs larger wholes out of individual strokes. This observation also has implications for the AI community: from it, one can see more clearly which problems are ‘solvable with DL’. Image and speech recognition are typical examples of solvable problems. Cryptographic codes (as Tishby suggests) would be an ‘unsolvable’ problem.