
Studying ResNet

by ํ–‰๋ฑ 2019. 7. 31.

์ฝ์€ ์ž๋ฃŒ: https://dnddnjs.github.io/cifar10/2018/10/09/resnet/

 

Fine tuning

https://eehoeskrap.tistory.com/186

 

Parameter & Hyperparameter

https://datascience.stackexchange.com/questions/17635/model-parameters-hyper-parameters-of-neural-network-their-tuning-in-training

Neural network์˜ Parameter๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ Connection์˜ Weights๋ฅผ ๋งํ•œ๋‹ค. ์ด Parameters๋Š” Training stage์—์„œ ํ•™์Šต๋œ๋‹ค. ๊ทธ๋ž˜์„œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ž์ฒด (๊ทธ๋ฆฌ๊ณ  ์ธํ’‹ ๋ฐ์ดํ„ฐ๋Š”) ์ด Parameters๋ฅผ ํŠœ๋‹ํ•œ๋‹ค.

Hyperparameters generally mean the learning rate, batch size, and number of epochs. They are called "hyper" because they affect how the parameters are learned. Hyperparameters can be optimized by methods such as grid search, random search, by hand, or using visualization. The validation stage is where you can check whether the parameters have been trained enough and whether the hyperparameters are good (well chosen).
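
To make the distinction concrete, here is a minimal sketch (my own toy example, not from the linked answer): w and b are parameters updated by training itself, while learning_rate and n_epochs are hyperparameters fixed before training begins.

```python
import numpy as np

# Toy data: y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + 0.1 * rng.normal(size=100)

# Hyperparameters: set *before* training (by hand, grid search, ...)
learning_rate = 0.1
n_epochs = 200

# Parameters: learned *during* training
w, b = 0.0, 0.0

for _ in range(n_epochs):
    y_hat = w * x + b
    # Gradients of the mean squared error w.r.t. the parameters
    grad_w = np.mean(2 * (y_hat - y) * x)
    grad_b = np.mean(2 * (y_hat - y))
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # should end up near 2 and 1
```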

 

https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)

In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.

 

https://blogyong.tistory.com/8 : the kinds of hyperparameters and how to optimize them

 

SGD (Stochastic Gradient Descent)

http://shuuki4.github.io/deep%20learning/2016/05/20/Gradient-Descent-Algorithm-Overview.html

Main points: the concept of SGD and the variations of SGD

์ฒ˜์Œ ๊ทธ๋ฆผ ์ž๋ฃŒ ๋‚˜์˜ค๊ธฐ ์ „๊นŒ์ง€ Gradient Descent ๊ฐœ๋…, Batch Gradient Descent์™€ Stochastic Gradient Descent๋ฅผ ๋น„๊ตํ•˜๋Š” ๋‚ด์šฉ์ด ๋‚˜์˜จ๋‹ค. ๊ฐ„๋‹จํžˆ ์š”์•ฝํ•˜๋ฉด BGD๋Š” Loss function์„ ๊ณ„์‚ฐํ•  ๋•Œ Training data ์ „์ฒด๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด๊ณ  SGD๋Š” Training data ์ž‘์€ ์กฐ๊ฐ(์ฆ‰ Mini-batch)์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค. SGD๋Š” BGD์— ๋น„ํ•ด ๋ถ€์ •ํ™•ํ•  ์ˆ˜๋Š” ์žˆ์ง€๋งŒ 1) ๊ณ„์‚ฐ ์†๋„๊ฐ€ ๋นจ๋ผ ๊ฐ™์€ ์‹œ๊ฐ„์— ๋” ๋งŽ์€ Step์„ ๊ฐˆ ์ˆ˜ ์žˆ๊ณ  2) BGD๊ฐ€ Local minimum์— ๋น ์งˆ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์— ๋น„ํ•ด ๊ทธ๋Ÿด ์œ„ํ—˜์ด ์ ๊ณ  3) ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฐ˜๋ณตํ•˜๋ฉด ๋ณดํ†ต BGD์™€ ์œ ์‚ฌํ•œ ๊ฒฐ๊ณผ๋กœ ์ˆ˜๋ ดํ•œ๋‹ค.
* f(x)์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ∇f ๋ผ๊ณ  ํ•œ๋‹ค. ∇๋Š” ๋ฒกํ„ฐ ๋ฏธ๋ถ„ ์—ฐ์‚ฐ์ž๋กœ, ๋‚˜๋ธ”๋ผ(nabla) ๋˜๋Š” ๋ธ(del) ์—ฐ์‚ฐ์ž๋ผ๊ณ  ํ•œ๋‹ค. (์œ„ํ‚ค ๊ธฐ์šธ๊ธฐ(๋ฒกํ„ฐ) ์ฐธ๊ณ ) 

์ฒ˜์Œ ๊ทธ๋ฆผ ์ž๋ฃŒ๋Š” ๋‹จ์ˆœ SGD์™€ SGD์˜ ์—ฌ๋Ÿฌ Variations๋ฅผ ๋น„๊ตํ•œ ๊ฒƒ์ด๋‹ค. ์—ฌ๊ธฐ์„œ ๋‹จ์ˆœ SGD๋Š” ๊ทธ Variations์— ๋น„ํ•ด ์„ฑ๋Šฅ์ด ์ข‹์ง€ ์•Š์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. (๋Š๋ฆฌ๊ณ , ๋ฐฉํ–ฅ๋„ ์ ์ ˆ์น˜ ์•Š์Œ) ๊ทธ๋ž˜์„œ ์ดํ›„๋ถ€ํ„ฐ ์—ฌ๋Ÿฌ Variations๋ฅผ ์†Œ๊ฐœํ•œ๋‹ค. ์ˆ˜์‹์ด ๋‚˜์˜ค๊ธด ํ•˜๋Š”๋ฐ ์ฝ์„ ๋งŒํ•œ ์ •๋„์ธ ๊ฒƒ ๊ฐ™๋‹ค. ๋‹ค ์ฝ์ง€๋Š” ์•Š์•˜์ง€๋งŒ.. Variations ์ข…๋ฅ˜๋ฅผ ๋‚˜์—ดํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.
Momentum, NAG, Adagrad, RMSProp, AdaDelta, Adam

In the end, this is the question of how to move each step when doing gradient descent (and so how to optimize), which is why these are the optimizer types I had seen in tf.
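
For reference, all of the variations above exist as optimizer classes in tf.keras (a sketch assuming TensorFlow 2.x; the learning rates here are just illustrative, not tuned values):

```python
import tensorflow as tf

optimizers = [
    tf.keras.optimizers.SGD(learning_rate=0.01),                # plain SGD
    tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # Momentum
    tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9,
                            nesterov=True),                     # NAG
    tf.keras.optimizers.Adagrad(learning_rate=0.01),
    tf.keras.optimizers.RMSprop(learning_rate=0.001),
    tf.keras.optimizers.Adadelta(learning_rate=1.0),
    tf.keras.optimizers.Adam(learning_rate=0.001),
]
```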

 

Internal Covariate Shift & Batch Normalization

http://dongyukang.com/%EB%B0%B0%EC%B9%98-%EC%A0%95%EA%B7%9C%ED%99%94-%EB%85%BC%EB%AC%B8%EC%9D%84-%EC%9D%BD%EA%B3%A0/

Main points: the concept of internal covariate shift, an overview of the batch normalization algorithm, and its advantages

NN์—์„œ Hyperparameters๋Š” ์—ฌ๋Ÿฌ ๋ณต์žกํ•œ ๊ณ„์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜์ง€๋งŒ ๋ฌธ์ œ๋ฅผ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ๋Š”๋ฐ ๊ทธ ๋ฌธ์ œ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค๊ณ  ์„ค๋ช…ํ•œ๋‹ค.
1) Overfitting
2) Internal covariate shirt

Overfitting์€ ๋‚ด๊ฐ€ ์•Œ๊ณ  ์žˆ๋Š” ๋Œ€๋กœ Training data๋ฅผ ๋„ˆ๋ฌด ๊ณผํ•˜๊ฒŒ ํ•™์Šตํ•˜์—ฌ Training accuracy๋Š” ์ข‹์€๋ฐ Test accuracy๋Š” ๋–จ์–ด์ง€๋Š” ํ˜„์ƒ์ด๋‹ค. ์ฆ‰ Generalization์ด ์ž˜ ์•ˆ ๋œ ๊ฒƒ์ด๋‹ค. ๋˜ ๋‹ค๋ฅธ ๋ฌธ์ œ์ ์€ ๋ช‡๋ช‡ Weights๊ฐ€ ํ‰๊ท  Weights ๊ฐ’๋ณด๋‹ค ํฐ ๊ฒฝ์šฐ ๊ณ„์†ํ•ด์„œ Weighted sum์ด ๋˜๋ฉด์„œ ๊ฐ ๋ ˆ์ด์–ด์˜ ๋ถ„ํฌ๊ฐ€ ๋‹ฌ๋ผ์ง„๋‹ค๋Š” ๊ฒƒ์ธ๋ฐ, ์ด๊ฒƒ์„ Internal covariate shift๋ผ๊ณ  ํ•œ๋‹ค. (Weighted sum์€ xi * wi์˜ ํ•ฉ ์ฆ‰ ์ผ๋ฐ˜์ ์œผ๋กœ Activation function์œผ๋กœ ๋“ค์–ด๊ฐ€๋Š” ๊ฐ’์„ ๋งํ•˜๋Š” ๋“ฏ) ์•ž์—์„œ์˜ ์ž‘์€ ๋ณ€ํ™”๊ฐ€ ๋’ค๋กœ ๊ฐˆ์ˆ˜๋ก ์ปค์ ธ ๋ฌธ์ œ๋ฅผ ๋นš๋Š” ๊ฒƒ๊ณผ ๋น„์Šทํ•œ ๊ฒƒ์œผ๋กœ ์—ฌ๊ธฐ์„œ๋Š” ๊ฐ€์กฑ ์˜ค๋ฝ๊ด€์˜ '๊ณ ์š” ์†์˜ ์™ธ์นจ'์— ๋น„์œ ํ–ˆ๋‹ค. Gradient vanishing/exploding์ด ์ด๋Ÿฐ ๋ฌธ์ œ์— ํ•ด๋‹นํ•œ๋‹ค๊ณ  ํ•œ๋‹ค. Internal covariate shift์™€๋Š” ๋‹ค๋ฅธ ๋ฌธ์ œ๋ผ๊ณ  ์ƒ๊ฐํ–ˆ๋Š”๋ฐ ์•„๋‹Œ๊ฐ€๋ณด๋‹ค. Internal covariate shift๋ผ๋Š” ํ‘œํ˜„์„ 2015๋…„ ๋…ผ๋ฌธ์—์„œ ์ฒ˜์Œ ์ผ๋‹ค๊ณ  ํ•˜๋Š”๋ฐ ์ด๊ฑด ๋ฌธ์ œ๋ฅผ ๋ถ„ํฌ์—์„œ ์ฐพ์•„์„œ ์ด๋ฆ„์„ ์ด๋ ‡๊ฒŒ ๋ถ™์ธ ๊ฒƒ์ด๊ณ  Gradient vanishing/exploding์€ ํ‘œ๋ฉด์ ์œผ๋กœ Gradient๊ฐ€ Vanishing ํ•˜๊ฑฐ๋‚˜ Exploding ํ•˜๋Š” ๊ฒƒ์„ 1์ฐจ์› ์ ์œผ๋กœ ๋‚˜ํƒ€๋‚ธ ๊ฒƒ ์•„๋‹๊นŒ?? (๋‡Œํ”ผ์…œ) ์—ฌํŠผ ๊ฐ ๋ ˆ์ด์–ด๊ฐ€ ๋ฐ›์•„๋“ค์ด๋Š” ์ธํ’‹์˜ ๋ถ„ํฌ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅด๋‹ค๋Š” ๋ฌธ์ œ๋‹ค.

In the end, batch normalization is a method that makes the input distribution each layer receives the same. Roughly, the idea is to transform the inputs to a distribution with mean 0 and standard deviation 1, but it also introduces scale and shift parameters so that this is trainable (the beta and gamma in the algorithm below). The post explains that batch normalization can ultimately be seen as an extra layer — an extra layer that performs normalization before handing the value to the activation.
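
A rough numpy sketch of the training-time forward pass as I understand it (simplified: the moving averages used at inference are omitted, and eps is the usual small constant for numerical stability):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma and beta are the trainable scale/shift."""
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to mean 0, std 1
    return gamma * x_hat + beta             # then scale and shift (trainable)
```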

https://arxiv.org/pdf/1502.03167.pdf


Methods such as ReLU, regularizers / weight decay (L2 or L1), and dropout had been proposed to solve the gradient vanishing/exploding problem, but these were indirect methods that do not touch the data directly. Batch normalization, by contrast, is said to be a direct method that operates on the data itself. The paper reportedly describes the advantages of batch normalization as follows:
1) Making each layer's input distribution the same allows stable training
2) A higher learning rate can be used, so training becomes faster
3) It has its own regularization effect, so weight decay and dropout are not strictly needed
* The article I was reading also mentions, as a further advantage, that initialization matters much less.

๊ทผ๋ฐ Regularizer weight decay๊ฐ€ ๋ญ”์ง€ ๋ชจ๋ฅด๊ฒ ๋‹ค. ^_^..

 

LSTM Mechanism (Youtube)

https://www.youtube.com/watch?v=8HyCNIVRbSU&feature=youtu.be

ํ•œ ๋ฒˆ ๋ดค๋Š”๋ฐ ์ •๋ฆฌํ•˜๋ฉด์„œ ๋‹ค์‹œ ๋ณด๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค.

 

Identity Mapping

https://softwareengineering.stackexchange.com/questions/331981/what-is-identity-mapping-in-neural-networks

Identity mapping ensures that the output of some multilayer neural net is ensured to be equal to its input.
...
An identity map or identity function gives out exactly what it got.

์š”์•ฝ: ํ•ญ๋“ฑํ•จ์ˆ˜๋ž‘ ๋˜‘๊ฐ™์€ ๊ฐœ๋… (h(x) = x)

 

Degradation: Motivation for 1) Highway Networks and 2) ResNet

์ด ๋ถ€๋ถ„์€ ์›๊ธ€์— ์ž˜ ์„ค๋ช…๋˜์–ด ์žˆ์–ด์„œ ๋”ฐ๋กœ ์ฐพ์•„๋ณด์ง€๋Š” ์•Š์•˜๋‹ค.

- Overfitting: training error keeps decreasing, but test error does not
- Degradation: stacking layers deeper gives both higher training error and higher test error than stacking them shallow (training stops working properly as the layers get deeper)

1) Highway networks and 2) ResNet appeared to solve this degradation problem.

I'll focus on ResNet, but to briefly summarize what I understood about highway networks:
similar to how LSTM uses the concept of a cell state to preserve earlier (in time) information, a highway network tries to pass an earlier layer's information to a later layer (relatively without loss or transformation?).

If the value originally passed to the next layer was (x is the input, W_H the weights, H a non-linear (activation) function)
$$y = H(x, W_H)$$ then the value a highway network passes to the next layer is $$y = t \cdot H(x, W_H) + (1 - t) \cdot x$$ (where 0 <= t <= 1).

1) ๊ทธ๋ƒฅ ์ธํ’‹๊ณผ 2) Non-linear๋ฅผ ํ†ต๊ณผํ•œ ์ธํ’‹์„ ์ ์ ˆํ•œ ๋น„์œจ๋กœ ๊ฐ€์ ธ์™€ ๋‹ค์Œ Layer๋กœ ๋ณด๋‚ด๋Š” ๊ฐœ๋…์ด๋‹ค.
๋‹จ t๋Š” ๊ฐ„๋‹จํžˆ ์“ด ๊ฒƒ์ด๊ณ  ์‹ค์ œ๋กœ๋Š” T(x, W_T)์ด๋‹ค. t๊ฐ’ ์ž์ฒด๋„ W_T๋ฅผ ํ•™์Šต์‹œ์ผœ์„œ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋„๋ก ํ•œ ๊ฒƒ์ด ์•„๋‹๊นŒ ์‹ถ๋‹ค. ๋˜ t๋Š” [0, 1] ๋ฒ”์œ„์˜ ๊ฐ’์ด๋ฏ€๋กœ Sigmoid๋ฅผ ์จ์„œ ๋งŒ๋“ ๋‹ค. ์ •๋ฆฌํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค. $$t = T(x, W_T)$$ $$T(x) = Sigmoid(W_T x + b_T)$$

๊ฐ‘์ž๊ธฐ ๋“  ์ƒ๊ฐ์ด..
์›๊ธ€์—์„œ NN์—์„œ (ํ•™์Šต์‹œ์ผœ์•ผ ํ•˜๋Š”) Parameters๋Š” Weight์™€ Bias๋ผ๊ณ  ํ–ˆ๋Š”๋ฐ ์œ„ ์‹์— ์žˆ๋Š” b_T๊ฐ€ ์—ฌ๊ธฐ์„œ ๋งํ•˜๋Š” Bias๊ฐ€ ์•„๋‹๊นŒ ์‹ถ๋‹ค. Bias๊ฐ€ ์—ฌ๋Ÿฌ ๋งฅ๋ฝ์—์„œ ํ•ด์„๋  ์ˆ˜ ์žˆ์–ด์„œ ์•„๊นŒ ์ฐพ๋‹ค๊ฐ€ ๊ทธ๋ƒฅ ๋„˜์–ด๊ฐ”๋Š”๋ฐ.. ์•„๋งˆ ์ด๊ฑฐ ๊ฐ™๋‹ค.

 

ResNet

If H(x) was the function a layer originally wanted to learn, ResNet sets H(x) = F(x) + x and learns F(x) instead. It may look like the same thing in the end, but the meaning lies in opening a path that sends x on to the next layer.

Both highway networks and ResNet ultimately appeared because people wanted to stack layers deeper. (They wanted deeper networks, ran into the degradation problem, and these methods appeared to solve it.) In that context, they can be understood as punching a separate side path so that even with deep stacks, the existing information is lost as little as possible.

ResNet์—์„œ ๋˜ ์ค‘์š”ํ•œ ๊ฒŒ Batch normalization์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด๋ผ๊ณ  ํ•œ๋‹ค. ์‹ค์ œ๋กœ x์™€ ๋”ํ•ด์ง€๋Š” F(x)๋Š” ์ด๊ฒƒ์ด๋‹ค. $$bn(conv(relu(bn(conv(x)))))$$
์ด๊ฑธ x์™€ Element-wise ๋”ํ•œ ๋‹ค์Œ relu์— ๋„ฃ๋Š”๋‹ค. ์ •๋ฆฌํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค. $$relu(x + bn(conv(relu(bn(conv(x))))))$$์ด๊ฑธ Residual block์ด๋ผ๊ณ  ํ•œ๋‹ค.
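
That formula as a tf.keras sketch (my own rendering, not code from the paper; a basic block with two 3x3 convolutions and an identity shortcut, so x must already have filters channels):

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """relu(x + bn(conv(relu(bn(conv(x)))))) with two 3x3 convolutions."""
    f = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    f = layers.BatchNormalization()(f)
    f = layers.ReLU()(f)
    f = layers.Conv2D(filters, 3, padding="same", use_bias=False)(f)
    f = layers.BatchNormalization()(f)          # F(x)
    return layers.ReLU()(layers.Add()([x, f]))  # relu(x + F(x))
```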
ResNet์˜ ์‹ค์ œ ๊ตฌ์กฐ

๋…ผ๋ฌธ์—์„œ๋Š” 1) VGG์— ๋ ˆ์ด์–ด๋ฅผ ์ถ”๊ฐ€ํ•œ Plain network์™€ 2) ResNet์„ ๋น„๊ตํ•œ๋‹ค. ์•„๋ž˜ ๊ทธ๋ฆผ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
https://arxiv.org/pdf/1512.03385.pdf

์›๊ธ€์—์„œ ํ˜ผ๋™๋˜๋Š” ๋ถ€๋ถ„์ด Plain network์™€ ResNet์ด ๊ณต์œ ํ•˜๋Š” ๋‘ ๊ฐ€์ง€ ๊ทœ์น™์ด์˜€๋Š”๋ฐ ๋‚ด์šฉ์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

1) (For the convolutional layers in question) if the feature map size is the same, the number of filters is also the same.
2) If the feature map size is halved, the number of filters is doubled. (The feature map size is halved by strides=2.)

Because they were written in an "if ~ then ~" form, I thought they were causal, but they aren't; they just describe what happens in each case. You can also see this in the figure. Looking at where the shortcut connections are dotted, the number of filters doubles (say 64 to 128, 128 to 256...), and the halving of the feature map size seems to be marked with "/2" after the filter count.

* Dotted shortcut connection: used when the feature map size is halved (it applies a 1x1 convolution and bn, so it has parameters)
* Solid shortcut connection: the remaining cases (it has no parameters)
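
A sketch of the dotted case under the same assumptions as the previous block: the first conv uses strides=2 to halve the feature map (while typically doubling the filters), and the shortcut projects x with a 1x1 convolution (strides=2) plus bn so the shapes match.

```python
from tensorflow.keras import layers

def downsample_block(x, filters):
    """Residual block whose first conv halves the feature map (strides=2)."""
    f = layers.Conv2D(filters, 3, strides=2, padding="same", use_bias=False)(x)
    f = layers.BatchNormalization()(f)
    f = layers.ReLU()(f)
    f = layers.Conv2D(filters, 3, padding="same", use_bias=False)(f)
    f = layers.BatchNormalization()(f)
    # Dotted shortcut: 1x1 conv (strides=2) + bn, so it has parameters
    s = layers.Conv2D(filters, 1, strides=2, padding="same", use_bias=False)(x)
    s = layers.BatchNormalization()(s)
    return layers.ReLU()(layers.Add()([s, f]))
```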

์œ„ ๊ทธ๋ฆผ์€ 34-layer ์ผ ๋•Œ์˜ ๊ตฌ์กฐ์˜€๋Š”๋ฐ, Layer ๊ฐœ์ˆ˜๋ฅผ ๋‹ค๋ฅด๊ฒŒ ํ–ˆ์„ ๋•Œ์˜ ๊ตฌ์กฐ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.
https://arxiv.org/pdf/1512.03385.pdf

ResNet์€ ์–ด๋Š์ •๋„ Degradation์— ํšจ๊ณผ๊ฐ€ ์žˆ๋‹ค๊ณ  ๋ฐํ˜€์กŒ๋‹ค. CIFAR-10์— ์ ์šฉํ•œ ์ด์•ผ๊ธฐ๋ฅผ ์ œ์™ธํ•˜๋ฉด ResNet์— ๋Œ€ํ•œ ์ด์•ผ๊ธฐ๋Š” ์—ฌ๊ธฐ์„œ ๋๋‚œ๋‹ค. ๋‹ค๋งŒ ๊ฐ™์€ ์ €์ž์˜ ์ดํ›„ ๋…ผ๋ฌธ์—์„œ Residual block ๋‚ด๋ถ€ ๊ตฌ์กฐ ์ˆœ์„œ๋ฅผ ๋ณ€๊ฒฝํ•œ Pre-activaion ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•˜์˜€๋Š”๋ฐ, ์ด๊ฒƒ์œผ๋กœ ๋” Layer๋ฅผ ๊นŠ์ด ์Œ“์„ ์ˆ˜ ์žˆ์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค. (์›๋ž˜ ๊ตฌ์กฐ๋Š” Post-activation ๊ตฌ์กฐ) ์ด ๋ถ€๋ถ„์€ ์›๊ธ€์˜ From 100 to 1000 Layers ๋ถ€๋ถ„์„ ์ฐธ๊ณ ํ•˜๋ฉด ๋œ๋‹ค. ์งง๋‹ค.
