๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
๋จธ์‹ ๋Ÿฌ๋‹, ๋”ฅ๋Ÿฌ๋‹/OCR

Attention ๊ณต๋ถ€

by ํ–‰๋ฑ 2019. 8. 12.

์ฝ์€ ์ž๋ฃŒ:

https://youtu.be/iDulhoQ2pro

https://hulk89.github.io/neural%20machine%20translation/2017/04/04/attention-mechanism/

 

Image Captioning

https://github.com/zzsza/Deep_Learning_starting_with_the_latest_papers/blob/master/Lecture_Note/03.%20CNN%20Application/12.Image-Captioning.md

https://greeksharifa.github.io/computer%20vision/2019/04/17/Visual-Question-Answering/

- Image captioning: given an image, generate a sentence that describes it
- VQA (Visual Question Answering): given an image and a question about that image, produce the correct answer

 

Attention

Motivation

https://smerity.com/articles/2016/google_nmt_arch.html

Image captioning์ด๋“  ๋ฒˆ์—ญ์ด๋“  Source๋ฅผ Target์œผ๋กœ ๋ฐ”๊พธ๋Š” ๊ณผ์ •์—์„œ ์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์€ Encoder - Decoder๋กœ ์ด๋ฃจ์–ด์ง„ ๋ชจ๋ธ์„ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค. Attention ์ด์ „์—๋Š” Source์˜ ์ •๋ณด๋ฅผ ํ•˜๋‚˜์˜ ๋ฒกํ„ฐ s๋กœ Encodeํ•˜๊ณ  ์ด๋ฅผ ๋‹ค์‹œ Decodeํ•˜์˜€๋‹ค. ๊ทธ๋Ÿฐ๋ฐ, ๊ฐ€๋ น ๋‚ด๊ฐ€ ์•„๋ž˜์™€ ๊ฐ™์€ ๋ฒˆ์—ญ์„ ํ•˜๊ณ  ์‹ถ๋‹ค๊ณ  ํ•˜์ž.

The cat eats the mouse
โž” Die katze frisst die maus

frisst๋Š” ์› ๋ฌธ์žฅ์˜ eats์— ํ•ด๋‹น๋˜๋Š” ๋‹จ์–ด์ธ๋ฐ, frisst๋ฅผ ๋งŒ๋“ค ๋•Œ eats์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ํ˜๋Ÿฌ์˜จ ๊ฒฝ๋กœ๋ฅผ ๋ณด๋ฉด ๋งค์šฐ ๊ธธ๋ฉฐ (๊ทธ๋ž˜์„œ ๋งŽ์€ transformations๊ฐ€ ๊ฐœ์ž…๋  ์ˆ˜ ์žˆ์œผ๋ฉฐ: Long-range dependency) eats์— ํŠน์ •๋˜๋Š” ์ •๋ณด๋ฅผ ์ฐธ๊ณ ํ•˜์ง€๋„ ์•Š๋Š”๋‹ค. ๊ทธ๋ž˜์„œ frisst๋ฅผ ๋งŒ๋“ค ๋•Œ eats์— ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ฒƒ์ด Attention์ด๋‹ค.

"Before, with RNNs, the entire sentence-to-sentence translation was one sample, because we need to backpropagate through all of these RNN steps (they all happen in sequence). Here (with Attention), the output of one single token is basically one sample. And when the computation is finished, we backpropagate through everything for this one step."

 

Query, Key, and Value

Q, K, V๋กœ ๋ถˆ๋ฆฌ๋Š” Query, Key, Value๋Š” ๊ฐ๊ฐ ๋ฌด์—‡์ผ๊นŒ? ์ด๊ฒƒ์— ๋Œ€ํ•œ ๋ช…์‹œ์ ์ธ ๋‹ต์„ ์ฐพ๊ธฐ๊ฐ€ ์–ด๋ ค์› ๋‹ค..

์œ ํˆฌ๋ธŒ์—์„œ๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ์„ค๋ช…ํ•œ๋‹ค.
Q: output by the encoding part of target sentence
K, V: output by the encoding part of source sentence

์ง€๊ธˆ๊นŒ์ง€ ๋‚ด๊ฐ€ ์ดํ•ดํ•œ๋Œ€๋กœ ์ •๋ฆฌํ•ด๋ณด๋ฉด ...

Q: Target sentence๋ฅผ ๋งŒ๋“œ๋Š” ๊ณผ์ •์—์„œ, "ํ˜„์žฌ๊นŒ์ง€ ์ด๋ ‡๊ฒŒ Sentence๋ฅผ ๋งŒ๋“ค์–ด ์™”๋Š”๋ฐ ๊ทธ๋Ÿผ ์ด๋ฒˆ ์Šคํ…์—์„œ๋Š” ์–ด๋–ค ๋‹จ์–ด๊ฐ€ ๋‚˜์™€์•ผ ํ•ด?" ํ•˜๋Š” ์ฟผ๋ฆฌ. Target์—์„œ ํ˜„์žฌ๊นŒ์ง€์˜ context๋ฅผ ๋‹ด๊ณ  ์žˆ๋‹ค. (ํ—ํฌ ๋ธ”๋กœ๊ทธ ๊ธ€์—์„œ c์— ํ•ด๋‹น)

K: Input sentence์˜ ๊ฐ ๋ถ€๋ถ„์œผ๋กœ, Q์™€ ์œ ์‚ฌ๋„๋ฅผ ๊ฒ€์‚ฌํ•˜๋Š” ๋Œ€์ƒ. ์—ฌ๊ธฐ์„œ ์œ ์‚ฌ๋„๋Š” Q๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š”๋ฐ ์–ผ๋งˆ๋‚˜ ๋„์›€์ด ๋˜๋Š” ์ •๋ณด์ธ๊ฐ€ (์ฆ‰ Q๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š”๋ฐ ์ด K์— ์–ผ๋งˆ๋‚˜ Attention์„ ์ค˜์•ผ ํ•˜๋Š”๊ฐ€)๋ฅผ ์˜๋ฏธํ•œ๋‹ค. ๊ฐ€๋ น ์œ„์˜ ์˜ˆ์—์„œ frisst๋ฅผ ๋งŒ๋“ค์–ด ๋‚ผ ์Šคํ…์—์„œ eats๊ฐ€ ์–ผ๋งˆ๋‚˜ ๋„์›€์ด ๋˜๋Š” ์ •๋ณด์ธ๊ฐ€๋ฅผ ํƒ€์ง„ํ•œ๋‹ค. (๋ฌผ๋ก  The, cat, eats, the, mouse์— ๋Œ€ํ•ด ๊ฐ๊ฐ ์–ผ๋งˆ๋‚˜ ๋„์›€์ด ๋˜๋Š”์ง€๋ฅผ ํŒ๋‹จํ•˜๊ณ , softmax๋ฅผ ์ทจํ•ด eats๊ฐ€ ๊ฐ€์žฅ ๋„์›€์ด ๋˜๋Š” ์ •๋ณด๋ผ๋Š” ๊ฒƒ์„ ์•Œ๊ฒŒ ๋  ๊ฒƒ์ด๋‹ค.) Value๋ฅผ Indexing ํ•˜๋Š” ์—ญํ• ๋„ ํ•œ๋‹ค. (Index hidden states via softmax)

V: The value corresponding to each K, as in (K1, V1), (K2, V2), ...; it is what gets multiplied by the softmax weights described above. Depending on the situation, K and V can apparently be the same values. (In the Hulk blog post, K and V correspond to yi.)

์œ ํŠœ๋ธŒ์—์„œ๋Š” ์œ„์™€ ๊ฐ™์€ ๊ทธ๋ฆผ์œผ๋กœ ์„ค๋ช…ํ•œ๋‹ค.

(๋ฐฐ๊ฒฝ: Q, Ki๋Š” ๊ฐ๊ฐ ๋ฒกํ„ฐ์ด๊ณ , ๊ฐ Ki์— ๋Œ€์‘๋˜๋Š” Vi๊ฐ€ ์žˆ๋‹ค.)
"We compute the dot-product with each of the keys and then we compute a softmax over this, which means that one key will basically be selected (k2): the one that has the biggest dot-product with Q. This is kind of an indexing scheme into this memory of values."
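The dot-product-and-softmax "indexing" described in the quote can be sketched in a few lines of NumPy. The vectors below are made up purely for illustration; the point is that the key most similar to q (here k2) dominates the softmax:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical vectors: one query and three key-value pairs.
q = np.array([1.0, 0.0])
K = np.array([[0.1, 0.9],   # k1
              [1.0, 0.1],   # k2: largest dot-product with q
              [0.2, 0.2]])  # k3
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])

scores = K @ q              # dot-product of q with each key
weights = softmax(scores)   # attention weights; k2 gets the largest weight
output = weights @ V        # weighted sum of values ("soft indexing")
```

Note that the softmax makes this a soft selection: the output is mostly V2, but every value contributes a little, which is what keeps the whole operation differentiable.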

"Basically, the encoder of the source sentence discovers interesting things and it builds Key-Value pairs. And then the encoder of the target sentence builds the Queries. And together they give you kind of the next signal."
"Here's a bunch of things about the source sentence that you might find interesting - that's the value V. The Keys are ways to index the values. And Q is like 'I want to know certain things.'."

 

Mechanism

๊ฐ„๋‹จํžˆ ๋งํ•˜๋ฉด Attention์€ ...
- ์ธํ’‹์„ n๊ฐœ์˜ ๋ถ€๋ถ„์œผ๋กœ ๋‚˜๋ˆ„๊ณ  Encoder๋กœ ๊ฐ ๋ถ€๋ถ„์— ๋Œ€ํ•œ ํ‘œํ˜„์ธ h1, ..., hn์„ ๋งŒ๋“ ๋‹ค. (๊ฐ hidden state)
- ์•„์›ƒํ’‹์„ ๋งŒ๋“œ๋Š” ์–ด๋–ค ์ˆœ๊ฐ„ c์—์„œ h1, ..., hn ์ค‘ ๊ด€๋ จ๋œ ๋ถ€๋ถ„์ด ๋ฌด์—‡์ธ์ง€ ์ฐพ๋Š”๋‹ค. (c๋Š” context์˜ ์•ฝ์ž๋กœ ๊ทธ ์ˆœ๊ฐ„์˜ context๋ฅผ ๋‹ด๊ณ  ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ์Œ)


๋Œ“๊ธ€