
NLP Encoding (wikidocs)

by ํ–‰๋ฑ 2020. 3. 19.

Integer encoding

- Assign a unique integer to each word in the vocabulary (vocab)

- Methods: Python dictionary, Counter, NLTK FreqDist, Keras preprocessing.text

 

1) sentence/word tokenization, cleaning, normalization

2) Build the vocabulary as a mapping with key = word and value = frequency, then sort it by frequency

3) Assign integer indices in frequency order, so the most frequent words get the lowest indices

4) Low-frequency words can be excluded from the vocabulary

5) Convert the natural-language words into their integer indices

- OOV words may appear while converting words to integer indices

- OOV (Out-Of-Vocabulary): a word that is not in the vocabulary, for example one excluded because of its low frequency
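The steps above can be sketched with collections.Counter. This is only a minimal illustration, not the approach of any particular library: the toy corpus, the min_count threshold, and the choice to give the "OOV" token the last index are all assumptions made for the example.

```python
from collections import Counter

# Step 1 is assumed done: a toy corpus that is already tokenized and cleaned.
sentences = [
    ["barber", "person"],
    ["barber", "good", "person"],
    ["barber", "huge", "person"],
    ["knew", "secret"],
    ["secret", "kept", "huge", "secret"],
]

# Step 2: key = word, value = frequency.
counts = Counter(word for sent in sentences for word in sent)

# Sort by descending frequency; ties are broken alphabetically so the
# result is deterministic.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Steps 3-4: drop words seen fewer than min_count times (hypothetical
# threshold), then give more frequent words lower indices, starting at 1.
min_count = 2
kept = [word for word, freq in ranked if freq >= min_count]
word_to_index = {word: i for i, word in enumerate(kept, start=1)}

# Reserve one extra index for out-of-vocabulary words.
word_to_index["OOV"] = len(word_to_index) + 1

# Step 5: replace each word with its index, falling back to the OOV index.
oov = word_to_index["OOV"]
encoded = [[word_to_index.get(w, oov) for w in sent] for sent in sentences]

print(word_to_index)
print(encoded)
```

Words such as "good" and "knew" occur only once here, so they fall below the threshold and are mapped to the OOV index.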

 

 

One-hot encoding

- Performed after integer encoding

- The vector dimension equals the vocabulary size; the index of the word being represented is set to 1 and every other index to 0

- The resulting vector is called a one-hot vector
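A minimal sketch of the idea in plain Python. The one_hot helper and the small word_to_index mapping are hypothetical, and the indices are assumed to start at 0 so they can be used directly as list positions.

```python
def one_hot(index, vocab_size):
    """Return a list of length vocab_size with a 1 at `index` and 0 elsewhere."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


# Hypothetical result of integer encoding, using 0-based indices.
word_to_index = {"barber": 0, "person": 1, "secret": 2, "huge": 3}

vector = one_hot(word_to_index["secret"], len(word_to_index))
print(vector)  # [0, 0, 1, 0]
```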

 

Limitations of one-hot encoding

- The vector dimension grows with the size of the vocabulary

- The vectors are very sparse, so memory is used inefficiently

- Similarity between words cannot be expressed
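The last two points can be checked numerically. The sketch below assumes a vocabulary of 10,000 words (an arbitrary size chosen for illustration) and reuses a hypothetical one_hot helper. Any two distinct one-hot vectors have a dot product of 0, so every pair of different words looks equally unrelated.

```python
def one_hot(index, vocab_size):
    """Return a list of length vocab_size with a 1 at `index` and 0 elsewhere."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


vocab_size = 10_000  # assumed vocabulary size, for illustration only
a = one_hot(0, vocab_size)
b = one_hot(1, vocab_size)

# Sparsity: only one of the 10,000 entries carries information.
nonzero_entries = sum(a)
zero_fraction = 1 - nonzero_entries / vocab_size

# Similarity: the dot product of two different one-hot vectors is always 0.
dot_different = sum(x * y for x, y in zip(a, b))
dot_same = sum(x * x for x in a)

print(nonzero_entries, zero_fraction)
print(dot_different, dot_same)
```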

 

 

References

https://wikidocs.net/31766

https://wikidocs.net/22647
