
Machine Learning, Deep Learning/Paper Classification (10)

Summary of what I tried on 4/23
1. Attempt to apply RMDL
I started with a pip install on Colab, but to change the checkpoint path I forked the repo to my GitHub, edited the path, committed, and then ran git clone (I don't know whether the code can still be modified after a pip install..)
https://stackoverflow.com/questions/49322072/checkpoints-in-google-colab : judging from this, changing the checkpoint path to somewhere under /gdrive may still not work (after mounting, of course..)
https://research.google.com/colaboratory/local-runtimes.html : documentation on Colab local runtimes. It says you can run code locally and access the local file system. Drawbacks..
2020. 4. 24.
NLP Encoding (wikidocs)
Integer encoding
- Assign a unique number to each word in the vocabulary (vocab)
- Methods: Python dictionary, Counter, NLTK FreqDist, Keras preprocessing.text
1) Sentence/word tokenization, cleaning, normalization
2) Build the vocabulary with key=word, value=frequency, and sort it by frequency
3) Assign integer indices starting from the most frequent words (higher frequency, lower index)
4) Low-frequency words can be excluded from the vocabulary
5) Convert the natural-language words to integer indices
- OOV words can appear during the conversion to integer indices
- OOV (Out-Of-Vocabulary): a word that does not exist in the vocabulary (because its frequency is low ..
2020. 3. 19.
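The five steps in this excerpt can be sketched with `collections.Counter`, one of the methods the post lists. This is a minimal illustration, not the post's own code: the toy corpus, the `MIN_COUNT` cutoff, the choice to start indices at 1, and the convention of mapping OOV words to `len(vocab) + 1` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy corpus: already tokenized, cleaned, and lowercased (step 1).
sentences = [
    ["barber", "person"],
    ["barber", "good", "person"],
    ["barber", "huge", "person"],
    ["knew", "secret"],
    ["secret", "kept", "huge", "secret"],
    ["barber", "kept", "secret"],
]

MIN_COUNT = 2  # assumed cutoff: words seen fewer times are dropped (step 4)

# Step 2: count word frequencies, then sort by frequency (descending).
# Ties are broken alphabetically so the result is deterministic.
counts = Counter(word for sent in sentences for word in sent)
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Steps 3 and 4: frequent words get low indices; rare words are skipped.
word_to_index = {}
for word, freq in ranked:
    if freq >= MIN_COUNT:
        word_to_index[word] = len(word_to_index) + 1

# Assumed convention: one shared index for every out-of-vocabulary word.
OOV_INDEX = len(word_to_index) + 1

def encode(tokens):
    """Step 5: map each token to its integer index, or to OOV_INDEX."""
    return [word_to_index.get(tok, OOV_INDEX) for tok in tokens]

encoded = [encode(sent) for sent in sentences]
print(word_to_index)
print(encoded)
```

Here "good" and "knew" each occur once, so they fall below the cutoff and are encoded with the OOV index.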
NLP Cleaning and Normalization (wikidocs)
Cleaning
- Remove noise data from the corpus
- Noise data can mean not only special characters but also unnecessary words that do not fit the purpose of the analysis
1) Removing stopwords
- stopword: a word that carries little meaning (I, my, me, over, particles, suffixes, etc.)
- NLTK defines a stopword list
- For Korean stopwords, define a stopword dictionary in advance and use it (https://www.ranks.nl/stopwords/korean)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
text = 'Family is not an ..
2020. 3. 16.
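The predefined-dictionary approach mentioned for Korean can be sketched without NLTK: keep a hand-made stopword set and filter tokens against it. The stopword set, the sample sentence, and the `remove_stopwords` helper below are hypothetical and only illustrate the idea; a real list would be much longer.

```python
# Hypothetical, hand-defined stopword set standing in for a real dictionary.
STOP_WORDS = {"i", "my", "me", "over", "is", "not", "an", "a", "the"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Whitespace splitting is used here only to keep the example short.
tokens = "I put my keys over the table".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['put', 'keys', 'table']
```

Storing the stopwords in a set keeps each membership check constant-time, which is the same reason the NLTK snippet in the excerpt wraps its list in `set(...)`.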
NLP Tokenization (wikidocs)
Tokenization
- The task of splitting a corpus into tokens
- A token is usually defined as a meaningful unit
1) Word tokenization
- The case where the token unit is a word
- A word may be a single word, a phrase, or any string that carries meaning
- It is not simply a matter of removing punctuation and splitting on whitespace
- For example, when an apostrophe forms a clitic, as in don't, it can be handled in several ways (do + n't / don + t, etc.)
- Some words contain punctuation themselves (m.p.h / Ph.D. / AT&T)
- Sometimes punctuation appears between digits but the whole thing should be treated as one token ($45.55 / 01/02/06 / 123,456..
2020. 3. 16.
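The cases listed in this excerpt can be illustrated with a small regex tokenizer. This is a rough sketch under assumptions, not the method from the post or the rules of any library tokenizer: the pattern, its alternative order, and the choice to split don't as do + n't are picked only for the example, and the pattern handles ASCII letters only.

```python
import re

# Alternatives are tried left to right, so the special cases come first.
TOKEN_RE = re.compile(r"""
    \$?\d+(?:[.,/]\d+)*            # numbers, prices, dates: $45.55, 01/02/06
  | [A-Za-z]+(?:\.[A-Za-z]+)+\.?   # words with inner periods: m.p.h, Ph.D.
  | [A-Za-z]+(?:&[A-Za-z]+)+       # words joined by an ampersand: AT&T
  | [A-Za-z]+(?=n't)               # stem before the clitic: do (from don't)
  | n't                            # the clitic itself
  | [A-Za-z]+                      # ordinary words
  | [^\sA-Za-z0-9]                 # any other single punctuation character
""", re.VERBOSE)

def tokenize(text):
    """Return the tokens of `text` in order of appearance."""
    return TOKEN_RE.findall(text)

tokens = tokenize("Don't pay $45.55 to AT&T on 01/02/06, Ph.D. or not.")
print(tokens)
```

Splitting on whitespace alone would keep "01/02/06," and "not." glued to their punctuation, and stripping punctuation first would break "$45.55" and "AT&T" apart, which is the point the excerpt makes.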