๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
๋จธ์‹ ๋Ÿฌ๋‹, ๋”ฅ๋Ÿฌ๋‹/Paper Classification

NLP Tokenization (wikidocs)

by ํ–‰๋ฑ 2020. 3. 16.

Tokenization

- corpus๋ฅผ token ๋‹จ์œ„๋กœ ๋‚˜๋ˆ„๋Š” ์ž‘์—…

- token์€ ๋ณดํ†ต ์˜๋ฏธ์žˆ๋Š” ๋‹จ์œ„๋กœ ์ •์˜

 

1) Word tokenization

- token์˜ ๊ธฐ์ค€์„ word๋กœ ํ•˜๋Š” ๊ฒฝ์šฐ

- word๋Š” ๋‹จ์–ด, ๋‹จ์–ด๊ตฌ, ์˜๋ฏธ๋ฅผ ๊ฐ–๋Š” ๋ฌธ์ž์—ด๋กœ๋„ ๊ฐ„์ฃผ๋˜๊ธฐ๋„ ํ•จ

- ๋‹จ์ˆœํžˆ punctuation์„ ์ œ๊ฑฐํ•˜๊ณ  whitespace๋ฅผ ๊ธฐ์ค€์œผ๋กœ ํ† ํฐํ™”ํ•˜๋Š” ๊ฒƒ์€ ์•„๋‹˜

 

- ๊ฐ€๋ น don't ์ฒ˜๋Ÿผ '๋กœ ์ ‘์–ด๊ฐ€ ๋ฐœ์ƒํ•œ ๊ฒฝ์šฐ ๋‹ค์–‘ํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ (do + n't / don + t ๋“ฑ)

- ๋‹จ์–ด ์ž์ฒด์— punctuation์„ ๊ฐ€์ง„ ๊ฒฝ์šฐ๋„ ์žˆ์Œ (m.p.h / Ph.D. / AT&T)

- ์ˆซ์ž ์‚ฌ์ด์— punctuation์ด ๋“ค์–ด๊ฐ”์œผ๋‚˜ ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๊ณ  ์‹ถ์€ ๊ฒฝ์šฐ๋„ ์žˆ์Œ ($45.55 / 01/02/06 / 123,456 )

- ๋‹จ์–ด ๋‚ด์— ๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์žˆ๋Š” ๊ฒฝ์šฐ๋„ ์žˆ์Œ (New York / Rock 'n' roll)

 

- "Penn Treebank Tokenization"

- One of the standard tokenization schemes

- Rule 1) Keep hyphenated words as a single token

- Rule 2) Split off clitics attached with an apostrophe

from nltk.tokenize import word_tokenize
word_tokenize(corpus)

from nltk.tokenize import WordPunctTokenizer
WordPunctTokenizer().tokenize(corpus)

from tensorflow.keras.preprocessing.text import text_to_word_sequence
text_to_word_sequence(corpus)

from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)
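
The differences between these tokenizers can be checked with a short sketch (the sample sentence is made up for illustration; only NLTK is assumed):

```python
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer

corpus = "Don't ignore the home-based Ph.D. program."

# Treebank rules: clitics like n't are split off, while hyphenated
# words and word-internal periods (Ph.D.) stay together.
treebank = TreebankWordTokenizer().tokenize(corpus)
print(treebank)

# WordPunctTokenizer splits on every punctuation character instead,
# so Don't becomes Don / ' / t and home-based falls apart.
wordpunct = WordPunctTokenizer().tokenize(corpus)
print(wordpunct)
```

Neither choice is "wrong"; which behavior you want depends on the downstream task.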

 

2) Sentence tokenization

= Sentence segmentation

- ! and ? are fairly reliable sentence boundaries, but . is not always one (192.168.56.31 / Ph.D.)

- A binary classifier is sometimes used to decide whether a period is part of a word or a sentence boundary

- Open-source tools: NLTK, OpenNLP, CoreNLP, splitta, LingPipe, etc.

from nltk.tokenize import sent_tokenize
sent_tokenize(corpus)

%pip install kss  # Korean Sentence Splitter
import kss
kss.split_sentences(corpus)

 

Korean tokenization

- ์˜์–ด๋Š” ๊ฑฐ์˜ ๋„์–ด์“ฐ๊ธฐ ํ† ํฐํ™”์™€ ๋‹จ์–ด ํ† ํฐํ™”๊ฐ€ ๊ฐ™์Œ

- ํ•˜์ง€๋งŒ ํ•œ๊ตญ์–ด๋Š” ๊ต์ฐฉ์–ด์ด๋ฏ€๋กœ ๋„์–ด์“ฐ๊ธฐ ํ† ํฐํ™”๋กœ๋Š” ๋ถ€์กฑํ•จ (๊ต์ฐฉ์–ด: ์กฐ์‚ฌ, ์–ด๋ฏธ ๋“ฑ์„ ๋ถ™์—ฌ ๋ง์„ ๋งŒ๋“œ๋Š” ์–ธ์–ด)

- ๋”ฐ๋ผ์„œ ํ˜•ํƒœ์†Œ ํ† ํฐํ™”๋ฅผ ํ•ด์•ผ ํ•จ (ํ˜•ํƒœ์†Œ: ๋œป์„ ๊ฐ€์ง„ ๊ฐ€์žฅ ์ž‘์€ ๋ง์˜ ๋‹จ์œ„. ์ž๋ฆฝ ํ˜•ํƒœ์†Œ์™€ ์˜์กด ํ˜•ํƒœ์†Œ๋กœ ๋‚˜๋‰จ.)

- ๊ฒŒ๋‹ค๊ฐ€ ํ•œ๊ตญ์–ด๋Š” ๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์–ด๋ ต๊ณ  ๋„์–ด์“ฐ๊ธฐ ์—†์ด๋„ ์ดํ•ดํ•˜๊ธฐ๊ฐ€ ์‰ฌ์›Œ ๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์ž˜ ์ง€์ผœ์ง€์ง€ ์•Š์Œ

from konlpy.tag import Okt
okt = Okt()
okt.morphs(text)
okt.pos(text)
okt.nouns(text)

from konlpy.tag import Kkma  
kkma = Kkma()
kkma.morphs(text)
kkma.pos(text)
kkma.nouns(text)

 

Part-of-speech tagging

= ํ’ˆ์‚ฌ ํƒœ๊น…

- ํ‘œ๊ธฐ๋Š” ๊ฐ™์•„๋„ ํ’ˆ์‚ฌ์— ๋”ฐ๋ผ ๋‹จ์–ด์˜ ์˜๋ฏธ๊ฐ€ ๋‹ฌ๋ผ์ง€๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์Œ (์˜์–ด: fly, ํ•œ๊ตญ์–ด: ๋ชป)

- ๋”ฐ๋ผ์„œ ํ’ˆ์‚ฌ ํƒœ๊น…์ด ํ•„์š”ํ•  ์ˆ˜๋„ ์žˆ์Œ

- ์˜คํ”ˆ์†Œ์Šค: NLTK, KoNLPy

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
pos_tag(word_tokenize(text))
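
A toy, hand-written sketch of why the tag depends on context (not a real tagger; real ones such as nltk's pos_tag learn these decisions from annotated corpora):

```python
def toy_tag(tokens):
    # Hand-written rule for this one ambiguous form: "fly" is a verb
    # (VB) right after "to", otherwise it is treated as a noun (NN).
    tagged = []
    for i, tok in enumerate(tokens):
        if tok == "fly":
            prev = tokens[i - 1] if i > 0 else ""
            tagged.append((tok, "VB" if prev == "to" else "NN"))
        else:
            tagged.append((tok, "?"))  # everything else left untagged
    return tagged

print(toy_tag("I want to fly".split()))    # ('fly', 'VB')
print(toy_tag("a fly sat there".split()))  # ('fly', 'NN')
```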

 

 

References

https://wikidocs.net/21698

๋Œ“๊ธ€