
NLP Cleaning and Normalization (wikidocs)

by ํ–‰๋ฑ 2020. 3. 16.

Cleaning

- Remove noisy data from the corpus.

- Noise is not limited to special characters; it also covers unnecessary words that do not serve the purpose of the analysis.

 

1) Removing stopwords

- stopword: a word that carries little meaning (I, my, me, over, Korean particles, suffixes, etc.)

- NLTK ships a predefined list of English stopwords.

- For Korean, define a stopword dictionary in advance and use that (https://www.ranks.nl/stopwords/korean)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# The stopword list and tokenizer models must be downloaded once via nltk.download().
stop_words = set(stopwords.words('english'))
text = "Family is not an important thing. It's everything."
word_tokens = word_tokenize(text)

# Keep only the tokens that are not in the stopword set.
result = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(result) # 'is', 'not', 'an' are removed.
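
For Korean, the same filter can be applied with a hand-made stopword set. The following is only a minimal sketch: the set `korean_stopwords` and the pre-tokenized `tokens` list are made-up illustrations, not a recommended stopword list or the output of a particular tokenizer.

```python
# Hypothetical, hand-picked stopword set; a real project would load a fuller
# list (for example one built from the ranks.nl page linked above).
korean_stopwords = {'은', '는', '이', '가', '을', '를', '에', '의'}

# Assumption: the sentence was already split into morphemes by a Korean tokenizer.
tokens = ['고양이', '가', '생선', '을', '먹는다']

# Same membership test as in the English example.
result = [t for t in tokens if t not in korean_stopwords]
print(result)  # ['고양이', '생선', '먹는다']
```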

 

2) Removing rare words

- Remove words that occur only rarely in the corpus.
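
One way to do this is to count token frequencies and drop everything below a threshold. This is a sketch with the standard library only; the sample `tokens` and the `min_count` threshold are assumptions chosen for illustration.

```python
from collections import Counter

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat', 'slept']
min_count = 2  # assumed threshold; tune it for the actual corpus

# Count how often each token occurs, then keep only the frequent ones.
counts = Counter(tokens)
frequent = [t for t in tokens if counts[t] >= min_count]
print(frequent)  # ['the', 'cat', 'the', 'the', 'cat']
```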

 

3) Removing words with very short length

- In English, most very short words are stopwords.

- In Korean, by contrast, a single character often carries condensed meaning, so removing short words is not effective.

- The average English word is roughly 6~7 characters long, while the average Korean word is roughly 2~3.

 

- So in English text, removing words of length 1 removes the stopwords 'a' and 'I'.

- Removing words of length 2 removes stopwords such as 'it', 'at', 'to', 'in', and 'by'.

- From length 3 onward, nouns start to be removed as well.

 


Normalization

- Merge words that are written differently into a single word.

 

1) Lemmatization

- lemma: the base (dictionary) form of a word

- Lemmatization traces a word back to its root form even when it appears in different forms, to see whether the number of distinct words can be reduced.

- For example, the lemma of 'am', 'are', and 'is' is 'be'.

 

- Lemmatization requires morphological parsing.

- Morphological parsing separates a word into its stem and its affixes.

- stem: the core part that carries the word's meaning; affix: a part that adds extra meaning to the word

- For example, 'cats' is split into the stem 'cat' and the affix 's'.

 

- Unlike stemming, lemmatization preserves the form of the word reasonably well.

- However, if the lemmatizer does not know the part of speech of the original word, it outputs an inappropriate word.

- Lemmatization takes context into account and preserves the word's part-of-speech information (the POS tag is kept).

- Stemming does not preserve part-of-speech information (the POS tag is not kept), and its output is often not a dictionary word.

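from nltk.stem import WordNetLemmatizer

# The WordNet data must be downloaded once via nltk.download('wordnet').
n = WordNetLemmatizer()
words = ['dies', 'watched', 'has']
# Without a POS argument every word is treated as a noun.
print([n.lemmatize(w) for w in words]) # ['dy', 'watched', 'ha']
# Passing 'v' tells the lemmatizer that the words are verbs.
print([n.lemmatize(w, 'v') for w in words]) # ['die', 'watch', 'have']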
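
Why the part of speech matters can be seen with a toy lookup-based lemmatizer. This is only an illustrative sketch, not how WordNetLemmatizer works internally; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the table holds just a few hand-written entries.

```python
# Tiny hand-written lemma table keyed by (word, part of speech).
LEMMA_TABLE = {
    ('am', 'v'): 'be',
    ('are', 'v'): 'be',
    ('is', 'v'): 'be',
    ('has', 'v'): 'have',
    ('watched', 'v'): 'watch',
    ('cats', 'n'): 'cat',
}

def toy_lemmatize(word, pos='n'):
    # Fall back to the word itself when the (word, pos) pair is unknown.
    return LEMMA_TABLE.get((word, pos), word)

print([toy_lemmatize(w, 'v') for w in ['am', 'are', 'is']])  # ['be', 'be', 'be']
print(toy_lemmatize('watched'))       # watched (treated as a noun, no entry)
print(toy_lemmatize('watched', 'v'))  # watch
```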

 

2) Stemming

- The task of extracting the stem of a word.

- The stem is approximated by applying a fixed set of rules.

- Because the process is coarse, the result is often not a dictionary word.

- Stemming is generally faster than lemmatization.

 

- Each stemmer uses a different algorithm, so results can differ completely from one stemmer to another.

- Try several stemmers on the corpus and pick the one that fits.

 

- Stemming can generalize too much (overstemming) or too little (understemming).

- For example, 'organization' and 'organ' are completely different words, but both can be stemmed to 'organ'.

from nltk.stem import PorterStemmer, LancasterStemmer
p = PorterStemmer()
l = LancasterStemmer()
words = ['policy', 'organization']
print([p.stem(w) for w in words]) # ['polici', 'organ']
print([l.stem(w) for w in words]) # ['policy', 'org']
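
The rule-based nature of stemming, and the overstemming problem, can be reproduced with a toy suffix stripper. This is a deliberately simplified sketch and not the Porter or Lancaster algorithm; `SUFFIXES` and `toy_stem` are made-up names, and the rule list is an assumption chosen to keep the example short.

```python
# Toy suffix rules, checked in order; real stemmers such as Porter use many
# more rules plus conditions on the remaining stem.
SUFFIXES = ('ization', 'ing', 'ed', 's')

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['cats', 'watched', 'organization', 'organ', 'policies']:
    print(w, '->', toy_stem(w))
# cats -> cat
# watched -> watch
# organization -> organ
# organ -> organ
# policies -> policie
```

'organization' and 'organ' collapse to the same stem, and 'policie' is not a dictionary word, which matches the two weaknesses listed above.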

 

3) Case folding (unifying upper and lower case)

- There are cases where upper and lower case must stay distinct (US vs us / the initial letter of proper nouns).

- An ML sequence model can therefore be used to decide when to apply lowercasing.

- However, if casing is used inconsistently in the corpus itself, such a model is of little use, and lowercasing the whole corpus may be the better choice.
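
As a much simpler stand-in for a learned model, selective lowercasing can be sketched with a hand-maintained set of tokens whose case must be kept. The names `PROTECTED` and `fold_case` and the contents of the set are assumptions for illustration only.

```python
# Hypothetical set of tokens whose casing carries meaning.
PROTECTED = {'US', 'NASA'}

def fold_case(tokens):
    # Lowercase every token except the protected ones.
    return [t if t in PROTECTED else t.lower() for t in tokens]

tokens = ['The', 'US', 'team', 'told', 'us', 'about', 'NASA']
print(fold_case(tokens))
# ['the', 'US', 'team', 'told', 'us', 'about', 'NASA']

# Lowercasing everything instead merges 'US' and 'us' into one token.
print([t.lower() for t in tokens])
# ['the', 'us', 'team', 'told', 'us', 'about', 'nasa']
```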

 

 

References

https://wikidocs.net/21693

https://wikidocs.net/21707

https://wikidocs.net/22530
