NLP tutorial (wikidocs)

by ํ–‰๋ฑ 2020. 3. 16.

Pandas

1) Series

- A one-dimensional array in which each value is paired with an index label

- Consists of values and an index
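A minimal sketch of the value/index pairing described above. The item names and prices are made up for illustration.

```python
import pandas as pd

# Hypothetical prices, labeled by item name instead of the default 0..n-1 index.
prices = pd.Series([17000, 18000, 1000, 5000],
                   index=["pizza", "chicken", "cola", "beer"])

print(prices.values)   # the underlying 1-D array of values
print(prices.index)    # the labels attached to those values
print(prices["cola"])  # look up a single value by its label
```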

 

2) DataFrame

- A two-dimensional array of values with a row index and a column index

- Consists of values, index, and columns

- Can be created from a list, dict, ndarray, Series, or another DataFrame

- Can also be created by reading external data such as csv, text, excel, sql, html, or json
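The creation routes listed above could look roughly like this. The table contents, labels, and the in-memory CSV text are all assumed examples; a real file path would normally be passed to pd.read_csv().

```python
import io
import pandas as pd

# Hypothetical 3x3 table with explicit row labels (index) and column labels.
values = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(values, index=["one", "two", "three"], columns=["A", "B", "C"])

print(df.values)   # 2-D array of values
print(df.index)    # row labels
print(df.columns)  # column labels

# Built from a dict: the keys become the column names.
df_from_dict = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})

# Built from CSV text; StringIO stands in for a file on disk.
df_from_csv = pd.read_csv(io.StringIO("A,B\n1,2\n3,4\n"))
```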

 

 

Numpy

1) ndarray creation

- np.array() creates an ndarray from a list or tuple

- np.zeros(shape), np.ones(shape), np.full(shape, num), np.eye(n), np.random.random(shape)

- np.arange(start, stop, step, dtype)

- arr.ndim: number of dimensions, arr.shape: size along each dimension
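A short sketch of the constructors listed above, with arbitrary example shapes. Note that np.eye() takes the matrix size as an integer rather than a shape tuple.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # from a nested list
z = np.zeros((2, 3))                  # all zeros
o = np.ones((2, 3))                   # all ones
f = np.full((2, 2), 7)                # every element is 7
e = np.eye(3)                         # 3x3 identity matrix
r = np.random.random((2, 2))          # uniform samples in [0, 1)
s = np.arange(1, 10, 2)               # 1, 3, 5, 7, 9

print(a.ndim, a.shape)  # 2 (2, 3)
```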

 

3) ndarray reshape

- arr.reshape(shape)
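As a small illustration, reshape keeps the same elements and only changes how they are arranged. The array contents here are arbitrary.

```python
import numpy as np

arr = np.arange(12)          # 0..11, shape (12,)
grid = arr.reshape((3, 4))   # the same 12 elements as 3 rows x 4 columns
auto = arr.reshape((2, -1))  # -1 lets NumPy infer the remaining dimension

print(grid.shape, auto.shape)
```

The total number of elements must match, so reshaping 12 elements into (5, 3) would raise a ValueError.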

 

4) ndarray slicing

- arr = arr[0:2, 0:2]

- When slicing a multi-dimensional array, specify a slice range for each dimension
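A sketch of per-dimension slicing on an assumed 3x3 example array:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

top_left = arr[0:2, 0:2]  # rows 0-1 and columns 0-1
first_row = arr[0, :]     # every column of row 0
second_col = arr[:, 1]    # every row of column 1

print(top_left)
```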

 

5) ndarray integer indexing

- Picks a sub-array out of the original array

- arr = arr[[2, 1], [1, 0]] is an ndarray holding the elements at row 2, column 1 and at row 1, column 0 (zero-based)
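To make the pairing of row and column lists concrete, here is the same expression on an assumed 3x2 array:

```python
import numpy as np

arr = np.array([[1, 2],
                [4, 5],
                [7, 8]])

# The row list and column list are read pairwise: (2, 1) and (1, 0).
picked = arr[[2, 1], [1, 0]]

print(picked)  # [8 4]
```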

 

6) ndarray arithmetic

- +, -, *, / or np.add(), np.subtract(), np.multiply(), np.divide(): element-wise operations on arrays

- np.dot(): matrix product
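The difference between the element-wise * and the matrix product can be seen on two small example matrices (values chosen arbitrarily):

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

elementwise = x * y    # same result as np.multiply(x, y)
matrix = np.dot(x, y)  # rows of x times columns of y

print(elementwise)  # [[ 5 12] [21 32]]
print(matrix)       # [[19 22] [43 50]]
```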

 

 

Matplotlib

- plt.title(), plt.plot(), plt.xlabel(), plt.ylabel(), plt.legend(), plt.show()

- plt.plot([1, 2, 3, 4], [2, 4, 8, 6]): [1, 2, 3, 4] are the x-axis values and [2, 4, 8, 6] are the y-axis values

- Calling plt.plot() several times adds several lines to the same figure

- plt.legend() inserts a legend that shows what each line represents
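Putting the calls above together might look like the following. The title, axis names, second data series, and legend labels are assumed for illustration.

```python
import matplotlib.pyplot as plt

plt.title("students")  # hypothetical title
# Each plt.plot() call adds one line; label= names it for the legend.
line_a, = plt.plot([1, 2, 3, 4], [2, 4, 8, 6], label="A student")
line_b, = plt.plot([1.5, 2.5, 3.5, 4.5], [3, 5, 8, 10], label="B student")
plt.xlabel("hours")    # name of the x axis
plt.ylabel("score")    # name of the y axis
legend = plt.legend()  # builds the legend from the label= values
plt.show()
```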

 

 

EDA

- Before running any ML, first understand the nature of the data

- Check the distribution of values, the relationships between variables, whether NULL values exist, and so on

- This process is called EDA (Exploratory Data Analysis)
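The three checks named above can be sketched with plain pandas calls. The toy dataset and its column names are assumptions, not data from the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with one missing value.
data = pd.DataFrame({
    "length": [120, 45, 300, np.nan, 80],
    "label": [1, 0, 1, 0, 0],
})

print(data.describe())             # distribution of each numeric column
null_counts = data.isnull().sum()  # number of NULL values per column
print(null_counts)
print(data.corr())                 # pairwise correlation between variables
```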

 

 

Pandas profiling

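import pandas as pd
import pandas_profiling  # importing this adds DataFrame.profile_report()

# Load the CSV file to profile
data = pd.read_csv('/my_csv.csv', encoding='latin1')
# Build the profiling report
pr = data.profile_report()
# Write the report to an HTML file (the output path is an assumed example)
pr.to_file('./pr_report.html')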

- Overview: Dataset info, Variable types, Warnings

- Variables: statistics for each feature; click Toggle details to see the details

 

 

ML workflow

1) Acquisition

- Collect a corpus (natural language data)

- txt, csv, xml, etc.

 

2) Inspection and Exploration

- Also called the EDA stage

- Understand the structure, characteristics, and relationships in the data

- May involve visualization or simple statistical tests

 

3) Preprocessing and Cleaning

- For NLP this includes tokenization, cleaning, normalization, stopword removal, and so on

- Requires knowledge of a variety of libraries
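As a rough, library-free sketch of those steps for English text: the stopword list and the preprocess() helper below are hypothetical, and a real pipeline would more likely use a dedicated NLP library.

```python
import re

# Tiny hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to"}

def preprocess(text):
    """Normalize, tokenize, and filter a sentence."""
    text = text.lower()                      # normalization: case folding
    tokens = re.findall(r"[a-z0-9]+", text)  # tokenization and cleaning: drop punctuation
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("The quality of the corpus is KEY to NLP!"))
# ['quality', 'corpus', 'key', 'nlp']
```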

 

4) Modeling and Training

- Split the data into a training set, a validation set, and a testing set

- Train on the training set and check against the validation set while improving model performance
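One possible way to make the three-way split, sketched with NumPy only. The split_dataset() helper, the 80/10/10 ratio, and the fixed seed are assumptions rather than anything the tutorial prescribes.

```python
import numpy as np

def split_dataset(samples, val_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle the samples and split them into train, validation, and test lists."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))  # shuffled indices
    n_test = round(len(samples) * test_ratio)
    n_val = round(len(samples) * val_ratio)
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    pick = lambda idx: [samples[i] for i in idx]
    return pick(train_idx), pick(val_idx), pick(test_idx)

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```

The testing set is held back until the final evaluation, so it should not influence any tuning done against the validation set.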

 

5) Evaluation

- Evaluate on the testing set

 

6) Deployment

 

 

References

https://wikidocs.net/32829

https://wikidocs.net/47193

https://wikidocs.net/31947
