NLP tutorial (wikidocs)

Pandas

1) Series

- A one-dimensional array of values, each paired with an index label

- Made up of values and an index
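
A minimal sketch of the value/index pairing described above. The labels and numbers are made-up sample data, not taken from the tutorial.

```python
import pandas as pd

# Illustrative data: the labels and prices are invented for this sketch.
sr = pd.Series([17000, 18000, 1000, 5000],
               index=["pizza", "chicken", "cola", "beer"])

print(sr.values)    # the underlying 1-D array of values
print(sr.index)     # the label attached to each value
print(sr["cola"])   # look a value up by its label
```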

2) DataFrame

- A two-dimensional array of values with a row index and a column index

- Made up of values, an index, and columns

- Can be created from a list, dict, ndarray, Series, or another DataFrame

- Can also be created by reading external data such as csv, text, excel, sql, html, or json
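
The creation routes above can be sketched roughly as follows. All data here is invented, and io.StringIO stands in for a real CSV file so the snippet runs without any file on disk.

```python
import io
import pandas as pd

# From a nested list, with explicit row and column labels.
df_list = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                       index=["r1", "r2"], columns=["A", "B", "C"])

# From a dict: the keys become column names.
df_dict = pd.DataFrame({"name": ["kim", "lee"], "score": [90, 85]})

# From CSV text; io.StringIO is a stand-in for a file path (assumption).
csv_text = "id,word\n1,hello\n2,world\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_list.values)    # 2-D array of values
print(df_list.index)     # row labels
print(df_list.columns)   # column labels
```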

Numpy

1) Creating an ndarray

- np.array() creates an ndarray from a list or tuple

- np.zeros(shape), np.ones(shape), np.full(shape, num), np.eye(n), np.random.random(shape)

- np.arange(start, stop, step, dtype)

- arr.ndim: number of dimensions, arr.shape: size along each dimension
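
A short sketch of the constructors listed above. The shapes and fill values are arbitrary choices for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # from a nested list
z = np.zeros((2, 3))                   # all zeros
o = np.ones((2, 3))                    # all ones
f = np.full((2, 2), 7)                 # every element set to 7
e = np.eye(3)                          # 3x3 identity; takes a size, not a shape tuple
r = np.random.random((2, 2))           # uniform samples in [0, 1)
n = np.arange(1, 10, 2)                # start=1, stop=10 (exclusive), step=2

print(a.ndim, a.shape)
print(n)
```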

3) ndarray reshape

- arr.reshape(shape)
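
A small sketch of reshape, assuming a 12-element array chosen for illustration. The total number of elements must stay the same, and one dimension can be given as -1 so that NumPy infers it.

```python
import numpy as np

arr = np.arange(12)           # 0, 1, ..., 11
m = arr.reshape(3, 4)         # 3 rows, 4 columns
auto = arr.reshape(2, -1)     # -1 is inferred as 6

print(m)
print(auto.shape)
```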

4) ndarray slicing

- arr = arr[0:2, 0:2]

- When slicing a multidimensional array, give a slice range for each dimension
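
A sketch of per-dimension slicing on a made-up 3x3 array. The result is kept under a new name here so the original array stays available.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

sub = arr[0:2, 0:2]   # rows 0-1, columns 0-1
row = arr[1, :]       # the whole of row 1
col = arr[:, 2]       # the whole of column 2

print(sub)
```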

5) ndarray integer indexing

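- Builds a sub-array by picking elements of the original array at given integer positions

- arr = arr[[2, 1], [1, 0]] gives an ndarray holding the element at row 2, column 1 and the element at row 1, column 0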
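
A sketch of the same indexing expression on a made-up 3x2 array. The first list supplies row positions and the second supplies column positions, paired element by element.

```python
import numpy as np

arr = np.array([[1, 2],
                [4, 5],
                [7, 8]])

# Picks arr[2, 1] and arr[1, 0], in that order.
picked = arr[[2, 1], [1, 0]]

print(picked)
```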

6) ndarray arithmetic

- +, -, *, / or np.add(), np.subtract(), np.multiply(), np.divide(): element-wise operations on arrays

- np.dot(): matrix product (note that * is element-wise, not a matrix product)
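
A sketch contrasting element-wise operations with the matrix product, using two made-up 2x2 arrays.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

added = x + y                # same result as np.add(x, y)
elementwise = x * y          # same result as np.multiply(x, y)
divided = np.divide(x, y)    # element-wise true division, gives floats
matmul = np.dot(x, y)        # matrix product: rows of x times columns of y

print(elementwise)
print(matmul)
```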

Matplotlib

- plt.title(), plt.plot(), plt.xlabel(), plt.ylabel(), plt.legend(), plt.show()

- plt.plot([1, 2, 3, 4], [2, 4, 8, 6]): [1, 2, 3, 4] are the x-axis values and [2, 4, 8, 6] are the y-axis values

- Calling plt.plot() several times adds several lines to the same plot

- plt.legend() inserts a legend showing what each line represents
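
A sketch putting the calls above together. The title, axis names, line labels, and the second data series are invented for illustration.

```python
import matplotlib.pyplot as plt

plt.title("scores")                                               # plot title
plt.plot([1, 2, 3, 4], [2, 4, 8, 6], label="class A")             # first line
plt.plot([1.5, 2.5, 3.5, 4.5], [3, 5, 8, 10], label="class B")    # second line
plt.xlabel("hours")                                               # x-axis name
plt.ylabel("score")                                               # y-axis name
plt.legend()                                                      # uses the label= values
ax = plt.gca()                                                    # handle to the current axes
plt.show()
```

The legend entries come from the label= argument passed to each plt.plot() call.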

EDA

- Before running ML, first understand the nature of the data

- Check the distribution of values, the relationships between variables, whether NULL values exist, and so on

- This process is called EDA (Exploratory Data Analysis)
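
A rough sketch of these checks using plain pandas. The DataFrame and its column names are hypothetical sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({
    "length": [120, 85, 300, np.nan, 150],
    "label": ["spam", "ham", "spam", "ham", "ham"],
    "num_links": [3, 0, 7, 1, 0],
})

print(df.describe())                          # distribution of numeric columns
null_counts = df.isnull().sum()               # NULL count per column
label_counts = df["label"].value_counts()     # distribution of a categorical column
corr = df[["length", "num_links"]].corr()     # relationship between numeric variables

print(null_counts)
print(label_counts)
print(corr)
```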

Pandas profiling

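import pandas as pd
import pandas_profiling  # adds profile_report() to DataFrame; the package was later renamed ydata-profiling

data = pd.read_csv('/my_csv.csv', encoding='latin1')
pr = data.profile_report()        # build the profiling report
pr.to_file('./pr_report.html')    # write it to HTML; the output path is a placeholder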

- Overview: Dataset info, Variable types, Warnings

- Variables: statistics for each feature; click Toggle details to see more

ML workflow

1) Acquisition

- Collect a corpus (natural language data)

- Formats include txt, csv, xml, etc.

2) Inspection and Exploration

- Also called the EDA stage

- Understand the structure, characteristics, and relationships in the data

- May include visualization or simple statistical tests

3) Preprocessing and Cleaning

- For NLP, includes tokenization, cleaning, normalization, stopword removal, etc.

- Requires familiarity with a variety of libraries
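
A toy sketch of these steps using only the standard library. In practice a dedicated NLP library would usually handle them. The stopword list and the preprocess() helper are hypothetical, and the regex only handles lowercase ASCII letters and digits.

```python
import re

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to"}

def preprocess(text):
    """Normalize, tokenize, clean, and drop stopwords from an English string."""
    text = text.lower()                                 # normalization: case folding
    tokens = re.findall(r"[a-z0-9]+", text)             # tokenization; punctuation is dropped
    return [t for t in tokens if t not in STOPWORDS]    # stopword removal

tokens = preprocess("The quality of the corpus is KEY to NLP!")
print(tokens)
```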

4) Modeling and Training

- Split the data into a training set, validation set, and testing set

- Train on the training set and validate on the validation set to improve model performance
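
A minimal sketch of a three-way split using shuffled indices. The split_indices() helper, the 80/10/10 ratios, and the seed are assumptions chosen for illustration.

```python
import numpy as np

def split_indices(n, val_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle 0..n-1 and cut the result into train/validation/test index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = round(n * test_ratio)
    n_val = round(n * val_ratio)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index arrays do not overlap, so the testing set stays unseen until the evaluation step.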

5) Evaluation

- Evaluate on the testing set

6) Deployment

References

https://wikidocs.net/32829

https://wikidocs.net/47193

https://wikidocs.net/31947
