Data Preprocessing: Removing Duplicate Data, Type Conversion, Encoding

멍주다배 2024. 12. 5.

import pandas as pd
import numpy as np

# Create sample data
data = {
    "Name": ["Alice", "Bob", "Alice", "David", "Eve", "Frank", "Gina", "Hank", "Ivy", "Jack"],
    "Age": [25, 30, "25", 35, 29, 40, None, 33, 30, 27],
    "Gender": ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],
    "City": ["Seoul", "Busan", "Seoul", "Daegu", "Incheon", "Busan", "Daegu", "Incheon", "Seoul", "Daegu"],
    "Salary": [50000, 60000, 50000, 55000, 62000, 58000, 0, 50000, 61000, 58000],
    "Date_Joined": ["2020-01-15", "2019-07-20", "2020-01-15", "2018-03-25", "2021-11-30", "2017-06-14", "Unknown", "2019-02-18", "2020-12-10", "2020-08-05"]
}

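df = pd.DataFrame(data)

# Age mixes integers, the string "25", and None, so pandas stores it as object dtype.
# Convert it to a numeric dtype first (None becomes NaN, giving float64) so that
# 25 and "25" compare as equal when duplicated rows are detected.
df["Age"] = pd.to_numeric(df["Age"])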

Removing duplicate data

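Removing duplicated rows where appropriate improves model training and analysis results.
- The more duplicated rows a dataset contains, the more it can become biased toward the repeated records.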
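
As a small illustration of the bias point above, the sketch below compares the mean of a column before and after removing duplicates. The `scores` table and its values are made up for this example and do not come from the post's dataset.

```python
import pandas as pd

# Hypothetical data: the row for student "A" was recorded three times by mistake
scores = pd.DataFrame({
    "student": ["A", "A", "A", "B", "C"],
    "score": [90, 90, 90, 60, 30],
})

# With duplicates: (90 * 3 + 60 + 30) / 5
mean_with_duplicates = scores["score"].mean()

# Without duplicates: (90 + 60 + 30) / 3
mean_without_duplicates = scores.drop_duplicates()["score"].mean()

print(mean_with_duplicates)     # 72.0
print(mean_without_duplicates)  # 60.0
```

The repeated record pulls the average toward its own value, which is the kind of skew that deduplication is meant to prevent.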

df[df.duplicated()]   # show the duplicated rows

df.duplicated().sum() # number of duplicated rows
# Output: 1

df = df.drop_duplicates()   # remove the duplicated rows
df
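
One thing worth checking with data like this is the column dtype, because `duplicated()` compares values exactly. The following is a minimal sketch on a hypothetical three-row table (`people` is not part of the post) showing that the integer 25 and the string "25" are not treated as the same value, and how the `subset` and `keep` parameters of `drop_duplicates()` change which rows survive.

```python
import pandas as pd

# Hypothetical mini-table: the second "Alice" row stores Age as the string "25"
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [25, 30, "25"],
    "City": ["Seoul", "Busan", "Seoul"],
})

# 25 (int) and "25" (str) are different values, so no full-row duplicate is found
n_before = int(people.duplicated().sum())

# After converting Age to a numeric dtype the two Alice rows become identical
people["Age"] = pd.to_numeric(people["Age"])
n_after = int(people.duplicated().sum())

print(n_before, n_after)  # 0 1

# subset limits the comparison to the chosen columns;
# keep="last" keeps the last occurrence instead of the first
deduped = people.drop_duplicates(subset=["Name", "City"], keep="last")
print(deduped.index.tolist())  # [1, 2]
```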

Data type conversion

A column stored with the wrong data type can distort model training and analysis results, so convert it to an appropriate type.

df = df.dropna()   # drop rows that contain missing values

print(df['Age'].dtype)
df['Age'] = df['Age'].astype(int)     # convert to integer
print(df['Age'].dtype)

# Output:
# float64
# int64
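
`astype(int)` raises an error if the column still contains missing or unparseable values, which is why the rows are dropped first above. As an alternative sketch, assuming you would rather keep those rows, the conversion functions below turn bad entries into missing values instead. The `raw` table is a hypothetical stand-in that mimics the "Unknown" date and the mixed Age values in the sample data.

```python
import pandas as pd

# Hypothetical raw columns containing values that cannot be converted directly
raw = pd.DataFrame({
    "Age": [25, "30", None],
    "Date_Joined": ["2020-01-15", "Unknown", "2019-07-20"],
})

# errors="coerce" turns unparseable entries into NaN / NaT instead of raising
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")
raw["Date_Joined"] = pd.to_datetime(raw["Date_Joined"], errors="coerce")

# The nullable "Int64" dtype stores integers while still allowing missing values
raw["Age"] = raw["Age"].astype("Int64")

print(raw.dtypes)
print(int(raw["Date_Joined"].isna().sum()))  # 1
```

Whether to drop or keep rows with missing values depends on the analysis; this only shows that the type conversion itself does not force the choice.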

Encoding

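Encoding is the process of converting categorical data into numeric data.
Most machine learning models take numeric input, so categorical columns need to be encoded before training.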

# Convert the categorical column into dummy (one-hot) variables
df_encoded = pd.get_dummies(df, columns=['City'])

df_encoded
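
To make the shape of the result concrete, here is a minimal sketch of `pd.get_dummies` on a hypothetical one-column table (`cities` is not the post's DataFrame). It also shows two optional parameters, `dtype` and `drop_first`, that the post does not use.

```python
import pandas as pd

# Hypothetical column with three distinct cities
cities = pd.DataFrame({"City": ["Seoul", "Busan", "Seoul", "Daegu"]})

# One indicator column per category; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(cities, columns=["City"], dtype=int)
print(one_hot.columns.tolist())  # ['City_Busan', 'City_Daegu', 'City_Seoul']
print(one_hot.iloc[0].tolist())  # [0, 0, 1]  (first row is Seoul)

# drop_first=True drops the first category, which is then implied by all zeros
reduced = pd.get_dummies(cities, columns=["City"], dtype=int, drop_first=True)
print(reduced.columns.tolist())  # ['City_Daegu', 'City_Seoul']
```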

from sklearn.preprocessing import LabelEncoder

df["Gender"] = df["Gender"].map({"F": 0, "M": 1})     # mapping with an explicit dictionary

encoder = LabelEncoder()            # label encoding
df["City"] = encoder.fit_transform(df["City"])

df

The result of encoding 'Gender' by mapping and 'City' by label encoding.
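
The sketch below, using a hypothetical list of city names, shows how `LabelEncoder` assigns its integers and how to map them back. Note that the codes follow the sorted order of the labels, so they suggest an ordering (Busan < Daegu < ...) that the cities do not really have; scikit-learn documents `LabelEncoder` as intended for target labels, and one-hot encoding is often the safer choice for nominal input features.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical city labels
cities = ["Seoul", "Busan", "Daegu", "Incheon", "Busan"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# Classes are stored in sorted order, so the codes follow alphabetical order
print(encoder.classes_.tolist())  # ['Busan', 'Daegu', 'Incheon', 'Seoul']
print(codes.tolist())             # [3, 0, 1, 2, 0]

# inverse_transform maps integer codes back to the original labels
print(encoder.inverse_transform([0, 3]).tolist())  # ['Busan', 'Seoul']
```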

Sampling

Sampling is the process of reducing or increasing the size of a dataset by drawing rows from it.

# Draw a 50% sample from the dataset (random_state fixes the result)
sampled_df = df.sample(frac=0.5, random_state=42)

sampled_df

# Draw 2 rows from the dataset (no random_state, so the result differs on each run)
sampled_df_n = df.sample(n=2)

sampled_df_n
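
The two calls above both shrink the dataset. As a sketch of the "increasing" direction, and of sampling per group, the example below uses a hypothetical imbalanced table (`people` is made up for illustration). `replace=True` and `groupby(...).sample(...)` are standard pandas features, but they are not used in the post itself.

```python
import pandas as pd

# Hypothetical imbalanced data: 6 rows of class "M", 2 rows of class "F"
people = pd.DataFrame({
    "Gender": ["M"] * 6 + ["F"] * 2,
    "Salary": [50, 52, 54, 56, 58, 60, 70, 72],
})

# Upsampling: replace=True allows drawing more rows than the table contains
upsampled = people.sample(n=12, replace=True, random_state=42)
print(len(upsampled))  # 12

# Stratified downsampling: draw 50% from each Gender group separately,
# so the class proportions of the original table are preserved
stratified = people.groupby("Gender").sample(frac=0.5, random_state=42)
print(stratified["Gender"].value_counts().to_dict())  # {'M': 3, 'F': 1}
```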

Feature extraction

Selecting the most informative features, or deriving new features from existing columns, can improve model performance.

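from sklearn.feature_selection import SelectKBest, f_classif

# X (numeric feature matrix) and y (class labels) are not defined in this post;
# they are assumed to have been prepared beforehand.
# Feature selection (keep the 3 highest-scoring features)
selector = SelectKBest(score_func=f_classif, k=3)
X_new = selector.fit_transform(X, y)

# Indices of the selected features
selected_features = selector.get_support(indices=True)
print(selected_features)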
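
Since `X` and `y` are not defined in the snippet above, here is a self-contained sketch of the same selection step. It assumes the Iris dataset bundled with scikit-learn as a stand-in for the feature matrix and labels; the post itself does not say which data it was run on.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in data: Iris has 150 samples, 4 numeric features, and 3 classes
iris = load_iris()
X, y = iris.data, iris.target

# Keep the 3 features with the highest ANOVA F-score against the class label
selector = SelectKBest(score_func=f_classif, k=3)
X_new = selector.fit_transform(X, y)

selected = selector.get_support(indices=True)
print(X_new.shape)                                # (150, 3)
print([iris.feature_names[i] for i in selected])  # names of the kept features
print(selector.scores_.round(1))                  # F-score of every feature
```

`f_classif` is meant for classification targets; for a continuous target, `f_regression` from the same module would be the analogous score function.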

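# 'feature1' and 'feature2' are placeholder column names; substitute numeric columns of your own DataFrame
# Add the product of two columns as a new feature
df['new_feature'] = df['feature1'] * df['feature2']

# Add the sum of two columns as a new feature
df['new_feature_sum'] = df['feature1'] + df['feature2']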
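
A runnable version of the same idea, as a minimal sketch: the `orders` table and its column names are hypothetical and simply play the role of `feature1` and `feature2`.

```python
import pandas as pd

# Hypothetical numeric columns standing in for feature1 / feature2
orders = pd.DataFrame({
    "unit_price": [1000, 2500, 400],
    "quantity": [3, 1, 10],
})

# Product of two columns: total amount per order (computed row by row)
orders["total_amount"] = orders["unit_price"] * orders["quantity"]

# Sum of two columns (also element-wise)
orders["price_plus_quantity"] = orders["unit_price"] + orders["quantity"]

print(orders["total_amount"].tolist())         # [3000, 2500, 4000]
print(orders["price_plus_quantity"].tolist())  # [1003, 2501, 410]
```

A product such as price times quantity usually carries a clear meaning, while a sum of columns only makes sense when both are measured in the same unit.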
