[Big Data Analysis Engineer / Practical Exam] Type 2. Machine Learning and Evaluation Metrics


✨️데이터분석가✨️ 2025. 6. 8. 00:01


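Type 2 is the machine learning task. It is a single question, but at 40 points it carries the most weight of the three question types.

You load the provided libraries and data, preprocess the data, train and evaluate a model, and submit the result as a CSV file generated by your code.

The task is fairly difficult, but the overall workflow is largely standardized, so memorizing that workflow is enough to get it right.

Let's walk through the solution stage by stage.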

Type	Questions	Submission	Points
Type 1	3	Preprocessing result	30
Type 2	1	CSV file generated by code	40
Type 3	2	Answer	30
Total	6	-	100

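[Stage 0] Machine learning: supervised learning (classification, regression) / unsupervised learning (clustering, dimensionality reduction) / reinforcement learning
① Define the problem and load the data: classification or regression, target column and expected output, evaluation metric, final file
   (Train is used for training and validation; Test is used for evaluation, i.e. prediction)
② Exploratory data analysis (EDA): check data size, data types, class balance, and so on
③ Data preprocessing: handle missing values and outliers
④ Feature engineering: numeric (min-max scaling, standardization), categorical (label encoding, one-hot encoding)
⑤ Modeling (training, evaluation, etc.): classification models, regression models
⑥ Prediction: predict on the Test data and submit the predictions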

[Stage 1] Loading libraries and data

1) Inspecting the data

import pandas as pd

X_train = pd.read_csv(" ~ ")
y_train = pd.read_csv(" ~ ")
X_test = pd.read_csv(" ~ ")

X_train.head() # first 5 rows
X_train.tail() # last 5 rows
X_train.sample(3) # 3 random rows
X_train.shape # number of rows and columns
X_train.info() # column data types
X_train.describe() # summary statistics for numeric columns
X_train.describe(include='object') # summary statistics for categorical columns
X_train.isnull().sum() # missing values per column
y_train['income'].value_counts() # number of rows per label
X_train.corr(numeric_only=True) # correlations between numeric columns

train = pd.read_csv(" ~ ")
test = pd.read_csv(" ~ ")
train.shape, test.shape # check data sizes

top = train['income'] == ">50K" # high income (this mask is used below)
bottom = train['income'] == "<=50K" # low income
female = train['sex'] == "Female"
male = train['sex'] == "Male"

print( len(train[male]), len(train[female]) ) # number of men and women
print( len(train[male&top]), len(train[male&bottom]) ) # men with high / low income
print( len(train[male&top])/len(train[male]), len(train[male&bottom])/len(train[male]) ) # share of men with high / low income

19578 9726
5976 13602
0.3052405761569108 0.6947594238430892

2) Combining the data

(29304, 16) (29304, 15) (29304, 2)

(29304, 16)

3) Splitting the data

((29304, 15), (29304, 2))
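Only the printed shapes are shown for steps 2) and 3), not the code. The sketch below is one way to get shapes of that kind. It assumes the features and the label share an `id` column (the 15-column and 2-column shapes suggest this, but it is an assumption), and the tiny data frames are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train / y_train (both carry an `id` column)
X_train = pd.DataFrame({"id": [1, 2, 3], "age": [34, 58, 25], "sex": ["Male", "Female", "Male"]})
y_train = pd.DataFrame({"id": [1, 2, 3], "income": ["<=50K", ">50K", "<=50K"]})

# 2) Combine: join features and label on the shared key
df = pd.merge(X_train, y_train, on="id")
print(df.shape, X_train.shape, y_train.shape)

# 3) Split again: features without the label, label together with its id
X = df.drop(columns=["income"])
y = df[["id", "income"]]
print(X.shape, y.shape)
```

Combining first is handy when you drop rows (duplicates, outliers), because the label rows are removed together with the feature rows.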

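[Stage 2] Data preprocessing

Never delete rows from the test data.
If the train data is large, it is fine to delete a small number of rows.
Columns may be dropped from or added to test and train, as long as the number and names of the columns match between the two.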

1) Handling duplicates and missing values

X_train['workclass'].value_counts() # distribution of workclass values
df = X_train.dropna() # drop every row that has a missing value
df = X_train.dropna(subset=['workclass', 'native.country']) # drop rows with missing values in specific columns
df = X_train.dropna(axis=1) # drop every column that has a missing value (axis=1 means columns)
df = X_train.drop(['workclass'], axis=1) # drop a specific column
df = X_train.drop_duplicates() # remove duplicate rows
df = X_train.drop_duplicates(subset=['workclass', 'native.country'], keep='last') # remove duplicates judged on specific columns only (keep='last' keeps the last occurrence)

X_train['workclass'] = X_train['workclass'].fillna(X_train['workclass'].mode()[0]) # fill missing values with the mode
X_train['occupation'] = X_train['occupation'].fillna('X') # fill missing values with a fixed value
X_train['age'] = X_train['age'].fillna(int(X_train['age'].mean())) # fill missing values with the mean (truncated to an integer)

2) Removing outliers and saving the result

(29304, 15)
(29301, 15)

3) Counting outliers per column

Outliers in age: 121
Outliers in fnlwgt: 892
Outliers in education.num: 1077
Outliers in capital.gain: 2459
Outliers in capital.loss: 1359
Outliers in hours.per.week: 8104
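The code that produced these counts is not shown. A common way to count outliers per column is the 1.5 × IQR rule, sketched below. Whether the original used this exact rule is an assumption, and the helper `count_iqr_outliers` and the toy data are hypothetical.

```python
import pandas as pd

def count_iqr_outliers(df, cols):
    """Return {column: number of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]}."""
    counts = {}
    for col in cols:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        counts[col] = int(((df[col] < lower) | (df[col] > upper)).sum())
    return counts

# Hypothetical toy data: one extreme age, constant working hours
toy = pd.DataFrame({"age": [20, 22, 23, 24, 25, 26, 27, 90], "hours": [40] * 8})
outliers = count_iqr_outliers(toy, ["age", "hours"])
for col, n in outliers.items():
    print(f"Outliers in {col}: {n}")

# Removing rows works with the same kind of boolean mask (train data only)
q1, q3 = toy["age"].quantile(0.25), toy["age"].quantile(0.75)
cleaned = toy[toy["age"] <= q3 + 1.5 * (q3 - q1)]
print(toy.shape, cleaned.shape)
```

Remember the rule from the start of this stage: apply this kind of row removal to train only, never to test.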

[Stage 3] Feature engineering

1) Min-Max scaler (rescales the train values to the 0 to 1 range)

obj_train = X_train.select_dtypes(include='object').copy() # keep only the object columns
num_train = X_train.select_dtypes(exclude='object').copy() # keep the non-object columns
obj_test = X_test.select_dtypes(include='object').copy()
num_test = X_test.select_dtypes(exclude='object').copy()
cols = ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week'] # numeric columns to scale

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
print( num_train.head(2) )
num_train[cols] = scaler.fit_transform(num_train[cols]) # fit on train, then transform
num_test[cols] = scaler.transform(num_test[cols]) # transform only, reusing the fit from train
print( num_train.head(2) ) # data after Min-Max scaling

      id   age  fnlwgt  education.num  capital.gain  capital.loss  \
0   3331  34.0  177331             10          4386             0   
1  19749  58.0  290661              9             0             0   

   hours.per.week  
0            40.0  
1            40.0  
      id     age    fnlwgt  education.num  capital.gain  capital.loss  \
0   3331  0.5625  0.112092       0.600000       0.04386           0.0   
1  19749  0.7500  0.189060       0.533333       0.00000           0.0   

   hours.per.week  
0        0.397959  
1        0.397959

2) Standardization (Z-score normalization: rescales each column to mean 0 and standard deviation 1)

      id     age    fnlwgt  education.num  capital.gain  capital.loss  \
0   3331  0.5625  0.112092       0.600000       0.04386           0.0   
1  19749  0.7500  0.189060       0.533333       0.00000           0.0   

   hours.per.week  
0        0.397959  
1        0.397959  
      id       age    fnlwgt  education.num  capital.gain  capital.loss  \
0   3331 -0.334145 -0.117678      -0.031447      0.440284     -0.216045   
1  19749  1.427220  0.956304      -0.420434     -0.146290     -0.216045   

   hours.per.week  
0       -0.035227  
1       -0.035227
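The standardization code itself is missing here. A minimal sketch with scikit-learn's `StandardScaler` follows the same fit-on-train, transform-on-test pattern as the Min-Max example. The two small data frames are made-up stand-ins for `num_train` and `num_test`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for num_train / num_test
num_train = pd.DataFrame({"age": [34.0, 58.0, 25.0, 41.0],
                          "hours.per.week": [40.0, 40.0, 20.0, 60.0]})
num_test = pd.DataFrame({"age": [30.0, 50.0], "hours.per.week": [45.0, 35.0]})
cols = ["age", "hours.per.week"]

scaler = StandardScaler()
num_train[cols] = scaler.fit_transform(num_train[cols])  # learn mean/std from train, then transform
num_test[cols] = scaler.transform(num_test[cols])        # reuse the train mean/std
print(num_train.round(3))
```

After the transform each train column has mean 0 and (population) standard deviation 1. The test columns generally do not, because they are scaled with the train statistics.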

3) Robust scaling (uses the median and quartiles to minimize the influence of outliers)

      id       age    fnlwgt  education.num  capital.gain  capital.loss  \
0   3331 -0.334145 -0.117678      -0.031447      0.440284     -0.216045   
1  19749  1.427220  0.956304      -0.420434     -0.146290     -0.216045   

   hours.per.week  
0       -0.035227  
1       -0.035227  
      id   age    fnlwgt  education.num  capital.gain  capital.loss  \
0   3331 -0.15 -0.008765       0.000000      0.586575           0.0   
1  19749  1.05  0.941358      -0.333333      0.000000           0.0   

   hours.per.week  
0             0.0  
1             0.0
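As with standardization, only the output is shown. The sketch below uses scikit-learn's `RobustScaler`, which subtracts the median and divides by the interquartile range. The data is hypothetical and deliberately contains one extreme `capital.gain` value.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data with one extreme capital.gain value
num_train = pd.DataFrame({"age": [25.0, 34.0, 37.0, 41.0, 58.0],
                          "capital.gain": [0.0, 0.0, 100.0, 200.0, 99999.0]})
num_test = pd.DataFrame({"age": [30.0], "capital.gain": [150.0]})
cols = ["age", "capital.gain"]

scaler = RobustScaler()
num_train[cols] = scaler.fit_transform(num_train[cols])  # (x - median) / IQR, learned from train
num_test[cols] = scaler.transform(num_test[cols])        # reuse the train median and IQR
print(num_train)
```

The extreme value stays extreme after scaling, but it does not distort the center and spread used for the other rows, which is the point of using the median and quartiles.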

4) Log transform

import numpy as np
print(X_train['fnlwgt'][:3])
print(np.log1p(X_train['fnlwgt'])[:3]) # apply log1p, i.e. log(1 + x)
print(np.exp(np.log1p(X_train['fnlwgt']))) # np.exp gives x + 1 (see the output below); use np.expm1 to recover x exactly

0    177331
1    290661
2    125933
Name: fnlwgt, dtype: int64
0    12.085779
1    12.579916
2    11.743513
Name: fnlwgt, dtype: float64
0        177332.0
1        290662.0
2        125934.0
3        100314.0
4        195662.0
           ...   
29299     47169.0
29300    231794.0
29301    201436.0
29302    137723.0
29303    406979.0
Name: fnlwgt, Length: 29304, dtype: float64
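The values printed last are each 1 larger than the originals because `np.exp` undoes only the logarithm, not the `+ 1` inside `log1p`. A short check with the first three `fnlwgt` values from the output above:

```python
import numpy as np

x = np.array([177331, 290661, 125933], dtype=float)  # first three fnlwgt values shown above

logged = np.log1p(x)         # log(1 + x)
restored = np.expm1(logged)  # exp(v) - 1, the exact inverse of log1p
off_by_one = np.exp(logged)  # exp(v) alone gives 1 + x

print(restored)
print(off_by_one)
```

So when a target was trained in `log1p` space, `np.expm1` is the matching function for converting predictions back.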

5-1) Label encoding: replaces each category with an integer label (e.g., apple=1, pear=2)

cols = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country'] # categorical (object) columns

from sklearn.preprocessing import LabelEncoder
for col in cols:
    le = LabelEncoder()
    obj_train[col] = le.fit_transform(obj_train[col]) # fit on train, then transform
    obj_test[col] = le.transform(obj_test[col]) # transform only; raises an error if test has a category not seen in train
obj_train.head()

5-2) One-hot encoding: creates one column per category, filled with 0 (no) or 1 (yes) (e.g., apple column = 0, 1, 0, 0)

obj_train.head()
obj_train = pd.get_dummies(obj_train[cols])
obj_test = pd.get_dummies(obj_test[cols])
obj_train.head()
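One caveat with encoding train and test separately: if a category appears in only one of them, the two frames end up with different columns, which breaks the rule that column names must match. A sketch of one possible fix is to reindex the test frame to the train columns. The toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: "Asian" appears only in test, "Female" and "Black" only in train
obj_train = pd.DataFrame({"sex": ["Male", "Female"], "race": ["White", "Black"]})
obj_test = pd.DataFrame({"sex": ["Male", "Male"], "race": ["Asian", "White"]})

train_enc = pd.get_dummies(obj_train)
test_enc = pd.get_dummies(obj_test)
print(train_enc.shape, test_enc.shape)  # the column counts differ

# Align test to train: columns missing in test are added as 0, extra columns are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

Stacking train and test before encoding achieves the same alignment in a single step.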

6) Combining the data

cols = list(X_train.columns[X_train.dtypes == object]) # names of the string columns
print(X_train.shape, X_test.shape)
all_df = pd.concat([X_train, X_test]) # stack the two frames vertically (axis=0, the default)
all_df = pd.get_dummies(all_df[cols]) # one-hot encode (only the categorical columns are kept)
line = int(X_train.shape[0]) # row index where train ends and test begins
print(line)
X_train = all_df.iloc[:line,:].copy() # take back the train rows
X_test = all_df.iloc[line:,:].copy() # take back the test rows
print(X_train.shape, X_test.shape)

(29304, 15) (3257, 15)
29304
(29304, 99) (3257, 99)

[Stage 4] Modeling and evaluation (classification)

0) Data preprocessing

cols =['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week'] # numeric columns
print( X_train[cols].isnull().sum() ) # check missing values
X_train = X_train.fillna(0) # fill missing values with 0
X_test = X_test.fillna(0) # fill missing values with 0
print( X_train[cols].isnull().sum() ) # confirm no missing values remain
y = (y_train['income'] == '>50K').astype(int) # <=50K → 0, >50K → 1: build a True/False mask, then cast to int

1) Random forest

from sklearn.ensemble import RandomForestClassifier # import the model
rf = RandomForestClassifier()
rf.fit(X_train[cols], y) # train
pred = rf.predict(X_test[cols]) # predict
submit = pd.DataFrame({'id': X_test['id'], 'income': pred})
submit.to_csv("11111.csv", index=False) # write the CSV (index=False leaves the row index out of the file)

2) Accuracy

0.8090267116978815

3) Splitting into training and validation data

((26373, 15), (2931, 15), (26373,), (2931,))

4) Decision tree

0.6993277847701769

5) Random forest

0.8475316831271521

6) XGBoost

0.8859038904196574
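Steps 2) to 6) list only the scores. The sketch below shows the general pattern on synthetic data: hold out 10% for validation (matching the 26373 / 2931 split above), fit each model, and score the predicted probabilities with ROC-AUC. The choice of ROC-AUC for steps 4) to 6), the `random_state` values, and the synthetic data are assumptions. XGBoost is left out so the sketch needs only scikit-learn; `xgboost.XGBClassifier` offers the same `fit` / `predict_proba` interface.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for X_train[cols] and y
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

# 3) Hold out 10% of the train data for validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=2022)
print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=2022),
    "RandomForest": RandomForestClassifier(random_state=2022),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_val)[:, 1]           # probability of the positive class
    scores[name] = roc_auc_score(y_val, proba)         # ranking quality of the probabilities
    acc = accuracy_score(y_val, model.predict(X_val))  # share of correct hard predictions
    print(name, "roc_auc:", round(scores[name], 4), "accuracy:", round(acc, 4))
```

Whichever model scores best on the validation set is the one to use for the final prediction on the test data.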

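7) Saving the predicted probability of >50K for the test data as a CSV

from sklearn.metrics import roc_auc_score
pred = xgb.predict_proba(X_val[cols]) # xgb is the classifier fitted in step 6); score it on the validation set first
roc_auc_score(y_val, pred[:,1])
pred = xgb.predict_proba(X_test[cols]) # class probabilities for the test data
submit = pd.DataFrame({'id': X_test['id'], 'income': pred[:,1]}) # column 1 is the probability of >50K
submit.to_csv("22222.csv", index=False)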

[Stage 5] Modeling and evaluation (regression)

0) Data preprocessing

# One-hot encoding
cols = train.select_dtypes(include='object').columns
train = pd.get_dummies(train, columns=cols)
test = pd.get_dummies(test, columns=cols)

# Hold out validation data
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train.drop('charges',axis=1), train['charges'], test_size=0.15, random_state = 2022)
X_tr.shape, X_val.shape, y_tr.shape, y_val.shape

# Evaluation metric
from sklearn.metrics import mean_squared_error
import numpy as np
def rmse(y_test, pred): # actual values, predicted values
    return np.sqrt(mean_squared_error(y_test, pred))

1) LinearRegression → result: 5888

5888.05802236533

2) StandardScaler / LinearRegression → result: 5888 (no difference)

5888.05802236533

3) RandomForestRegressor → result: 4671

4671.528967705076

4) StandardScaler / RandomForestRegressor → result: 4732 (worse)

4732.311401989539

5) MinMaxScaler / RandomForestRegressor → result: 4672 (worse)

4672.464232356961

6) MinMaxScaler on bmi only / RandomForestRegressor → result: 4712 (worse)

4712.997771609883
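The code for steps 1) to 6) is not included, so here is a sketch of how such a comparison can be set up. It runs on synthetic data, not the insurance data, so the numbers will differ. The `candidates` dictionary and the use of `make_pipeline` are illustrative choices, not necessarily what produced the results above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def rmse(y_true, pred):
    """Root mean squared error between actual and predicted values."""
    return np.sqrt(mean_squared_error(y_true, pred))

# Hypothetical synthetic data standing in for the train features and `charges`
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.15, random_state=2022)

candidates = {
    "LinearRegression": LinearRegression(),
    "StandardScaler + LinearRegression": make_pipeline(StandardScaler(), LinearRegression()),
    "RandomForestRegressor": RandomForestRegressor(random_state=2022),
}
results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)                              # the pipeline fits the scaler on X_tr only
    results[name] = rmse(y_val, model.predict(X_val))
    print(name, round(results[name], 2))
```

The two linear models give the same RMSE, which matches step 2): ordinary least squares makes the same predictions after a linear rescaling of the features. Tree-based models are also largely insensitive to feature scaling, so differences like those in steps 4) to 6) can come from the forest's randomness when no `random_state` is fixed.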

7) Log transform / RandomForestRegressor → result: 4605 (best)

import numpy as np
train['charges'] = np.log1p(train['charges']) # train on log(1 + charges)

# Hold out validation data
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train.drop('charges',axis=1), train['charges'], test_size=0.15, random_state = 2022)
X_tr.shape, X_val.shape, y_tr.shape, y_val.shape

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_tr, y_tr)
pred = model.predict(X_val)
rmse(np.expm1(y_val), np.expm1(pred)) # np.expm1 is the inverse of np.log1p; the RMSE equals the one from np.exp because the +1 offsets cancel

4605.62579084553

8) XGBRegressor → result: 4903 (worse than every RandomForestRegressor variant)

4903.671366916808

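9) Creating the CSV file

pred = model.predict(test) # predict on the test data
# If the model was trained on log1p(charges) as in step 7), convert back first: pred = np.expm1(pred)
submit = pd.DataFrame({'id': test['id'], 'charges': pred})
submit.to_csv('33333.csv', index=False) # write the CSV file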
