[kakao x goorm] Decision Trees and the Gini-Based Splitting Algorithm

Hoonia 2025. 4. 23. 14:01

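Today I studied the decision tree, one of the best-known supervised learning algorithms in machine learning. Decision trees are used for both classification and regression, and they have a tree structure that branches on the features of the data until it reaches a predicted value.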

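What is a decision tree?

A model that predicts an outcome from data through a series of conditional branches such as Yes/No questions
It reaches its result logically through conditions, without complex formulas
Because it is a rule-based model that people can follow easily, it is highly interpretable
It applies to regression problems as well as classification problems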

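Main components of a decision tree

Component	Description
Root Node	The starting point of the tree; it holds the whole dataset and the best split found on it
Internal Node	A node in the middle of the tree that splits the data according to a condition
Leaf Node	The point where the final prediction is produced; it is not split any further
Branch	A path from a node to one of its children, followed according to the outcome of the condition
Depth	The number of levels from the root node down to a leaf node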

Classification tree vs. regression tree

Tree type	Target variable type	Purpose
Classification Tree	Categorical (e.g. Yes/No)	Predicts a class (category)
Regression Tree	Continuous (e.g. price, temperature)	Predicts a numeric value

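Split Criterion

1. Gini Impurity

A measure of the impurity (degree of mixing) within a node
A value of 0 means a perfectly pure node (only one class is present)
The larger the value, the more the classes are mixed (with two classes the maximum is 0.5)
The decision tree chooses the split that lowers the Gini impurity the most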

Gini formula

$$
Gini = 1 - \sum_{i=1}^{C} p_i^2
$$

C: the number of classes
p_i: the proportion of samples in the node that belong to class i
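
To check the formula by hand, here is a minimal sketch in plain Python. The function name `gini_impurity` and the toy label lists are my own illustration, not something from the course material.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen for this sketch: an empty node counts as pure
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

# Toy labels (hypothetical): one pure node and one evenly mixed node.
pure = gini_impurity(["yes", "yes", "yes", "yes"])
mixed = gini_impurity(["yes", "yes", "no", "no"])
print(pure, mixed)  # 0.0 0.5
```

The pure node scores 0 and the 50/50 node scores 0.5, which matches the two-class range described above.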

2. Information Gain

A split criterion based on entropy
Compute the difference in entropy before and after the split, and choose the criterion that gives the largest value

Entropy formula

$$
Entropy = -\sum_{i=1}^{C} p_i \log_2(p_i)
$$

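p_i: the proportion of samples in the node that belong to class i
Entropy represents impurity, or the degree of disorder in the information

Computing information gain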

$$
IG = Entropy_{\text{before}} - Entropy_{\text{after}}
$$

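The entropy before the split minus the entropy after the split, where the entropy after the split is the average of the child nodes' entropies weighted by their sample counts
The larger the value, the better the split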
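
The two formulas can be tried out with a short sketch. It assumes the "after" entropy is the sample-weighted average over the child nodes; the helper names `entropy` and `information_gain` and the toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the sample-weighted entropy of the child nodes."""
    n = len(parent)
    after = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - after

# Toy node (hypothetical): 4 "yes" and 4 "no".
parent = ["yes"] * 4 + ["no"] * 4
perfect = information_gain(parent, [["yes"] * 4, ["no"] * 4])
useless = information_gain(parent, [["yes", "yes", "no", "no"],
                                    ["yes", "yes", "no", "no"]])
print(perfect, useless)
```

A split that separates the classes completely recovers the full 1 bit of parent entropy, while a split whose children are as mixed as the parent gains nothing.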

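Decision tree analysis process

Growing
- Split the data step by step to expand the tree
- At each split, find the best criterion to increase purity
Pruning
- Remove unnecessary branches to prevent overfitting
- Use validation data to remove splits that hurt generalization performance
Validation
- Check the tree's performance with test data, gain charts, risk charts, and so on
Interpretation and prediction
- Use the final tree to make predictions on new data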
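
As a rough illustration of the growing and pruning steps, the sketch below grows a full tree and then picks a pruning strength on a validation split. It uses scikit-learn's cost-complexity pruning (`ccp_alpha`), which is one way to prune and not necessarily the method meant in the lecture. The dataset is synthetic (`make_classification`), and all variable names are my own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for this sketch, not the marketing dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Growing: fit an unrestricted tree using the Gini criterion.
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_tr, y_tr)

# Pruning: try each candidate alpha and keep the one that scores best on validation data.
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)
best_alpha, best_score, best_tree = 0.0, -1.0, None
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(
        criterion="gini", ccp_alpha=alpha, random_state=0
    ).fit(X_tr, y_tr)
    score = tree.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score, best_tree = alpha, score, tree

print(f"leaves: {full_tree.get_n_leaves()} -> {best_tree.get_n_leaves()}, "
      f"validation accuracy: {best_score:.4f}")
```

The final check on a separate test set (the validation step in the list above) is left out here to keep the sketch short.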

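How the Gini-based splitting algorithm works

Sort the feature
- For a continuous variable, sort the values in ascending order
Generate candidate thresholds
- Take the mean of each pair of adjacent values as a candidate split point
Compute Gini impurity
- Split the data at each threshold and compute the impurity of each side
Compute the weighted average
- Weight each child's impurity by its share of the samples to score the split as a whole
Choose the final split
- Select the feature and threshold combination that yields the lowest weighted Gini impurity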
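
The five steps above can be sketched for a single continuous feature as follows. This is a simplified illustration, not scikit-learn's actual implementation; the function `best_threshold` and the `ages`/`clicked` toy data are hypothetical. A full tree would repeat this search over every feature and keep the best result.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split on one feature."""
    pairs = sorted(zip(values, labels))            # 1. sort by feature value
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                               # identical values cannot be separated
        threshold = (lo + hi) / 2                  # 2. midpoint of adjacent values
        left = [y for _, y in pairs[:i]]           # 3. split and measure each side
        right = [y for _, y in pairs[i:]]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)  # 4.
        if weighted < best[1]:                     # 5. keep the lowest impurity
            best = (threshold, weighted)
    return best

# Toy data (hypothetical): older users click, younger users do not.
ages = [22, 25, 30, 35, 41, 52]
clicked = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(ages, clicked)
print(threshold, impurity)  # 32.5 0.0
```

The midpoint between 30 and 35 separates the two classes completely, so the weighted impurity drops to 0.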

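How it works (example: tennis match prediction)

Daily Mission

✅ [Step 1] Data preprocessing
Handle categorical variables (e.g. Gender)
Split into training and test data (using train_test_split)
Feature scaling with StandardScaler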

import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

os.chdir("./")
df = pd.read_csv("marketing_click_prediction_cleaned.csv")

df.head()

df = pd.get_dummies(df, columns=["Gender"])  # creates the Gender_Female and Gender_Male columns

X = df.drop(columns=["Clicked"])
y = df["Clicked"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

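# Full dataset scaled for the cross-validation steps below.
# Note: this refits scaler on all rows, so fold statistics leak into the scaling.
X_scaled = scaler.fit_transform(X)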

✅ [Step 2] Comparing baseline models
Train LogisticRegression, KNeighborsClassifier, and RandomForestClassifier.
For each model, compute the confusion matrix, accuracy, precision, recall, and f1-score.
Print classification_report and compare the results.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score,
    recall_score, f1_score, classification_report, ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt

# Define the models
logi = LogisticRegression(max_iter=1000)
forest = RandomForestClassifier()
knn = KNeighborsClassifier()

# Train and predict
logi.fit(X_train_scaled, y_train)
y_pred_logi = logi.predict(X_test_scaled)

forest.fit(X_train_scaled, y_train)
y_pred_forest = forest.predict(X_test_scaled)

knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)

def plot_confusion(title, y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    disp = ConfusionMatrixDisplay(cm, display_labels=["Not Clicked", "Clicked"])
    disp.plot()
    plt.title(title)
    plt.show()

plot_confusion("Logistic Regression", y_test, y_pred_logi)
plot_confusion("Random Forest", y_test, y_pred_forest)
plot_confusion("KNN", y_test, y_pred_knn)

# NOTE: this block refits the models on the unscaled features and overwrites
# logi, forest, y_pred_forest, and y_pred_knn. y_pred_logi still comes from the
# scaled run above, so the reports below mix scaled and unscaled results.
logi = LogisticRegression()
logi.fit(X_train, y_train)
y_pred = logi.predict(X_test)  # not used by the reports below

forest = RandomForestClassifier()
forest.fit(X_train, y_train)
y_pred_forest = forest.predict(X_test)

Kneighbors = KNeighborsClassifier()
Kneighbors.fit(X_train, y_train)
y_pred_knn = Kneighbors.predict(X_test)

def report(name, y_true, y_pred):
    print(f"{name}")
    print("Accuracy: ", accuracy_score(y_true, y_pred))
    print("Precision: ", precision_score(y_true, y_pred))
    print("Recall: ", recall_score(y_true, y_pred))
    print("F1-score: ", f1_score(y_true, y_pred))
    print("Classification Report:")
    print(classification_report(y_true, y_pred))
    print("---" * 20)

report("Logistic Regression", y_test, y_pred_logi)
report("Random Forest", y_test, y_pred_forest)
report("KNN", y_test, y_pred_knn)

"""
Logistic Regression
Accuracy:  0.975
Precision:  0.9655172413793104
Recall:  0.9655172413793104
F1-score:  0.9655172413793104
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98        51
           1       0.97      0.97      0.97        29

    accuracy                           0.97        80
   macro avg       0.97      0.97      0.97        80
weighted avg       0.97      0.97      0.97        80

------------------------------------------------------------
Random Forest
Accuracy:  0.95
Precision:  0.9032258064516129
Recall:  0.9655172413793104
F1-score:  0.9333333333333333
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.94      0.96        51
...
   macro avg       0.54      0.54      0.54        80
weighted avg       0.58      0.59      0.58        80

------------------------------------------------------------
"""

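The confusion matrix plots are not reproduced in this post, but the Logistic Regression numbers above (supports of 51 and 29, precision and recall both equal to 28/29) are consistent with the counts TN=50, FP=1, FN=1, TP=28. Treat those counts as my inference from the report, not as logged output. With that assumption, the metrics can be recomputed by hand:

```python
# Counts inferred from the Logistic Regression report (an assumption, not logged output).
tn, fp, fn, tp = 50, 1, 1, 28

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of predicted clicks, how many really clicked
recall = tp / (tp + fn)                      # of real clicks, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

In an ad-targeting setting, low precision means money spent on users who do not click, and low recall means likely clickers who never see the ad.
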
✅ [Step 3] Cross-validation
Run 5-fold cross-validation on the three models above using cross_val_score.
Print the mean and standard deviation of the scores and compare them.

from sklearn.model_selection import cross_val_score
import numpy as np

def evaluate_cv(name, model, X, y):
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name} - Mean Accuracy: {scores.mean():.4f}, Std Dev: {scores.std():.4f}")

evaluate_cv("Logistic Regression", logi, X_scaled, y)
evaluate_cv("Random Forest", forest, X_scaled, y)
evaluate_cv("KNN", knn, X_scaled, y)

"""
Logistic Regression - Mean Accuracy: 0.9375, Std Dev: 0.0224
Random Forest - Mean Accuracy: 0.9375, Std Dev: 0.0326
KNN - Mean Accuracy: 0.9350, Std Dev: 0.0255
"""

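One thing to watch in the cross-validation above: X_scaled was standardized on the full dataset before it was split into folds, so each validation fold influenced the scaling. A common way around this is to put the scaler and the model in one pipeline so the scaler is refit on the training part of every fold. The sketch below shows the idea on synthetic data (`make_classification`), since the marketing CSV is not included here; `X_demo`, `y_demo`, and `pipe` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels (an assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)

# The scaler is fit inside each training fold only, then applied to that fold's held-out part.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Mean accuracy: {scores.mean():.4f}, Std dev: {scores.std():.4f}")
```

On a dataset this small the difference is usually minor, but the pipeline version keeps the evaluation honest by construction.
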
✅ [Step 4] Stratified K-Fold
Train the logistic regression model using StratifiedKFold.
Record the performance (accuracy) of each fold and print the average.
Describe the pros and cons compared with plain KFold.

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = []

for fold, (train_idx, test_idx) in enumerate(skf.split(X_scaled, y), 1):
    X_train_fold, X_test_fold = X_scaled[train_idx], X_scaled[test_idx]
    y_train_fold, y_test_fold = y.iloc[train_idx], y.iloc[test_idx]
    
    logi.fit(X_train_fold, y_train_fold)
    y_pred_fold = logi.predict(X_test_fold)
    
    accuracy = accuracy_score(y_test_fold, y_pred_fold)
    fold_accuracies.append(accuracy)
    
    print(f"Fold {fold}: Accuracy = {accuracy:.4f}")

print(f"Average Accuracy: {sum(fold_accuracies) / len(fold_accuracies):.4f}")


"""
Fold 1: Accuracy = 1.0000
Fold 2: Accuracy = 0.9500
Fold 3: Accuracy = 0.9250
Fold 4: Accuracy = 0.8750
Fold 5: Accuracy = 0.9250
Average Accuracy: 0.9350

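Plain K-Fold simply splits the data into equal-sized folds without looking at the labels,
whereas Stratified K-Fold keeps the class ratio in every fold the same as in the full dataset,
which allows a more reliable performance estimate even on imbalanced data.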
"""

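To see the difference in class ratios directly, here is a small sketch with made-up labels (80 negatives followed by 20 positives). Shuffling is left off on purpose so the contrast is as sharp as possible; the code above used shuffle=True, which softens the problem for plain K-Fold but does not guarantee equal ratios. All names here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up imbalanced labels, sorted by class (worst case for unshuffled K-Fold).
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for this comparison

def positive_rates(splitter):
    """Share of positive labels in each test fold."""
    return [float(y_demo[test_idx].mean())
            for _, test_idx in splitter.split(X_demo, y_demo)]

plain = positive_rates(KFold(n_splits=5))
stratified = positive_rates(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With plain K-Fold the first four test folds contain no positives and the last one contains only positives, while every stratified fold keeps the overall 20% positive rate.
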
✅ [Step 5] Hyperparameter tuning (GridSearchCV)
Tune the following hyperparameters of RandomForestClassifier:
n_estimators: [50, 100]
max_depth: [None, 5, 10]
Use GridSearchCV to find the best combination and print the final accuracy.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5, 10]
}

grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print(f"Best Parameters: {best_params}")
print(f"Best Score: {best_score:.4f}")

"""
Best Parameters: {'max_depth': 5, 'n_estimators': 100}
Best Score: 0.9313
"""

✅ [Step 6] Final model and report
Predict on the test data with the tuned best model and compute the final confusion_matrix and roc_auc_score.
Visualize the performance with plots (including the ROC curve).
Describe what each metric means in a real service.

from sklearn.metrics import roc_auc_score, roc_curve, ConfusionMatrixDisplay

best_model = grid_search.best_estimator_

y_pred_best = best_model.predict(X_test_scaled)
y_pred_best_proba = best_model.predict_proba(X_test_scaled)[:, 1]

cm = confusion_matrix(y_test, y_pred_best)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Not Clicked", "Clicked"])
disp.plot()
plt.title("Confusion Matrix - Best Tuned Model")
plt.show()

roc_auc = roc_auc_score(y_test, y_pred_best_proba)
print(f"ROC AUC Score: {roc_auc:.4f}")

fpr, tpr, thresholds = roc_curve(y_test, y_pred_best_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {roc_auc:.4f})")
plt.plot([0, 1], [0, 1], 'k--', label="Random Guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Best Tuned Model")
plt.legend()
plt.grid()
plt.show()

# ROC AUC Score: 0.9959

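The final tuned model reached a ROC AUC of about 0.996, which is close to perfect classification on this 80-row test set.
According to the confusion matrix, it also correctly caught more than 96.3% of the users who actually clicked, which suggests that accurate targeting could keep wasted ad spend to a minimum.
This level of performance could be applied directly to ad target selection, budget allocation, and click optimization in a real marketing automation system.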

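Today's retrospective

Decision trees turned out to be a powerful tool for analyzing data intuitively without heavy math. I found it especially interesting that Gini impurity and information gain give a numerical measure of how cleanly a split separates the data. Before moving on to ensemble models (random forest, gradient boosting, and so on), it seems important to get these fundamentals down firmly.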