Python을 이용한 Decision Tree (2)

DT Assignment2

Entropy를 구하고, 각 변수에 대한 Gain을 구하는 함수를 구현하는 과제입니다.

DT_Assignment2.ipynb 파일에 있는 두가지 함수를 만들어 주시면 됩니다. 결과는 주어져 있습니다.

두번째 함수는 출력값이 꼭 주어진 형태와 일치할 필요는 없습니다. 봤을 때 각 변수에 대한 Gain을 알아볼 수 있도록 구성해 주세요.

마찬가지로 주석 꼼꼼히 달아주세요!

우수과제 선정이유

함수마다 잘 작동하는지 확인하는 과정을 거치는 것이 인상적이였습니다.

또한 1번 과제의 3번 문제를 해결할 때 데이터 프레임을 나누는 함수를 만들고 문제에서 요구하는 것보다 한층 더 깊게 살펴보는 모습이 굉장히 좋았습니다.

import pandas as pd 
import numpy as np
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('https://raw.githubusercontent.com/AugustLONG/ML01/master/01decisiontree/AllElectronics.csv')
df.drop("RID",axis=1, inplace = True) #RID는 그냥 Index라서 삭제

In [117]:

df

Out[117]:

age

income

student

credit_rating

class_buys_computer

youth

high

fair

youth

high

excellent

middle_aged

high

fair

yes

senior

medium

fair

yes

senior

low

yes

fair

yes

senior

low

yes

excellent

middle_aged

low

yes

excellent

yes

youth

medium

fair

youth

low

yes

fair

yes

senior

medium

yes

fair

yes

youth

medium

yes

excellent

yes

middle_aged

medium

excellent

yes

middle_aged

high

yes

fair

yes

senior

medium

excellent

함수 만들기

In [118]:

from functools import reduce

def getEntropy(df, feature) :
    D_len = df[feature].count() # 데이터 전체 길이
    # reduce함수를 이용하여 초기값 0에 
    # 각 feature별 count을 엔트로피 식에 대입한 값을 순차적으로 더함
    return reduce(lambda x, y: x+(-(y[1]/D_len) * np.log2(y[1]/D_len)), \
                  df[feature].value_counts().items(), 0)

In [119]:

getEntropy(df, "class_buys_computer")

Out[119]:

0.9402859586706311

In [5]:

# 정답
getEntropy(df, "class_buys_computer")

Out[5]:

0.9402859586706311

In [159]:

def get_target_true_count(col, name, target, true_val, df=df):
    """
    df[col]==name인 조건에서 Target이 참인 경우의 갯수를 반환
    """
    return df.groupby([col,target]).size()[name][true_val]

def NoNan(x):
    """
    Nan의 경우 0을 반환
    """
    return np.nan_to_num(x)

def getGainA(df, feature) :
    info_D = getEntropy(df, feature) # 목표변수 Feature에 대한 Info(Entropy)를 구한다.
    columns = list(df.loc[:, df.columns != feature]) # 목표변수를 제외한 나머지 설명변수들을 리스트 형태로 저장한다.
    gains = []
    D_len = df.shape[0] # 전체 길이
    for col in columns:
        info_A = 0
        # Col내 개별 Class 이름(c_name)과 Class별 갯수(c_len)
        for c_name, c_len in df[col].value_counts().items():
            target_true = get_target_true_count(col, c_name, feature, 'yes') 
            prob_t = target_true / c_len
            # Info_A <- |Dj|/|D| *  Entropy(label) | NoNan을 이용해 prob_t가 0인 경우 nan이 나와 생기는 오류 방지
            info_A += (c_len/D_len) * -(NoNan(prob_t*np.log2(prob_t)) + NoNan((1 - prob_t)*np.log2(1 - prob_t)))
        gains.append(info_D - info_A)
    
    result = dict(zip(columns,gains)) # 각 변수에 대한 Information Gain 을 Dictionary 형태로 저장한다.
    return(result)

In [161]:

df.groupby(['age','class_buys_computer']).size()

Out[161]:

age          class_buys_computer
middle_aged  yes                    4
senior       no                     2
             yes                    3
youth        no                     3
             yes                    2
dtype: int64

In [160]:

getGainA(df, "class_buys_computer")

Out[160]:

{'age': 0.24674981977443933,
 'income': 0.02922256565895487,
 'student': 0.15183550136234159,
 'credit_rating': 0.04812703040826949}

정답

{'age': 0.24674981977443933, 
 'income': 0.02922256565895487, 
 'student': 0.15183550136234159, 
 'credit_rating': 0.04812703040826949}

결과 확인하기

In [139]:

my_dict = getGainA(df, "class_buys_computer")
def f1(x):
    return my_dict[x]
key_max = max(my_dict.keys(), key=f1)
print('정보 획득이 가장 높은 변수는',key_max, "이며 정보 획득량은", my_dict[key_max], "이다.")

정보 획득이 가장 높은 변수는 age 이며 정보 획득량은 0.24674981977443933 이다.

In [7]:

# 정답
my_dict = getGainA(df, "class_buys_computer")
def f1(x):
    return my_dict[x]
key_max = max(my_dict.keys(), key=f1)
print('정보 획득이 가장 높은 변수는',key_max, "이며 정보 획득량은", my_dict[key_max], "이다.")

정보 획득이 가장 높은 변수는 age 이며 정보 획득량은 0.24674981977443933 이다.

PreviousPython을 이용한 Decision Tree (1)NextPython을 이용한 Decision Tree (3)

Last updated 5 years ago

Was this helpful?