Naive Bayes Methodology

Assignment

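Throughout this notebook you will find places marked "assignment" that contain a ''' ? ''' placeholder. Fill in those parts.

Why this submission was selected as outstanding

This submission carefully examined the argument types, return types, and properties of the Naive Bayes functions, and completed the assignment without anything superfluous.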

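Assignment 1 : Trying out Gaussian Naive Bayes Classification

sklearn already provides a class for Gaussian Naive Bayes classification.
All you need to do is use it to make a simple prediction.
Links to the required functions are attached, so read them before using the functions.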

import pandas as pd
import numpy as np

from sklearn.datasets import load_iris

We will use the iris dataset that ships with sklearn.

iris = load_iris()

This loads the iris data.

print(iris.DESCR)

_iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

The description shows that the explanatory variables X are sepal length, sepal width, petal length, and petal width.
The target variable y is the iris species, of which there are three.
Iris-Setosa, Iris-Versicolour, and Iris-Virginica are encoded as 0, 1, and 2.
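The feature and class names described above can also be read directly from the loaded object. The following is a minimal sketch that assumes only the standard `load_iris` attributes `feature_names`, `target_names`, and `target`.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Column names for the four explanatory variables
print(iris.feature_names)

# Species names; position i is the name of class label i
print(iris.target_names)

# Unique integer labels stored in iris.target
labels = sorted(set(iris.target.tolist()))
print(labels)
```

Passing `columns=iris.feature_names` to `pd.DataFrame` would be one way to keep these names on the DataFrame, although the notebook itself uses the default integer column labels.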

X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)

This puts the explanatory variables X and the target variable y into pandas DataFrames.

       0    1    2    3
0    5.1  3.5  1.4  0.2
1    4.9  3.0  1.4  0.2
2    4.7  3.2  1.3  0.2
3    4.6  3.1  1.5  0.2
4    5.0  3.6  1.4  0.2
..   ...  ...  ...  ...
145  6.7  3.0  5.2  2.3
146  6.3  2.5  5.0  1.9
147  6.5  3.0  5.2  2.0
148  6.2  3.4  5.4  2.3
149  5.9  3.0  5.1  1.8

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

This imports the Gaussian Naive Bayes class.

1-1) assignment

Split the data into a train set and a test set at a ratio of 80 to 20.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Use this function, and afterwards explain in a comment what arguments it takes.

"""
train_test_split(*arrays, **options)

(1) Parameters

arrays : sequence of indexables with same length / shape[0]
    The data to split (accepted inputs: lists, NumPy arrays, pandas DataFrames, etc.)

test_size : float, int or None, optional (default=None)
    float -> proportion of the data placed in the test set (0.0 to 1.0)
    int   -> absolute number of test samples
    None  -> complement of train_size; 0.25 if train_size is also None

train_size : float, int or None, optional (default=None)
    float -> proportion of the data placed in the train set (0.0 to 1.0)
    int   -> absolute number of train samples
    None  -> complement of test_size (0.75 with the default test_size)

random_state : int, RandomState instance or None, optional (default=None)
    int         -> seed for the random number generator
    RandomState -> the random number generator itself
    None        -> the random number generator provided by np.random

shuffle : boolean, optional (default=True)
    Whether to shuffle the data before splitting

stratify : array-like or None (default=None)
    If given, the split keeps the class proportions of this array in both the train and test sets
    (if shuffle=False, stratify must be None)

(2) Return

splitting : list, length=2 * len(arrays)

    X_train, X_test, Y_train, Y_test :
        Returned when both data and labels are passed in arrays. The pairing between each sample and its class is preserved.

    X_train, X_test :
        Returned when only data is passed in arrays without labels. Any class column inside the data is split along with it as a single array.
"""
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2)
# data: X, labels: y; test set : train set = 0.2 : 0.8 (1 - 0.2)
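The `random_state` and `stratify` arguments described in the docstring are not used in the call above, so here is a small sketch of what they change. The variable names (`X_demo`, `y_demo`, and so on) and the seed value 0 are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# random_state fixes the shuffle so the split is reproducible;
# stratify=y_demo keeps the three classes balanced in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)   # (120, 4) (30, 4)
print(np.bincount(y_te))        # [10 10 10]
```

Without `random_state`, every run produces a different split, so scores computed later can vary from run to run.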

1-2) assignment

Train a Gaussian Naive Bayes model and then compute its score.
Everything you need is available as a method of the GaussianNB class.
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
Use this class, and afterwards explain in a comment what arguments it takes.

"""
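GaussianNB(priors=None, var_smoothing=1e-09)

(1) Parameters

priors : array-like, shape (n_classes,)
    Prior probabilities of the classes

var_smoothing : float, optional (default=1e-9)
    Portion of the largest feature variance that is added to every variance,
    which keeps the variances from becoming extremely small (calculation stability)

(2) Attributes

class_count_ : array, shape (n_classes,)
    Number of training samples observed in each class of the dependent variable Y

class_prior_ : array, shape (n_classes,)
    Unconditional probability distribution P(Y) of the dependent variable Y (class)

classes_ : array, shape (n_classes,)
    Class labels of the dependent variable Y

sigma_ : array, shape (n_classes, n_features)
    Variance σ² of the normal distribution (exposed as var_ in scikit-learn 1.0 and later)

theta_ : array, shape (n_classes, n_features)
    Mean μ of the normal distribution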
"""

"""
fit(self, X, y, sample_weight=None)

(1) Parameters

X : array-like, shape (n_samples, n_features)
    Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples,)
    Target values.

sample_weight : array-like, shape (n_samples,), optional (default=None)
    Weights applied to individual samples (1. for unweighted).
    New in version 0.17: Gaussian Naive Bayes supports fitting with sample_weight.

(2) Return

self : object
"""

"""
predict(self, X)

(1) Parameters

X : array-like of shape (n_samples, n_features)

(2) Return

C : ndarray of shape (n_samples,)
    Predicted target values for X
"""


gnb = GaussianNB()
y_pred = gnb.fit(X_train,y_train).predict(X_test)
gnb.score(X_test, y_test)
# score = 1.0 on this particular random split

1.0
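To connect the attributes listed in the docstring with an actual fitted model, the sketch below fits GaussianNB on a reproducible split and prints a few of them. The fixed seed, the stratified split, and the names `model` and `proba_te` are assumptions made for illustration. The labels are passed as a 1-D array here; passing a one-column DataFrame as y, as the notebook does, still works but typically triggers a DataConversionWarning.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

model = GaussianNB().fit(X_tr, y_tr)

print(model.classes_)        # class labels
print(model.class_count_)    # training samples per class
print(model.class_prior_)    # empirical P(Y)
print(model.theta_.shape)    # per-class feature means: (n_classes, n_features)

# predict_proba returns one posterior probability per class for each sample
proba_te = model.predict_proba(X_te)
print(proba_te[:3].round(3))

# score is the mean accuracy on the given data
print(model.score(X_te, y_te))
```

The accuracy printed at the end depends on the split, so it need not equal the 1.0 shown above.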

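Assignment 2 : Trying out Naive Bayes Classification

This is a dataset I made up myself.
Each message gets a 0 or 1 for whether the words gamble, money, and hi appear in it, and the spam column records with 0 or 1 whether the message is spam.
The explanatory variables are gamble, money, and hi, and the dependent variable is spam (three features in total; let's compute the probability that a message is spam).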

gamble_spam = {'gamble' : [1,0,1,0,1,0,0,0,1,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,
                           1,0,1,1,0,1,0,1,0,1,1,1,1,0,0,0,1,0,1,0,1,0,1,0,1,
                           0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,1,1,0,1,1,1,1,1,0,0,
                           1,0,0,1,0,0,0,1,1,0,1,0,1,1,0,0,0,1,0,1,1,1,0,1,1],
               'money' : [1,1,1,0,1,0,0,0,1,0,0,0,1,0,1,1,0,1,1,0,1,1,1,1,1,
                          0,0,0,1,1,1,0,0,0,1,1,0,0,0,1,0,1,1,0,1,0,0,1,0,1,
                          1,0,1,1,0,1,0,1,0,1,1,0,0,0,1,1,0,0,0,1,1,1,1,1,1,
                          1,1,0,1,0,1,1,0,0,1,0,1,1,1,1,0,0,1,0,0,1,0,0,1,0],
               'hi' : [0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,1,1,0,1,1,1,0,0,0,0,
                       1,0,0,1,0,0,0,1,0,1,1,1,0,1,1,1,0,0,1,0,1,0,1,1,0,
                       1,0,0,0,1,1,1,1,0,1,0,1,1,0,0,1,1,1,1,0,0,0,0,0,0,
                       1,1,0,0,0,1,1,0,1,0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,0],
                'spam' : [1,0,1,0,1,0,0,0,1,0,0,0,1,1,1,1,0,1,1,0,0,0,1,1,0,
                          1,0,1,1,0,0,1,1,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,0,1,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,
                          1,0,0,1,0,0,0,1,1,0,1,0,1,1,0,0,0,1,0,1,1,1,0,1,1]}

Convert this dictionary into a pandas DataFrame.

df  = pd.DataFrame(gamble_spam, columns = ['gamble', 'money', 'hi', 'spam'])
df

(Output: a DataFrame with 100 rows and 4 columns: gamble, money, hi, spam)

2-1) assignment

Convert this pandas DataFrame into a NumPy array.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.as_matrix.html

spam_data = df.to_numpy()
spam_data
# pandas.DataFrame.to_numpy(self, dtype=None, copy=False) → numpy.ndarray

array([[1, 1, 0, 1],
       [0, 1, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 1, 1, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 1, 1, 1],
       [1, 0, 0, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1],
       [0, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 1, 1, 1],
       [0, 0, 1, 0],
       [0, 1, 1, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 1],
       [0, 1, 0, 1],
       [0, 1, 0, 0],
       [1, 0, 1, 1],
       [0, 0, 0, 0],
       [1, 0, 0, 1],
       [1, 1, 1, 1],
       [0, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 0, 1],
       [1, 0, 1, 1],
       [0, 0, 0, 0],
       [1, 1, 1, 0],
       [1, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 0, 0, 1],
       [0, 0, 1, 1],
       [0, 1, 1, 0],
       [0, 0, 1, 1],
       [1, 1, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 1, 1],
       [0, 1, 0, 0],
       [1, 0, 1, 1],
       [0, 0, 0, 0],
       [1, 1, 1, 1],
       [0, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 1, 1, 0],
       [0, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 1, 1, 0],
       [0, 0, 1, 0],
       [0, 1, 1, 0],
       [1, 0, 0, 0],
       [0, 1, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [1, 1, 0, 0],
       [1, 1, 1, 0],
       [1, 0, 1, 1],
       [0, 0, 1, 1],
       [1, 0, 1, 1],
       [1, 1, 0, 1],
       [1, 1, 0, 1],
       [1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0],
       [1, 1, 1, 1],
       [0, 1, 1, 0],
       [0, 0, 0, 0],
       [1, 1, 0, 1],
       [0, 0, 0, 0],
       [0, 1, 1, 0],
       [0, 1, 1, 0],
       [1, 0, 0, 1],
       [1, 0, 1, 1],
       [0, 1, 0, 0],
       [1, 0, 1, 1],
       [0, 1, 1, 0],
       [1, 1, 0, 1],
       [1, 1, 0, 1],
       [0, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0],
       [1, 0, 0, 1],
       [1, 1, 0, 1],
       [1, 0, 0, 1],
       [0, 0, 1, 0],
       [1, 1, 0, 1],
       [1, 0, 0, 1]], dtype=int64)
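The conversion used above can be tried on a much smaller frame. This sketch uses a hypothetical three-row DataFrame (`small_df`), not the assignment data, to show what `to_numpy()` returns and how a column is then selected by position.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame, used only to illustrate the conversion
small_df = pd.DataFrame({"gamble": [1, 0, 1], "spam": [1, 0, 0]})

arr = small_df.to_numpy()

print(type(arr).__name__)   # ndarray
print(arr.shape)            # (3, 2)
print(arr[:, 1])            # the 'spam' column as a 1-D array
```

The link in the assignment text points to `as_matrix`, which, as far as I know, was removed in pandas 1.0; `to_numpy()` is the replacement the solution uses.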

2-2) assignment

P(spam=1), P(spam=0)
P(gamble=1|spam=1), P(money=1|spam=1), P(hi=1|spam=1)
P(gamble=1|spam=0), P(money=1|spam=0), P(hi=1|spam=0)
Compute the probabilities above.
Start with P(spam).
I have completed only the P(spam=1) case; use it as a reference and the rest should go quickly :)

p_spam = sum(spam_data[:,3]==1)/len(spam_data) # P(spam=1)
p_spam_not = 1- p_spam # P(spam=0)

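# spam follows a Bernoulli distribution, so P(spam=0) = 1 - P(spam=1)

"""
other solution:
p_spam_not = sum(spam_data[:,3]==0)/len(spam_data)

"""
p_spam_not

0.5800000000000001

Compute all the conditional probabilities for gamble, money, and hi.
P(gamble=0|spam=1) can be obtained as 1 - P(gamble=1|spam=1), so it is not computed separately.
The variable names alone may not make clear which conditional probability each value is, so I added a comment right next to each one.
As above, I have done the first one; use it as a reference for the rest.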

p_gamble_spam = sum((spam_data[:, 0] == 1) & (spam_data[:, 3] == 1)) / sum(spam_data[:, 3] == 1)
# P(gamble=1|spam=1) = P(gamble=1∩spam=1)/P(spam=1)
p_gamble_spam_not = sum((spam_data[:, 0] == 1) & (spam_data[:, 3] == 0)) / sum(spam_data[:, 3] == 0)
# P(gamble=1|spam=0) = P(gamble=1∩spam=0)/P(spam=0)

p_money_spam = sum((spam_data[:, 1] == 1) & (spam_data[:, 3] == 1)) / sum(spam_data[:, 3] == 1)
# P(money=1|spam=1) = P(money=1∩spam=1)/P(spam=1)
p_money_spam_not = sum((spam_data[:, 1] == 1) & (spam_data[:, 3] == 0)) / sum(spam_data[:, 3] == 0)
# P(money=1|spam=0) = P(money=1∩spam=0)/P(spam=0)

p_hi_spam = sum((spam_data[:, 2] == 1) & (spam_data[:, 3] == 1)) / sum(spam_data[:, 3] == 1)
# P(hi=1|spam=1) = P(hi=1∩spam=1)/P(spam=1)
p_hi_spam_not = sum((spam_data[:, 2] == 1) & (spam_data[:, 3] == 0)) / sum(spam_data[:, 3] == 0)
# P(hi=1|spam=0) = P(hi=1∩spam=0)/P(spam=0)

p_hi_spam_not

0.4482758620689655
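Because every feature is 0 or 1, the conditional probability P(feature=1|spam=c) is simply the column mean over the rows of class c. The sketch below shows that shortcut on a hypothetical six-row array (`toy`) with the same column order as spam_data; it is an alternative way to get the same kind of numbers, not the method the assignment asks for.

```python
import numpy as np

# Hypothetical toy data; columns are gamble, money, hi, spam
toy = np.array([
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
])
features, label = toy[:, :3], toy[:, 3]

# P(spam=1): fraction of rows labelled 1
p_spam_toy = (label == 1).mean()

# P(feature=1 | spam=1) and P(feature=1 | spam=0) for all three features at once
cond_spam = features[label == 1].mean(axis=0)
cond_not = features[label == 0].mean(axis=0)

print(p_spam_toy)   # 0.5
print(cond_spam)    # gamble: 1, money: 2/3, hi: 1/3
print(cond_not)     # gamble: 0, money: 2/3, hi: 2/3
```

Applied to spam_data, the two mean vectors would correspond to the proba and proba_not lists built in the next step.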

Now build a list of the P(·|spam=1) values and a list of the P(·|spam=0) values.

proba = [p_gamble_spam,p_money_spam,p_hi_spam]
proba_not = [p_gamble_spam_not,p_money_spam_not,p_hi_spam_not]
proba

[0.8333333333333334, 0.5476190476190477, 0.4523809523809524]

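This is the test set.
For example, [0,1,0] means gamble=0, money=1, hi=0, and asks for the probability that such a message is spam.
There are only three explanatory variables, so we will compute the probability P(spam=1|*) for all 8 cases from [0,0,0] to [1,1,1].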

test = [[i,j,k] for i in range(2) for j in range(2) for k in range(2)]

test

[[0, 0, 0],
 [0, 0, 1],
 [0, 1, 0],
 [0, 1, 1],
 [1, 0, 0],
 [1, 0, 1],
 [1, 1, 0],
 [1, 1, 1]]
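The same grid of 0/1 combinations can be generated with `itertools.product`, which scales to more features without adding another nested loop. This is an equivalent sketch; the names `n_features` and `test_alt` are illustrative.

```python
from itertools import product

n_features = 3  # gamble, money, hi

# product yields tuples in the same order as the nested comprehension
test_alt = [list(bits) for bits in product([0, 1], repeat=n_features)]

print(test_alt[0], test_alt[-1])  # [0, 0, 0] [1, 1, 1]
print(len(test_alt))              # 8
```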

2-3) assignment

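Write a function that computes the conditional probability.
x is the argument that says whether the given independent variable is 0 or 1,
and p is the conditional probability that the variable equals 1.
Apply P(X=x|Y=1) = xP(X=1|Y=1)+(1-x)P(X=0|Y=1).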

def con_proba(x,p):
    # x: observed value of the feature (0 or 1), p: P(X=1|Y=y)
    # e.g. money: x=1 gives P(money=1|spam=1) = p, x=0 gives P(money=0|spam=1) = 1-p
    return x*p + (1-x)*(1-p) # P(X=x|Y=y) = xP(X=1|Y=y)+(1-x)P(X=0|Y=y)

Write a function that returns the probabilities for each test case.

def process(p_spam,p_spam_not,test,proba,proba_not):
    result = []
    for i in range(len(test)):
        a = p_spam      # becomes P(spam=1) * product of P(X_j=x_j|spam=1)
        b = p_spam_not  # becomes P(spam=0) * product of P(X_j=x_j|spam=0)
        for j in range(len(test[i])):
            a = a*con_proba(test[i][j],proba[j])
            b = b*con_proba(test[i][j],proba_not[j])
        summation = a+b # normalizing constant so the two values sum to 1
        result.append([a/summation,b/summation])
    return result

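Here are the results. If you get exactly these values, the assignment is a success.
The left value is the probability that the message is spam, and the right value is the probability that it is not.
The more of the words gamble, money, and hi a message contains, the higher its spam probability.
The spam probability exceeds 0.5 in rows 5 through 8 below, so those cases are classified as spam.
In other words, every message that contains the word gamble is classified as spam, whether or not money or hi also appears.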

process(p_spam,p_spam_not,test,proba,proba_not)

Success!

[[0.12561158047604412, 0.874388419523956],
 [0.12744440801721746, 0.8725555919827825],
 [0.1315383089295994, 0.8684616910704007],
 [0.13344441688939654, 0.8665555831106033],
 [0.7542408952456083, 0.24575910475439158],
 [0.7573019784074859, 0.24269802159251408],
 [0.7639150506833852, 0.23608494931661478],
 [0.7668928774284339, 0.23310712257156604]]
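As a cross-check of the hand computation, scikit-learn's BernoulliNB fits the same kind of model on binary features. The sketch below uses a hypothetical eight-row dataset (`toy`) and a helper `manual_posterior` written for this illustration; `alpha` is set very small so that BernoulliNB's additive smoothing barely changes the raw frequency estimates, which is an assumption made to get matching numbers rather than a recommended setting.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data; columns are gamble, money, hi, spam
toy = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])
X_toy, y_toy = toy[:, :3], toy[:, 3]

def manual_posterior(x, X, y):
    """Return [P(y=1|x), P(y=0|x)] from raw frequency estimates."""
    scores = []
    for label in (1, 0):
        rows = X[y == label]
        prior = len(rows) / len(X)
        p1 = rows.mean(axis=0)  # P(X_j=1 | y=label)
        likelihood = np.prod(np.where(x == 1, p1, 1 - p1))
        scores.append(prior * likelihood)
    return np.array(scores) / sum(scores)

x_new = np.array([1, 1, 1])
by_hand = manual_posterior(x_new, X_toy, y_toy)

# Tiny alpha: smoothing is effectively switched off
clf = BernoulliNB(alpha=1e-10).fit(X_toy, y_toy)
# predict_proba columns follow clf.classes_, i.e. [0, 1]
by_sklearn = clf.predict_proba(x_new.reshape(1, -1))[0]

print(by_hand)            # [P(spam=1|x), P(spam=0|x)]
print(by_sklearn[::-1])   # reordered to the same layout
```

With the default `alpha=1.0` the two results would differ slightly, because BernoulliNB then adds one pseudo-count per feature value.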
