sklearn already provides a Gaussian Naive Bayes classifier class.
You can simply use it to make the predictions.
Links to the functions you need are attached as comments; please look them over before use.
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
We will use the iris dataset built into sklearn.
iris = load_iris()
Load the iris dataset.
print(iris.DESCR)
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al.'s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
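Below is a minimal sketch of the GaussianNB workflow mentioned at the top, assuming a standard train/test split; the test_size and random_state values here are illustrative choices, not part of the original assignment.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split the 150 iris samples into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# GaussianNB models each feature as a class-conditional Gaussian
# and applies Bayes' rule to obtain class posteriors
model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))  # fraction of correct predictions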
p_spam = sum(spam_data[:,3]==1)/len(spam_data) # P(spam=1)
p_spam_not = 1 - p_spam # P(spam=0)
# Since the label follows a Bernoulli distribution, the probability of one outcome is 1 - (probability of the opposite outcome)
"""
other solution:
p_spam_not = sum(spam_data[:,3]==0)/len(spam_data)
"""
p_spam_not
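Note that spam_data is not defined in this excerpt. The snippets above and below assume a NumPy array whose columns are binary indicators [gamble, money, hi, spam]; the rows here are a made-up stand-in for illustration only, not the actual assignment data.

import numpy as np

# Hypothetical training data; columns = [gamble, money, hi, spam].
# Illustrative values only -- the real assignment data is not shown here.
spam_data = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
])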
Please compute all the conditional probabilities for gamble, money, and hi.
Since P(gamble=0|spam=1) can be obtained as 1 - P(gamble=1|spam=1), we do not compute those cases separately.
Since it may be hard to tell from the name alone which conditional probability each value represents, I've added a comment right next to each one; please refer to them (a sketch of this step follows below).
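A sketch of that conditional-probability step, assuming the spam_data column layout above; the names proba and proba_not are chosen only to match the process function below. By the naive Bayes independence assumption, the posterior is P(spam=1|x) ∝ P(spam=1) · Π_j P(x_j|spam=1), which is exactly the product that process accumulates.

# Split rows by label, then estimate P(word_j=1 | class) for each word column
spam_rows = spam_data[spam_data[:, 3] == 1]
ham_rows  = spam_data[spam_data[:, 3] == 0]

proba     = [sum(spam_rows[:, j] == 1) / len(spam_rows) for j in range(3)]  # P(word_j=1|spam=1)
proba_not = [sum(ham_rows[:, j] == 1) / len(ham_rows) for j in range(3)]    # P(word_j=1|spam=0)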
def process(p_spam, p_spam_not, test, proba, proba_not):
    result = []
    for i in range(len(test)):  # one iteration per test message
        a = p_spam      # running numerator for spam:     P(spam=1) * prod_j P(x_j|spam=1)
        b = p_spam_not  # running numerator for not-spam: P(spam=0) * prod_j P(x_j|spam=0)
        for j in range(3):  # the three words: gamble, money, hi
            a *= proba[j] if test[i][j] == 1 else (1 - proba[j])
            b *= proba_not[j] if test[i][j] == 1 else (1 - proba_not[j])
        summation = a + b  # normalizer P(x), so the two posteriors sum to 1
        result.append([a / summation, b / summation])  # [P(spam=1|x), P(spam=0|x)]
    return result
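A hedged usage sketch: with three binary words there are 2**3 = 8 possible messages, which matches the loop over eight test rows; the actual test matrix isn't shown in this excerpt, so the enumeration order below is an assumption.

# Assumed test set: all 8 combinations of (gamble, money, hi)
test = [[g, m, h] for g in (0, 1) for m in (0, 1) for h in (0, 1)]

for row, (p1, p0) in zip(test, process(p_spam, p_spam_not, test, proba, proba_not)):
    print(row, round(p1, 4), round(p0, 4))  # [words], P(spam=1|x), P(spam=0|x)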
Here are the results; if you get the same values, the assignment is a success.
The left value is the probability that a message is spam, the right value the probability that it is not.
You can see that the more of the words gamble, money, and hi a message contains, the more likely it is to be spam.
Rows 6, 7, and 8 below, where the spam probability exceeds 0.5, are the ones classified as spam.
In other words, it appears that a message is classified as spam whenever the word gamble occurs together with at least one of money or hi.