sklearn already provides a Gaussian Naive Bayes classifier class.
All you need to do is use it to make a simple prediction.
Links to the functions you need are attached as comments, so read them and use them.
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
We will use the iris dataset that is built into sklearn.
iris = load_iris()
This loads the iris data.
print(iris.DESCR)
_iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
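The description above lists four numeric features and three classes. As a minimal sketch, the features and labels could be pulled out of the loaded dataset like this. The names `X` and `y` are assumptions, chosen only to match the split call used in this notebook.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# X and y are assumed names: X holds the four measurements, y the class index
X = iris.data
y = iris.target

print(X.shape, y.shape)      # number of samples and features
print(iris.feature_names)    # the four attribute names from the description
print(iris.target_names)     # the three iris classes
```

Checking the shapes before splitting is a cheap way to confirm that the data and labels line up row by row.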
"""train_test_split(*arrays, **options)(1) Parameterarrays : sequence of indexables with same length / shape[0] 분할시킬 데이터를 입력 (input format : List, Numpy array, Pandas dataframe 등..)test_size : float, int or None, optional (default=None) input format: float -> test dataset의 비율(0.0~1) int -> test dataset의 개수 (default train_size = 0.25)train_size : float, int or None, optional (default=None) input format: float -> train dataset의 비율(0.0~1) int -> train dataset의 개수 (default train_size = 1-test_size의 나머지 = 0.75)random_state : int, RandomState instance or None, optional (default=None) input format: int -> random number generator의 시드값 RandomState -> random number generator (default random_state = np.random가 제공하는 random number generato)shuffle : boolean, optional (default=True) split전 shuffle의 여부stratify : array-like or None (default=None) test,train data들을 input data의 class비율에 맞게 split할 것인지 여부(shuffle = False이면 stratify = None이어야 한다)(2) Returnsplittinglist, length=2 * len(arrays) X_train, X_test, Y_train, Y_test : arrays에 data와 label을 둘 다 넣었을 경우의 반환. data와 class의 순서쌍은 유지된다. X_train, X_test : arrays에 label 없이 data만 넣었을 경우의 반환. class 값을 포함하여 하나의 data로 반환"""X_train, X_test, y_train, y_test =train_test_split(X,y,test_size =0.2)# data : X label : y, test_set의 비율:train_ser의 비율=0.2:0.8(1-0.2)
p_spam = sum(spam_data[:, 3] == 1) / len(spam_data)  # P(spam=1)
p_spam_not = 1 - p_spam  # P(spam=0)
# The label follows a Bernoulli distribution, so the probability of one outcome is 1 - (probability of the opposite outcome)
"""
other solution:
p_spam_not = sum(spam_data[:, 3] == 0) / len(spam_data)
"""
p_spam_not
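To see the two ways of computing the prior agree, here is a small sketch on a toy array. The contents of `spam_data` below are hypothetical, and the column order (gamble, money, hi, spam) is an assumption based on the label being read from column index 3.

```python
import numpy as np

# Hypothetical toy data; assumed columns: gamble, money, hi, spam
spam_data = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
])

# P(spam=1): fraction of rows whose label column equals 1
p_spam = np.mean(spam_data[:, 3] == 1)

# P(spam=0) via the complement, and via direct counting
p_spam_not = 1 - p_spam
p_spam_not_counted = np.mean(spam_data[:, 3] == 0)

print(p_spam, p_spam_not, p_spam_not_counted)
```

Both routes give the same value because the label only takes the values 0 and 1.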
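Compute all of the conditional probabilities for gamble, money, and hi.
Since P(gamble=0|spam=1) = 1 - P(gamble=1|spam=1), there is no need to compute the complementary cases separately.
It may be hard to tell from the variable names alone which conditional probability each value represents, so a comment is placed right next to each one for reference.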
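One possible way to compute these conditional probabilities is to filter the rows by label and average each word column. This is only a sketch: the toy `spam_data` is hypothetical, the column order (gamble, money, hi, spam) is assumed, and no smoothing is applied, so a word that never appears in a class would get probability 0.

```python
import numpy as np

# Hypothetical toy data; assumed columns: gamble, money, hi, spam
spam_data = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
])

spam_rows = spam_data[spam_data[:, 3] == 1]   # rows labelled spam
ham_rows = spam_data[spam_data[:, 3] == 0]    # rows labelled not spam

# Mean of a 0/1 column is the fraction of ones, i.e. P(word=1 | label)
proba = spam_rows[:, :3].mean(axis=0)       # P(gamble=1|spam=1), P(money=1|spam=1), P(hi=1|spam=1)
proba_not = ham_rows[:, :3].mean(axis=0)    # P(gamble=1|spam=0), P(money=1|spam=0), P(hi=1|spam=0)

# The complementary cases follow directly, e.g. P(gamble=0|spam=1)
p_gamble0_given_spam = 1 - proba[0]

print(proba, proba_not, p_gamble0_given_spam)
```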
def process(p_spam, p_spam_not, test, proba, proba_not):
    result = []
    for i in range(8):  # the 8 test cases
        # a: running product for spam=1, b: running product for spam=0
        a = p_spam
        b = p_spam_not
        for j in range(3):  # the 3 words
            a = a * con_proba(test[i][j], proba[j] if test[i][j] == 1 else (1 - proba[j]))
            b = b * con_proba(test[i][j], proba_not[j] if test[i][j] == 1 else (1 - proba_not[j]))
        # normalize so that the two values sum to 1
        summation = a + b
        result.append([a / summation, b / summation])
    return result
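The same calculation can be written as a self-contained sketch that enumerates all 8 combinations of the three words. The probabilities below are hypothetical placeholders, and `likelihood` is a helper introduced here for illustration; it is not the notebook's `con_proba`, whose definition is not shown in this section.

```python
from itertools import product

# Hypothetical priors and conditional probabilities (order: gamble, money, hi)
p_spam, p_spam_not = 0.4, 0.6
proba = [0.75, 0.75, 0.75]        # P(word=1 | spam=1)
proba_not = [1 / 6, 1 / 6, 1 / 3]  # P(word=1 | spam=0)


def likelihood(x, p):
    """Bernoulli likelihood: p if the word is present, 1 - p otherwise."""
    return p if x == 1 else 1 - p


def posterior(features):
    """Return (P(spam=1 | features), P(spam=0 | features)) under the naive Bayes assumption."""
    a, b = p_spam, p_spam_not
    for x, p1, p0 in zip(features, proba, proba_not):
        a *= likelihood(x, p1)
        b *= likelihood(x, p0)
    total = a + b  # normalizing constant
    return a / total, b / total


# All 8 combinations of (gamble, money, hi), from (0, 0, 0) to (1, 1, 1)
test = list(product([0, 1], repeat=3))
results = [posterior(features) for features in test]

for features, (ps, pn) in zip(test, results):
    print(features, round(ps, 4), round(pn, 4))
```

Dividing by `a + b` is what turns the two unnormalized products into probabilities that sum to 1 for each row.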
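Here are the results. If you get exactly the same values as the following, the assignment is a success.
The left column is the probability that the message is spam, and the right column is the probability that it is not.
The more of the words gamble, money, and hi a message contains, the more likely it is to be spam.
Rows 6, 7, and 8 below, where the spam probability exceeds 0.5, are the cases classified as spam.
In other words, a message appears to be classified as spam when gamble occurs together with at least one of money or hi.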