# Clustering Practice (2) (EDA, Sklearn)

## Assignment 3

#### - Trying out clustering <a href="#clustering" id="clustering"></a>

{% hint style="info" %}
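**Why this assignment was selected as outstanding**

This submission picks out the key points of each clustering method, works through the assignment accordingly, and closes with a complete visualization that serves as the conclusion, so it was selected as an outstanding assignment.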
{% endhint %}

## Load Dataset

**Import packages**

In \[1]:

```python
# data
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore") 

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# model
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import AffinityPropagation
from sklearn.cluster import MeanShift, estimate_bandwidth

# grid search
from sklearn.model_selection import GridSearchCV

# evaluation
from sklearn.metrics.cluster import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import *
```

**Load mall customers data**

In \[2]:

```python
df = pd.read_csv('Mall_Customers.csv')
df.head()
```

Out\[2]:

|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
| - | ---------- | ------ | --- | ------------------ | ---------------------- |
| 0 | 1          | Male   | 19  | 15                 | 39                     |
| 1 | 2          | Male   | 21  | 15                 | 81                     |
| 2 | 3          | Female | 20  | 16                 | 6                      |
| 3 | 4          | Female | 23  | 16                 | 77                     |
| 4 | 5          | Female | 31  | 17                 | 40                     |

In \[3]:

```python
del df["CustomerID"]
```

* The ID values do not appear to be needed for clustering, so the column was dropped.

In \[4]:

```python
df['Gender'].unique()
```

Out\[4]:

```python
array(['Male', 'Female'], dtype=object)
```

In \[5]:

```python
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
```

* The string-valued column was encoded as integers.

In \[6]:

```python
df.head()
```

Out\[6]:

|   | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
| - | ------ | --- | ------------------ | ---------------------- |
| 0 | 1      | 19  | 15                 | 39                     |
| 1 | 1      | 21  | 15                 | 81                     |
| 2 | 0      | 20  | 16                 | 6                      |
| 3 | 0      | 23  | 16                 | 77                     |
| 4 | 0      | 31  | 17                 | 40                     |

In \[7]:

```python
df.shape
```

Out\[7]:

```
(200, 4)
```

## EDA

### **Describe**

In \[8]:

```python
df.info()
```

```python
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
Gender                    200 non-null int64
Age                       200 non-null int64
Annual Income (k$)        200 non-null int64
Spending Score (1-100)    200 non-null int64
dtypes: int64(4)
memory usage: 6.4 KB
```

In \[9]:

```python
df.isnull().sum()
```

Out\[9]:

```python
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
```

* There are no null values.

In \[10]:

```python
df.describe()
```

Out\[10]:

|       | Gender     | Age        | Annual Income (k$) | Spending Score (1-100) |
| ----- | ---------- | ---------- | ------------------ | ---------------------- |
| count | 200.000000 | 200.000000 | 200.000000         | 200.000000             |
| mean  | 0.440000   | 38.850000  | 60.560000          | 50.200000              |
| std   | 0.497633   | 13.969007  | 26.264721          | 25.823522              |
| min   | 0.000000   | 18.000000  | 15.000000          | 1.000000               |
| 25%   | 0.000000   | 28.750000  | 41.500000          | 34.750000              |
| 50%   | 0.000000   | 36.000000  | 61.500000          | 50.000000              |
| 75%   | 1.000000   | 49.000000  | 78.000000          | 73.000000              |
| max   | 1.000000   | 70.000000  | 137.000000         | 99.000000              |

### **Visualization**

In \[11]:

```python
sns.countplot(x='Gender', data=df)
plt.show()
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2F4wBlCY0RUhkG2zFxGcpV%2Ffile.png?alt=media)

* There are more Female customers than Male customers.

In \[12]:

```python
sns.pairplot(df, hue="Gender")
plt.show()
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FzIn0kWRcbl03aLQi5GIE%2Ffile.png?alt=media)

* A pairplot colored by gender was drawn, but it showed no difference large enough to tell the two groups apart.

In \[13]:

```python
g = sns.heatmap(df.corr(), annot=True, linewidths=.5)
bottom, top = g.get_ylim()  # workaround to keep the heatmap rows from being cut off
g.set_ylim(bottom+0.5, top-0.5)
plt.show()
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FiDwzvlpNV0RSeCW1mVXq%2Ffile.png?alt=media)

* There appears to be almost no linear correlation between the variables.

## Modeling

### **PCA**

In \[14]:

```python
pca = PCA(n_components=2)
reduced_df = pca.fit_transform(df)
reduced_df.shape
```

Out\[14]:

```
(200, 2)
```
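Before relying on a 2-D projection, it can help to check how much of the variance the two components keep. The sketch below is only an illustration on synthetic data: the array `X` and its column distributions are assumptions loosely modelled on the `describe()` output above, not the real `Mall_Customers.csv`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the mall data (hypothetical values, not the real CSV):
# one binary column and three numeric columns with a much wider range.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2, size=200),   # binary, like Gender
    rng.normal(39, 14, size=200),   # like Age
    rng.normal(60, 26, size=200),   # like Annual Income (k$)
    rng.normal(50, 26, size=200),   # like Spending Score (1-100)
])

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

# Share of the total variance kept by each of the two components.
ratios = pca.explained_variance_ratio_
print(reduced.shape, ratios, ratios.sum())
```

Because PCA works on raw variances, a 0/1 column contributes very little next to columns with a standard deviation around 26. Whether to scale before PCA is a design choice the notebook leaves open.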

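* For visualization on a plane, PCA was used to reduce the data to two dimensions.
* Evaluation was done through visualization, the Silhouette Coefficient, and the Davies Bouldin score.
* The ***Silhouette Coefficient*** takes values between -1 and 1; the closer it is to 1, the better the clustering.
* The ***Davies Bouldin Index*** compares the spread within each group to the separation between groups. For every pair of groups, the sum of the two groups' sizes (their within-group scatter) is divided by the distance between their centers, and the index averages, over all groups, the largest such ratio. The smaller the value, the better the clustering.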
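The direction of the two metrics described above can be checked on a toy example. This is a minimal sketch on synthetic blobs (the data, the centers, and the "random labelling" baseline are assumptions made for illustration); it compares a sensible labelling with a random one.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Toy data: three well-separated blobs (synthetic, for illustration only).
X, true_labels = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# A deliberately poor labelling: random assignment to three groups.
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 3, size=len(X))

good_sil = silhouette_score(X, true_labels)
bad_sil = silhouette_score(X, random_labels)
good_dbi = davies_bouldin_score(X, true_labels)
bad_dbi = davies_bouldin_score(X, random_labels)

print(f"silhouette:     good={good_sil:.3f}  random={bad_sil:.3f}")
print(f"davies-bouldin: good={good_dbi:.3f}  random={bad_dbi:.3f}")
```

The sensible labelling should score a higher silhouette and a lower Davies Bouldin value than the random one, which is the reading used throughout the rest of this page.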

### 1. K-Means

In \[15]:

```python
distortions = []
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(reduced_df)
    distortions.append(kmeans.inertia_)

fig = plt.figure(figsize=(10, 5))
plt.plot(range(2, 20), distortions)
plt.grid(True)
plt.title('Elbow curve')
plt.show()
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2Fu5LFtktRzPQD0SE8ShCU%2Ffile.png?alt=media)

* An elbow curve was drawn to choose k for K-Means.
* The curve bends sharply at k=5 (the decrease in distortion levels off from there), so the number of clusters was set to 5.
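Reading an elbow off a plot is somewhat subjective, so it can help to print how much the inertia drops at each step. The following is a sketch on synthetic data, not the mall data; `n_init=10` and `random_state=0` are choices made here for reproducibility and are not part of the original notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with five blobs (an assumption for illustration).
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.8, random_state=42)

ks = list(range(2, 11))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# How much the inertia falls when going from k to k + 1 clusters.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, drop in zip(ks[:-1], drops):
    print(f"k={k} -> k={k + 1}: inertia drops by {drop:.1f}")
```

The elbow is roughly the k after which these drops become small compared with the earlier ones.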

In \[16]:

```python
reduced_df = pd.DataFrame(reduced_df)

km = KMeans(n_clusters=5, init='k-means++')
cluster = km.fit(reduced_df)
cluster_id = pd.DataFrame(cluster.labels_)

d1 = pd.concat([reduced_df, cluster_id], axis=1)
d1.columns = [0, 1, "cluster"]

sns.scatterplot(x=d1[0], y=d1[1], hue=d1['cluster'], legend="full")
sns.scatterplot(x=km.cluster_centers_[:, 0], y=km.cluster_centers_[:, 1], label='Centroids')
plt.title("KMeans Clustering")
plt.legend()
plt.show()

print('Silhouette Coefficient: {:.4f}'.format(metrics.silhouette_score(d1.iloc[:,:-1], d1['cluster'])))
print('Davies Bouldin Index: {:.4f}'.format(metrics.davies_bouldin_score(d1.iloc[:,:-1], d1['cluster'])))
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2F0kcotECNXOWY7gTjiPSi%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.5526
Davies Bouldin Index: 0.5843
```

### 2. DBSCAN

In \[17]:

```python
eps = [2,5,7,8,10]
for i in eps:
    db = DBSCAN(eps=i, min_samples=4)
    cluster = db.fit(reduced_df)
    cluster_id = pd.DataFrame(cluster.labels_)
    
    d2 = pd.DataFrame()
    d2 = pd.concat([reduced_df,cluster_id],axis=1)
    d2.columns = [0, 1, "cluster"]
    
    sns.scatterplot(x=d2[0], y=d2[1], hue=d2['cluster'], legend="full")
    plt.title('DBScan with eps {}'.format(i))
    plt.show()
    
    print('Silhouette Coefficient: {:.4f}'.format(metrics.silhouette_score(d2.iloc[:,:-1], d2['cluster'])))
    print('Davies Bouldin Index: {:.4f}'.format(metrics.davies_bouldin_score(d2.iloc[:,:-1], d2['cluster'])))
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FtoYL2EdOYzsXJTlr0sBL%2Ffile.png?alt=media)

```python
Silhouette Coefficient: -0.3259
Davies Bouldin Index: 9.0856
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FCqmTZu00d0rRvNai7gIs%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.0896
Davies Bouldin Index: 2.1779
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FXbWRcKHXk3IqaL7lmC9Z%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.3060
Davies Bouldin Index: 2.1851
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2F2KYLd2FuTEXXE5DOTcoA%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.3944
Davies Bouldin Index: 2.0000
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FLjXfrn6NqxAk804zTvp2%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.3124
Davies Bouldin Index: 1.9782
```

In \[18]:

```python
ss = StandardScaler()
scaled_df = pd.DataFrame(ss.fit_transform(reduced_df))

eps = [0.1, 0.2, 0.3, 0.4, 0.5]
for i in eps:
    db = DBSCAN(eps=i, min_samples=4)
    cluster = db.fit(scaled_df)
    cluster_id = pd.DataFrame(cluster.labels_)
    
    d3 = pd.DataFrame()
    d3 = pd.concat([scaled_df,cluster_id],axis=1)
    d3.columns = [0, 1, "cluster"]
    
    sns.scatterplot(x=d3[0], y=d3[1], hue=d3['cluster'], legend="full")
    plt.title('DBScan with eps {}'.format(i))
    plt.show()
    
    print('Silhouette Coefficient: {:.4f}'.format(metrics.silhouette_score(d3.iloc[:,:-1], d3['cluster'])))
    print('Davies Bouldin Index: {:.4f}'.format(metrics.davies_bouldin_score(d3.iloc[:,:-1], d3['cluster'])))
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FDGlVna9Jk4i5pVYfmdtu%2Ffile.png?alt=media)

```python
Silhouette Coefficient: -0.3248
Davies Bouldin Index: 6.5649
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2Fug1hYRchvIaSQnNsjw7u%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.0994
Davies Bouldin Index: 2.2154
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FdqUYnW9VYZKwUXah3SFv%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.3939
Davies Bouldin Index: 2.0034
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FK7EowFlI6D4GXImKSWB3%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.3298
Davies Bouldin Index: 1.0988
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FCYuSvkHRoortjjp6zGg5%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.3451
Davies Bouldin Index: 0.8144
```

* DBSCAN was also run on the data scaled with a standard scaler to inspect the resulting clusters.
* The plots show how the clusters change as eps changes.
* For both the scaled and the unscaled data, the Silhouette Coefficient stays below 0.5, so the results do not look very good.
* eps = 8 seems most appropriate for the unscaled data, and eps = 0.3 for the scaled data.
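The effect of eps can also be summarised numerically by counting clusters and noise points (label `-1`) for each setting. This is a minimal sketch on synthetic blobs; the data and the eps grid are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three synthetic blobs (not the mall data).
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.6,
    random_state=0,
)

summary = {}
for eps in [0.1, 0.5, 1.0, 20.0]:
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)
    n_noise = int(np.sum(labels == -1))                          # -1 marks noise
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # exclude noise
    summary[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

A very small eps leaves most points as noise, while a very large eps merges everything into one cluster. Note that `silhouette_score` has no special handling for the label `-1`, so noise points are scored as if they formed one more cluster, which can pull the DBSCAN scores down.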

### 3. Hierarchical agglomerative clustering

In \[19]:

```python
linkage_array = ward(reduced_df)
dendrogram(linkage_array)
plt.xlabel("Sample Num")
plt.ylabel("Cluster Dist")

# draw the cut lines that separate the clusters
ax = plt.gca()
bounds = ax.get_xbound()
ax.plot(bounds, [350, 350], '--', c='k')
ax.plot(bounds, [200, 200], '--', c='k')
ax.text(bounds[1], 350, ' 3 Clusters ', va='center', fontdict={'size': 10})
ax.text(bounds[1], 200, ' 5 Clusters ', va='center', fontdict={'size': 10})
plt.show()
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FYHBCDVx6M1UzCvTJryfu%2Ffile.png?alt=media)

* With more than 5 clusters the distance between clusters drops sharply, so cutting the tree into 5 clusters seems appropriate.
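Once a cut has been chosen from the dendrogram, flat cluster labels can be obtained from the same linkage matrix with SciPy's `fcluster`. The sketch below uses synthetic data as a stand-in for `reduced_df`; the choice of `criterion="maxclust"` is an assumption about how one might apply the cut, not something the notebook does.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 2-D data (illustration only).
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.7, random_state=1)

Z = ward(X)  # linkage matrix with one row per merge: shape (n_samples - 1, 4)

# Ask for at most 5 flat clusters; fcluster labels start at 1, not 0.
labels = fcluster(Z, t=5, criterion="maxclust")

# Heights of the last few merges: a large jump suggests a natural place to cut.
last_heights = Z[-6:, 2]
print(np.unique(labels), last_heights)
```

Alternatively, `criterion="distance"` with `t` set to a height (such as the dashed lines drawn above) cuts the tree at that distance instead of at a cluster count.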

### 4. Agglomerative Clustering

In \[20]:

```python
n = [2,3,5,7,9]
for i in n:
    agg = AgglomerativeClustering(n_clusters=i)
    cluster = agg.fit(scaled_df)
    cluster_id = pd.DataFrame(cluster.labels_)
    
    d4 = pd.DataFrame()
    d4 = pd.concat([scaled_df,cluster_id],axis=1)
    d4.columns = [0, 1, "cluster"]
    
    sns.scatterplot(x=d4[0], y=d4[1], hue=d4['cluster'], legend="full")
    plt.title('Agglomerative with {} clusters'.format(i))
    plt.show()
    
    print('Silhouette Coefficient: {:.4f}'.format(metrics.silhouette_score(d4.iloc[:,:-1], d4['cluster'])))
    print('Davies Bouldin Index: {:.4f}'.format(metrics.davies_bouldin_score(d4.iloc[:,:-1], d4['cluster'])))
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2F5JxzeVKTzTGqquureyzB%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.3794
Davies Bouldin Index: 0.8542
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2F60eS7GNJhO3XuSYnX1l2%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.4493
Davies Bouldin Index: 0.7163
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FoitmhqcjQpkm6alEJuty%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.5473
Davies Bouldin Index: 0.5902
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FC5SEoou0CDRdmErujjEH%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.4436
Davies Bouldin Index: 0.7269
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FeuZOCE8k5gMTJ7Jl6hMQ%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.4532
Davies Bouldin Index: 0.6974
```

* The Silhouette Coefficient is highest when the data is split into 5 clusters.
* The Davies Bouldin Index is also lowest when the data is split into 5 clusters.
* Splitting into 5 clusters therefore seems most appropriate.
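The comparison above can be automated by storing both scores per candidate cluster count and picking the best by each metric. This is a sketch on synthetic blobs with five well-separated centers (an assumption for illustration); the candidate list mirrors the one used in the cell above.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Five well-separated synthetic blobs (not the mall data).
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for n in [2, 3, 5, 7, 9]:
    labels = AgglomerativeClustering(n_clusters=n).fit_predict(X)
    scores[n] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

best_by_silhouette = max(scores, key=lambda n: scores[n][0])  # higher is better
best_by_dbi = min(scores, key=lambda n: scores[n][1])         # lower is better
print(best_by_silhouette, best_by_dbi)
```

When the two metrics agree, as they do in the results above, the choice is straightforward. When they disagree, the scatter plots remain a useful tie-breaker.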

### 5. Affinity Propagation

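* Every data point chooses, according to a given criterion, a representative data point (an exemplar) for itself.
* A point that ends up representing itself becomes the center of a cluster.\
  Reference: <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation>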
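The point that cluster centers are actual data points can be seen from the fitted estimator's `cluster_centers_indices_` attribute. The sketch below uses synthetic blobs; `damping=0.9`, `max_iter=1000`, and `random_state=0` are choices made here to help convergence and reproducibility, whereas the notebook uses the defaults.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Three synthetic blobs (illustration only).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.8,
    random_state=0,
)

ap = AffinityPropagation(damping=0.9, max_iter=1000, random_state=0).fit(X)

# Indices of the samples that were chosen as exemplars.
exemplar_idx = ap.cluster_centers_indices_
exemplars = X[exemplar_idx]

print(len(exemplar_idx), "exemplars")
print(np.allclose(exemplars, ap.cluster_centers_))  # centers are rows of X
```

Unlike K-Means centroids, which are averages, these centers are members of the data set, and the number of clusters is not set in advance.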

In \[21]:

```python
ap = AffinityPropagation()
cluster = ap.fit(scaled_df)
cluster_id = pd.DataFrame(cluster.labels_)

d5 = pd.DataFrame()
d5 = pd.concat([scaled_df,cluster_id],axis=1)
d5.columns = [0, 1, "cluster"]

sns.scatterplot(x=d5[0], y=d5[1], hue=d5['cluster'], legend="full")
plt.title('Affinity Propagation {} clusters'.format(len(d5.cluster.unique())))
plt.show()

print('Silhouette Coefficient: {:.4f}'.format(metrics.silhouette_score(d5.iloc[:,:-1], d5['cluster'])))
print('Davies Bouldin Index: {:.4f}'.format(metrics.davies_bouldin_score(d5.iloc[:,:-1], d5['cluster'])))
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FzRyWDp0Hy6cnyGuVObrz%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.4319
Davies Bouldin Index: 0.7753
```

### 6. Mean Shift

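* The idea is that if each point is repeatedly shifted toward the mode of the data distribution, the points will gather into a reasonable number of groups.
* bandwidth: the width that sets how leniently nearby points are treated as belonging to the same neighbourhood.\
  Reference: <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html#sklearn.cluster.MeanShift>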
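How the bandwidth drives the number of clusters can be seen by varying the `quantile` passed to `estimate_bandwidth`. This is a minimal sketch on synthetic blobs; the quantile grid is an assumption for illustration, and the notebook itself varies `n_samples` instead.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Three synthetic blobs (not the mall data).
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

results = {}
for quantile in [0.1, 0.2, 0.5]:
    # A larger quantile looks at more neighbours, giving a wider bandwidth.
    bw = estimate_bandwidth(X, quantile=quantile, random_state=0)
    labels = MeanShift(bandwidth=bw).fit_predict(X)
    results[quantile] = (bw, len(set(labels)))
    print(f"quantile={quantile}: bandwidth={bw:.2f}, clusters={len(set(labels))}")
```

A wider bandwidth smooths the density estimate, so fewer modes, and therefore fewer clusters, tend to survive.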

In \[22]:

```python
n = [10, 50, 100]    
for i in n:
    bandwidth = estimate_bandwidth(scaled_df, quantile=0.2, n_samples=i)
    ms = MeanShift(bandwidth=bandwidth)
    cluster = ms.fit(scaled_df)
    cluster_id = pd.DataFrame(cluster.labels_)

    d6 = pd.DataFrame()
    d6 = pd.concat([scaled_df,cluster_id],axis=1)
    d6.columns = [0, 1, "cluster"]

    sns.scatterplot(x=d6[0], y=d6[1], hue=d6['cluster'], legend="full")
    plt.title('Mean Shift with {} samples'.format(i))
    plt.show()

    print('Silhouette Coefficient: {:.4f}'.format(metrics.silhouette_score(d6.iloc[:,:-1], d6['cluster'])))
    print('Davies Bouldin Index: {:.4f}'.format(metrics.davies_bouldin_score(d6.iloc[:,:-1], d6['cluster'])))
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FMQECPC5txS3aseIs4LlX%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.5349
Davies Bouldin Index: 0.5895
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FOkNahL7G2eqp91OeYxLf%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.4852
Davies Bouldin Index: 0.7207
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FU2CIiF9md6ZtcFfyI1GX%2Ffile.png?alt=media)

```python
Silhouette Coefficient: 0.5421
Davies Bouldin Index: 0.5857
```

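* The best result is obtained when the bandwidth is estimated from 100 samples (`n_samples=100`): both the Silhouette Coefficient and the Davies Bouldin Index are the best of the three settings.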

## Evaluation <a href="#evaluation" id="evaluation"></a>

### **Unscaled Data**

In \[23]:

```python
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(20, 4), subplot_kw={'xticks': (), 'yticks': ()})
fig.suptitle('Clustering Algorithms (No Scaled)', fontsize=15)

algorithms = [KMeans(n_clusters=5, init='k-means++'), 
              DBSCAN(eps=8, min_samples=4),
              AgglomerativeClustering(n_clusters=5),
              AffinityPropagation(),
              MeanShift(bandwidth=estimate_bandwidth(reduced_df, quantile=0.2, n_samples=100))
              ]

for ax, algorithm in zip(axes.flatten(), algorithms):
    clusters = algorithm.fit_predict(reduced_df)
    sns.scatterplot(x=reduced_df.loc[:, 0], y=reduced_df.loc[:, 1], hue=clusters, ax=ax)
    ax.set(title="{} : {:.4f}".format(algorithm.__class__.__name__,
                                      silhouette_score(reduced_df, clusters)))
plt.show()
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FNusuX1cSMdmBbKIGhDY5%2Ffile.png?alt=media)

### **Scaled Data**

In \[24]:

```python
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(20, 4), subplot_kw={'xticks': (), 'yticks': ()})
fig.suptitle('Clustering Algorithms (Scaled)', fontsize=15)

algorithms = [KMeans(n_clusters=5, init='k-means++'), 
              DBSCAN(eps=0.3, min_samples=4),
              AgglomerativeClustering(n_clusters=5),
              AffinityPropagation(),
              MeanShift(bandwidth=estimate_bandwidth(scaled_df, quantile=0.2, n_samples=100))
              ]

for ax, algorithm in zip(axes.flatten(), algorithms):
    clusters = algorithm.fit_predict(scaled_df)
    sns.scatterplot(x=scaled_df.loc[:, 0], y=scaled_df.loc[:, 1], hue=clusters, ax=ax)
    ax.set(title="{} : {:.4f}".format(algorithm.__class__.__name__,
                                      silhouette_score(scaled_df, clusters)))
plt.show()
```

![](https://firebasestorage.googleapis.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lzv9WQqVErrkv4TUmw2%2Fuploads%2FMHLwQuKvUbGaJx3brYmi%2Ffile.png?alt=media)

* Finally, all the algorithms were visualized side by side.
* According to the silhouette coefficient, K-Means produced the best clustering.
* Comparing the scaled and unscaled data, Affinity Propagation and Mean Shift formed better clusters on the scaled data.
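To compare scaled and unscaled results without reading the scores off subplot titles, the scores can be collected into a small table. This is a sketch on synthetic data with only two of the algorithms; the helper `make_models` is hypothetical and exists only so that each data set gets freshly constructed estimators.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 2-D data (illustration only).
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.8, random_state=3)
X_scaled = StandardScaler().fit_transform(X)


def make_models():
    """Hypothetical helper: return fresh estimators for each run."""
    return {
        "KMeans": KMeans(n_clusters=5, n_init=10, random_state=0),
        "Agglomerative": AgglomerativeClustering(n_clusters=5),
    }


rows = []
for data_name, data in [("unscaled", X), ("scaled", X_scaled)]:
    for name, model in make_models().items():
        labels = model.fit_predict(data)
        rows.append({
            "data": data_name,
            "algorithm": name,
            "silhouette": silhouette_score(data, labels),
        })

# One row per algorithm, one column per data variant.
table = pd.DataFrame(rows).pivot(index="algorithm", columns="data", values="silhouette")
print(table.round(4))
```

The same loop could be extended with the other estimators and with `davies_bouldin_score` as a second value column.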

