Python을 이용한 실전 머신러닝 (2)

%matplotlib inline

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
plt.rc('font', family='Malgun Gothic')
plt.rc('axes', unicode_minus=False)

import warnings
warnings.filterwarnings('ignore')

In [2]:

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

import xgboost as xgb
import lightgbm as lgb

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error

1. Data Description

출처: Dacon 13회 제주 퇴근시간 버스승차인원 예측

평가지표: RMSE

In [3]:

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
bus = pd.read_csv("bus_bts.csv")

1. train.csv / test.csv

train: 2019-09-01 ~ 2019-09-30 / test: 2019-10-01 ~ 2019-10-16

  • id : id 해당 데이터에서의 고유한 ID(train, test와의 중복은 없음)

  • data : 날짜

  • bus_route_id : 노선 ID

  • in_out : 시내버스, 시외버스 구분

  • station_code : 승하차 정류소 ID

  • station_name : 승하차 정류소 이름

  • lattitude : 위도 (같은 정류장 이름이어도 버스의 진행 방향에 따라 다를 수 있음)

  • logitude : 경도 (같은 정류장 이름이어도 버스의 진행 방향에 따라 다를 수 있음)

  • h~h+1_ride : 해당 시간 사이 승차한 인원수

  • h~h+1_takeoff : 해당 시간 사이 하차한 인원수

  • 18~20_ride : 해당시간 사이 승차한 인원수 (target variable)

In [4]:

def display_data(data, num):
    with pd.option_context('display.max_rows', None, 'display.max_columns', None): # temporarily unleash limit
        print('dataset shape is: {}'.format(data.shape))
        display(data.head(num).append(data.tail(num)))
        print("Number of missing values:\n",data.isnull().sum(),"\n")

In [5]:

display_data(train, 5)
dataset shape is: (415423, 21)
Number of missing values:
 id               0
date             0
bus_route_id     0
in_out           0
station_code     0
station_name     0
latitude         0
longitude        0
6~7_ride         0
7~8_ride         0
8~9_ride         0
9~10_ride        0
10~11_ride       0
11~12_ride       0
6~7_takeoff      0
7~8_takeoff      0
8~9_takeoff      0
9~10_takeoff     0
10~11_takeoff    0
11~12_takeoff    0
18~20_ride       0
dtype: int64 

In [6]:

display_data(test, 5)
dataset shape is: (228170, 20)
Number of missing values:
 id               0
date             0
bus_route_id     0
in_out           0
station_code     0
station_name     0
latitude         0
longitude        0
6~7_ride         0
7~8_ride         0
8~9_ride         0
9~10_ride        0
10~11_ride       0
11~12_ride       0
6~7_takeoff      0
7~8_takeoff      0
8~9_takeoff      0
9~10_takeoff     0
10~11_takeoff    0
11~12_takeoff    0
dtype: int64 

In [8]:

figure, (ax1, ax2) = plt.subplots(ncols=2)
figure.set_size_inches(12,4)

sns.scatterplot( x = 'longitude', y = 'latitude', data = train, alpha = 0.1, ax=ax1)
sns.scatterplot( x = 'longitude', y = 'latitude', data = test, alpha = 0.1, ax=ax2)

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x180c98fb128>

왼쪽 위의 값이 이상치처럼 보이지만, test data에도 이러한 값이 존재하므로 제거하지 않기로 한다

2. bus_bts.csv

2019-09-01 ~ 2019-10-16

  • user_card_id: 승객의 버스카드ID

  • bus_route_id: 노선ID

  • vhc_id: 차량ID

  • geton_date: 승객이 탑승한 날짜

  • geton_time: 승객이 탑승한 시간

  • geton_station_code: 승차정류소의 ID

  • geton_station_name: 승차정류소의 이름

  • getoff_date: 해당 승객이 하차한 날짜 (하차태그 없는 경우, NaN)

  • getoff_time: 해당 승객이 하차한 시간 (하차태그 없는 경우, NaN)

  • getoff_station_code: 하차정류소의 ID (하차태그 없는 경우, NaN)

  • getoff_station_name: 하차정류소의 이름 (하차태그 없는 경우, NaN)

  • user_category: 승객 구분 (01-일반, 02-어린이, 04-청소년, 06-경로, 27-장애 일반, 28-장애 동반, 29-유공 일반, 30-유공 동반)

  • user_count: 해당 버스카드로 계산한 인원수 ( ex- 3은 3명 분의 버스비를 해당 카드 하나로 계산한 것)

In [9]:

display_data(bus, 5)
dataset shape is: (2409414, 13)
Number of missing values:
 user_card_id                0
bus_route_id                0
vhc_id                      0
geton_date                  0
geton_time                  0
geton_station_code          0
geton_station_name         49
getoff_date            895736
getoff_time            895736
getoff_station_code    895736
getoff_station_name    895775
user_category               0
user_count                  0
dtype: int64 

bus 데이터는 train, test 날짜의 데이터를 모두 가지고 있으므로, 날짜에 맞춰서 split 해 주어야 한다In [10]:

bus = bus.sort_values('geton_date').reset_index().drop(['index'], axis = 1)

In [11]:

display_data(bus.loc[bus["geton_date"] == "2019-10-01"], 1)
dataset shape is: (62682, 13)
Number of missing values:
 user_card_id               0
bus_route_id               0
vhc_id                     0
geton_date                 0
geton_time                 0
geton_station_code         0
geton_station_name         1
getoff_date            24861
getoff_time            24861
getoff_station_code    24861
getoff_station_name    24861
user_category              0
user_count                 0
dtype: int64 

In [12]:

bus_train = bus.loc[:1548758]
bus_test = bus.loc[1548759:]

In [13]:

bus_train.shape, bus_test.shape

Out[13]:

((1548759, 13), (860655, 13))

3. target distribution

In [14]:

train['18~20_ride'].describe()

Out[14]:

count    415423.000000
mean          1.242095
std           4.722287
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max         272.000000
Name: 18~20_ride, dtype: float64

In [15]:

figure, (ax1, ax2) = plt.subplots(ncols=2)
figure.set_size_inches(12,4)

sns.boxplot(train[["18~20_ride"]], ax=ax1)
sns.distplot(train[["18~20_ride"]],ax=ax2)

Out[15]:

<matplotlib.axes._subplots.AxesSubplot at 0x180c9a3ccc0>
  • 예측해야 할 target 변수의 값이 0에 몰려있음을 알 수 있다

  • 승차인원 예측 데이터의 경우 선형회귀를 진행하기보다는 부스팅 모델을 사용하는 것이 더 나아 보이므로, 로그변환은 하지 않기로 결정한다

2. Data Preprocessing

1. 날짜

1) datetime 형태로 바꿔주기

In [16]:

train['date'] = pd.to_datetime(train['date'])
test['date'] = pd.to_datetime(test['date'])

In [17]:

bus_train['geton_date'] = pd.to_datetime(bus_train['geton_date'])
bus_test['geton_date'] = pd.to_datetime(bus_test['geton_date'])
bus_train['getoff_date'] = pd.to_datetime(bus_train['getoff_date'])
bus_test['geton_date'] = pd.to_datetime(bus_test['geton_date'])

2) 날짜 변수 생성: day

In [18]:

train['day'] = pd.to_datetime(train['date']).dt.day
test['day'] = pd.to_datetime(test['date']).dt.day

3) 주말 변수 생성: 월~금 1, 토~일 0

In [19]:

# datetime weekday: 월요일 0 ~ 일요일 6

train['weekday'] = train['date'].dt.weekday
test['weekday'] = test['date'].dt.weekday

In [20]:

# 주말 변수 생성: 월~금 1, 토~일 0

train['weekend'] = train['weekday'].map(lambda x : 1 if x<=5 else 0)
test['weekend'] = test['weekday'].map(lambda x : 1 if x<=5 else 0)

In [25]:

figure, axes = plt.subplots(ncols=2)
figure.set_size_inches(12,5)

sns.countplot(train['weekend'], ax=axes[0])
sns.boxplot(x= train['weekend'], y=train["18~20_ride"], ax=axes[1])

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x180c999e4e0>

4) 공휴일 변수 생성 (적용되지 않음 ㅠㅠ)

In [21]:

def holiday(x): # 왜 안될까 
    if x in ['2019-09-12','2019-09-13','2019-10-03','2019-10-09']:
        return 1
    else:
        return 0

In [22]:

train['holiday'] = train['date'].apply(holiday) 
test['holiday'] = test['date'].apply(holiday)

In [24]:

train['holiday'].value_counts()

Out[24]:

0    415423
Name: holiday, dtype: int64

In [ ]:

def holiday9(x): # 2019-09-12, 2019-09-13
    if x in [12, 13]:
        return 1
    else:
        return 0

In [ ]:

def holiday10(x): # 2019-10-03, 2019-10-09
    if x in [3, 9]:
        return 1
    else:
        return 0

In [ ]:

train['holiday'] = train['day'].apply(holiday9) 
test['holiday'] = test['day'].apply(holiday10)

2. 버스 타입

시내 / 시외버스: dummy 변수화

In [26]:

train['in_out'] = train['in_out'].map({'시내':0,'시외':1})
test['in_out'] = test['in_out'].map({'시내':0,'시외':1})

In [27]:

figure, axes = plt.subplots(ncols=2)
figure.set_size_inches(12,5)

sns.countplot(train['in_out'], ax=axes[0])
sns.boxplot(x= train['in_out'], y=train["18~20_ride"], ax=axes[1])

Out[27]:

<matplotlib.axes._subplots.AxesSubplot at 0x1808014abe0>

3. bus_route_id

train/test data와 bus_bts data에서 겹치는 feature는 bus_route_id (노선ID) 이다. 이를 기준으로 새로운 변수를 생성하고, train/test data와 bus data를 합친다.

1) train/test 데이터와 bus_bts 데이터의 노선아이디의 unique 갯수가 같은 지 확인한다

In [28]:

df = pd.concat([train, test], sort=False)
len(df['bus_route_id'].unique())

Out[28]:

631

In [29]:

len(bus['bus_route_id'].unique())

Out[29]:

630

1개가 다른걸 알 수 있으며, bus_bts가 가지고 있지 않은 노선번호를 확인한다In [30]:

no_train_route = []

for i in train['bus_route_id'].unique():
      if i not in bus_train['bus_route_id'].unique():
            no_train_route.append(i)

print(len(no_train_route), no_train_route)
1 [30950000]

In [31]:

no_test_route = []

for i in test['bus_route_id'].unique():
      if i not in bus_test['bus_route_id'].unique():
            no_test_route.append(i)

print(len(no_test_route), no_test_route)
1 [31120000]

In [33]:

display(test.loc[test['bus_route_id']==31120000])

Out[33]:

1 rows × 24 columns

2) user_category 기준으로 새로운 feature 생성하기

각각 bus_route_id 에서, 어떤 타입의 탑승객이 몇 명씩 버스를 타고 내렸는지에 대한 변수를 생성한다

In [34]:

bus_cate_train = bus_train[['bus_route_id', 'geton_date', 'geton_station_code', 'user_category']]
bus_cate_train.head()

Out[34]:

In [35]:

bus_cate_train = pd.get_dummies(bus_train, columns=['user_category']) # user type 각각에 대한 dummy 변수 생성

In [36]:

bus_train_group = bus_cate_train.groupby(['bus_route_id']).sum().reset_index() # 각각의 bus_route에서, 타입별로 몇 명씩 승하차했는지 계산

In [37]:

bus_train_group.head()

Out[37]:

In [38]:

# test에도 적용 
bus_cate_test = bus_test[['bus_route_id', 'geton_date', 'geton_station_code', 'user_category']]

In [39]:

bus_cate_test = pd.get_dummies(bus_test, columns=['user_category'])

In [40]:

bus_test_group = bus_cate_test.groupby(['bus_route_id']).sum().reset_index()

In [41]:

bus_test_group.head()

Out[41]:

In [42]:

bus_train_group.shape, bus_test_group.shape

Out[42]:

((612, 14), (600, 14))

In [43]:

train = pd.merge(train, bus_train_group, on = 'bus_route_id', how = 'left')
test = pd.merge(test, bus_test_group, on = 'bus_route_id', how='left')

In [44]:

train.shape, test.shape

Out[44]:

((415423, 38), (228170, 37))

In [45]:

display_data(train,5)
dataset shape is: (415423, 38)
Number of missing values:
 id                     0
date                   0
bus_route_id           0
in_out                 0
station_code           0
station_name           0
latitude               0
longitude              0
6~7_ride               0
7~8_ride               0
8~9_ride               0
9~10_ride              0
10~11_ride             0
11~12_ride             0
6~7_takeoff            0
7~8_takeoff            0
8~9_takeoff            0
9~10_takeoff           0
10~11_takeoff          0
11~12_takeoff          0
18~20_ride             0
day                    0
weekday                0
weekend                0
holiday                0
user_card_id           2
vhc_id                 2
geton_station_code     2
getoff_station_code    2
user_count             2
user_category_1        2
user_category_2        2
user_category_4        2
user_category_6        2
user_category_27       2
user_category_28       2
user_category_29       2
user_category_30       2
dtype: int64 

In [74]:

# 결측값의 경우 train/test data에 bus data의 노선 번호가 존재하지 않는 경우 발생한다 
# 이 때의 값을 살펴보면, 대부분 ride, takeoff 값이 0이므로 그냥 0으로 채워주기로 한다  
train = train.fillna(0) 
test = test.fillna(0)

4. bus_route_id / station_code / station_name

  • 새로운 feature 생성

In [47]:

print('bus_route_id unique : {}'.format(len(train['bus_route_id'].unique())))
print('station_code unique : {}'.format(len(train['station_code'].unique())))
print('station_name unique : {}'.format(len(train['station_name'].unique())))
bus_route_id unique : 613
station_code unique : 3563
station_name unique : 1961

1) bus_route_id

bus_route_id(노선ID)를 그룹화하여, 18~20_ride 값의 평균을 구해 새로운 변수로 생성한다In [48]:

train_bus_route = train[['18~20_ride','bus_route_id']].groupby('bus_route_id').mean().sort_values('18~20_ride').reset_index()

In [49]:

train = pd.merge(train, train_bus_route, on = 'bus_route_id', how = 'left')
test = pd.merge(test, train_bus_route, on = 'bus_route_id', how='left')

In [50]:

train.rename(columns = {'18~20_ride_x' : '18~20_ride', '18~20_ride_y' : 'bus_route_mean'}, inplace = True)
test.rename(columns = {'18~20_ride' : 'bus_route_mean'}, inplace = True)

In [51]:

#test 데이터에 NaN값이 생기는데, 이것은 test데이터에 같은 key가 없으면 NaN값으로 대체된다.
#따라서, 그 경우 우선 train의 중앙값을 가지고와 test의 변수에 대체해준다.

test['bus_route_mean'].fillna(train['bus_route_mean'].median(),inplace = True)

2) station_code

station_code(승하차정류소 ID)를 그룹화하여, 18~20_ride 값의 평균을 구해 새로운 변수로 생성한다In [52]:

train_st_code = train[['18~20_ride','station_code']].groupby('station_code').mean().sort_values('18~20_ride').reset_index()

In [53]:

train = pd.merge(train, train_st_code, on = 'station_code', how = 'left')
test = pd.merge(test, train_st_code, on = 'station_code', how='left')

In [54]:

train.rename(columns = {'18~20_ride_x' : '18~20_ride', '18~20_ride_y' : 'station_code_mean'}, inplace = True)
test.rename(columns = {'18~20_ride' : 'station_code_mean'}, inplace = True)

In [55]:

# 위의 경우와 마찬가지로 train의 중앙값을 가지고와 test의 변수에 대체해준다.
test['station_code_mean'].fillna(train['station_code_mean'].median(),inplace = True)

3) station_name

station_name(승하차정류소 이름)을 그룹화하여, 18~20_ride 값의 평균을 구해 새로운 변수로 생성한다 station_code의 경우와 비슷한 값을 가지는 변수가 생성될 것이다In [56]:

train_name_code = train[['18~20_ride','station_name']].groupby('station_name').mean().sort_values('18~20_ride').reset_index()

In [57]:

train = pd.merge(train, train_name_code, on = 'station_name', how = 'left')
test = pd.merge(test, train_name_code, on = 'station_name', how='left')

In [58]:

train.rename(columns = {'18~20_ride_x' : '18~20_ride', '18~20_ride_y' : 'station_name_mean'}, inplace = True)
test.rename(columns = {'18~20_ride' : 'station_name_mean'}, inplace = True)

In [59]:

# 위의 경우와 마찬가지로 train의 중앙값을 가지고와 test의 변수에 대체해준다.
test['station_name_mean'].fillna(train['station_name_mean'].median(),inplace = True)

5. station_name

  • 수요가 많을 것으로 예상되는 정류장

1) 학교

In [62]:

g = df[df['station_name'].str.contains('고등학교')]
highschool = list(g['station_name'].unique())

g = df[df['station_name'].str.contains('대학교')]
university = list(g['station_name'].unique())

In [63]:

def school(x):
    if x in highschool:
        return 1
    elif x in university:
        return 1
    else:
        return 0

In [64]:

train['school'] = train['station_name'].apply(school) 
test['school'] = test['station_name'].apply(school)

2) 공항, 환승, 터미널

In [65]:

g = df[df['station_name'].str.contains('환승')]
transfer = list(g['station_name'].unique())

g = df[df['station_name'].str.contains('공항')]
airport = list(g['station_name'].unique())

g = df[df['station_name'].str.contains('터미널')]
terminal = list(g['station_name'].unique())

In [66]:

def station(x):
    if x in transfer:
        return 1
    elif x in airport:
        return 1
    elif x in terminal:
        return 1
    else:
        return 0

In [67]:

train['station'] = train['station_name'].apply(station) 
test['station'] = test['station_name'].apply(station)

In [75]:

display_data(train, 3)
dataset shape is: (415423, 43)
Number of missing values:
 id                     0
date                   0
bus_route_id           0
in_out                 0
station_code           0
station_name           0
latitude               0
longitude              0
6~7_ride               0
7~8_ride               0
8~9_ride               0
9~10_ride              0
10~11_ride             0
11~12_ride             0
6~7_takeoff            0
7~8_takeoff            0
8~9_takeoff            0
9~10_takeoff           0
10~11_takeoff          0
11~12_takeoff          0
18~20_ride             0
day                    0
weekday                0
weekend                0
holiday                0
user_card_id           0
vhc_id                 0
geton_station_code     0
getoff_station_code    0
user_count             0
user_category_1        0
user_category_2        0
user_category_4        0
user_category_6        0
user_category_27       0
user_category_28       0
user_category_29       0
user_category_30       0
bus_route_mean         0
station_code_mean      0
station_name_mean      0
school                 0
station                0
dtype: int64 

6. 최종 전처리

1) 범주형 변수: LabelEncoding

In [76]:

cols = ('bus_route_id','station_code', "geton_station_code", "getoff_station_code")

for col in cols: 
    le = LabelEncoder()
    le.fit(list(train[col].values))
    train[col] = le.transform(list(train[col].values))    
    
    le.fit(list(test[col].values))
    test[col] = le.transform(list(test[col].values))

    
print(train.shape, test.shape)
(415423, 43) (228170, 42)

2) id 변수: drop features

In [80]:

dropfeatures = ["id", "date", "station_name", "user_card_id", "vhc_id", 
               "latitude", "longitude"]

In [81]:

train = train.drop(dropfeatures, axis=1)
test = test.drop(dropfeatures, axis=1)

In [82]:

display_data(train, 3)
dataset shape is: (415423, 36)
Number of missing values:
 bus_route_id           0
in_out                 0
station_code           0
6~7_ride               0
7~8_ride               0
8~9_ride               0
9~10_ride              0
10~11_ride             0
11~12_ride             0
6~7_takeoff            0
7~8_takeoff            0
8~9_takeoff            0
9~10_takeoff           0
10~11_takeoff          0
11~12_takeoff          0
18~20_ride             0
day                    0
weekday                0
weekend                0
holiday                0
geton_station_code     0
getoff_station_code    0
user_count             0
user_category_1        0
user_category_2        0
user_category_4        0
user_category_6        0
user_category_27       0
user_category_28       0
user_category_29       0
user_category_30       0
bus_route_mean         0
station_code_mean      0
station_name_mean      0
school                 0
station                0
dtype: int64 

In [83]:

display_data(test, 3)
dataset shape is: (228170, 35)
Number of missing values:
 bus_route_id           0
in_out                 0
station_code           0
6~7_ride               0
7~8_ride               0
8~9_ride               0
9~10_ride              0
10~11_ride             0
11~12_ride             0
6~7_takeoff            0
7~8_takeoff            0
8~9_takeoff            0
9~10_takeoff           0
10~11_takeoff          0
11~12_takeoff          0
day                    0
weekday                0
weekend                0
holiday                0
geton_station_code     0
getoff_station_code    0
user_count             0
user_category_1        0
user_category_2        0
user_category_4        0
user_category_6        0
user_category_27       0
user_category_28       0
user_category_29       0
user_category_30       0
bus_route_mean         0
station_code_mean      0
station_name_mean      0
school                 0
station                0
dtype: int64 

3. Modeling

0. LGBM (Dacon SCORE: 2.58261)

  • 일단 부스팅 모델에 적합해보기

In [84]:

X_train = train.drop(["18~20_ride"], axis=1)
X_test = test
y_train = train["18~20_ride"]

In [85]:

X_train.shape, X_test.shape, y_train.shape

Out[85]:

((415423, 35), (228170, 35), (415423,))

In [140]:

#Validation function
n_folds = 5

def rmse_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(X_train.values)
    rmse= np.sqrt(-cross_val_score(model, X_train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

In [92]:

grid_params = {
    'n_estimators': [200, 500, 1000],
    'learning_rate' : [0.025, 0.05],
    'num_leaves': [3, 4, 5],
    'min_child_samples' : [40, 60],
    'subsample' : [ 0.6, 0.8 ]
    }

In [93]:

gs = GridSearchCV(
    lgb.LGBMRegressor(),
    grid_params,
    verbose = 10,
    cv = 3,
    n_jobs = -1
    )

In [94]:

gs_results = gs.fit(X_train, y_train, verbose=True)
Fitting 3 folds for each of 72 candidates, totalling 216 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   45.5s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  8.0min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed: 12.2min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed: 13.3min
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed: 15.4min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed: 18.3min
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed: 24.6min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed: 26.7min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed: 29.9min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 36.6min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed: 41.4min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 44.4min
[Parallel(n_jobs=-1)]: Done 216 out of 216 | elapsed: 53.8min finished

In [95]:

print("Best Parameter: {}".format(gs.best_params_))
Best Parameter: {'learning_rate': 0.05, 'min_child_samples': 40, 'n_estimators': 1000, 'num_leaves': 5, 'subsample': 0.6}

In [97]:

model_lgb = lgb.LGBMRegressor(learning_rate=0.05, min_child_samples=40, n_estimators=1000, num_leaves=5, subsample=0.6)

In [98]:

# X_train만 넣어서 모델 성능 평가해야 하는데... y값이 포함된 train 데이터셋을 넣어서 성능이 매우 좋게 나왔습니다... 
# 다시 할 엄두가 나지 않아서......... 기회가 된다면 다시 해 보겠습니다...

score = rmse_cv(model_lgb) 
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))
LGBM score: 0.4206 (0.1550)

In [100]:

model_lgb.fit(X_train.values, y_train)
lgb_train_pred = model_lgb.predict(X_train.values)
lgb_pred = model_lgb.predict(X_test.values)

In [102]:

sub = pd.read_csv('submission_sample.csv')
sub['18~20_ride'] = lgb_pred
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # 승차인원이 (-)일 수는 없으므로, 0보다 작은 값은 0으로 채워준다 
sub.to_csv('submission.csv', index=False)
  • 데이터가 너무 많아서 grid search하는 데에 오래 걸리므로, 랜덤으로 20%만 뽑아서 최적의 hyperparameter를 찾아보기로 한다

  • LGBM 외에도 다른 괜찮은 모델이 있는지 찾아보기로 한다

In [104]:

train2= train.sample(frac = 0.2, random_state=28)
test2 = test.sample(frac = 0.2, random_state=28)

train2.shape, test2.shape

Out[104]:

((83085, 36), (45634, 35))

In [105]:

X_train2 = train2.drop(["18~20_ride"], axis=1)
X_test2 = test2
y_train2 = train2["18~20_ride"]

In [106]:

X_train2.shape, X_test2.shape, y_train2.shape

Out[106]:

((83085, 35), (45634, 35), (83085,))

1) Gradient Boosting Regression

In [111]:

grid_params = {
    'loss': ['huber'], 'learning_rate': [0.02, 0.05], 'n_estimators': [500, 1000], 
    'max_depth':[3, 5], 'min_samples_leaf':[2, 3], 'subsample' : [0.6, 0.8]
}

In [112]:

gs = RandomizedSearchCV(
    GradientBoostingRegressor(),
    grid_params,
    verbose = 10,
    cv = 3,
    n_jobs = -1
    )

In [113]:

gs_results = gs.fit(X_train2, y_train2)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  9.7min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed: 20.7min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed: 31.3min
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed: 43.0min remaining:  4.8min
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 45.2min finished

In [114]:

print("Best Parameter: {}".format(gs.best_params_))
Best Parameter: {'subsample': 0.8, 'n_estimators': 1000, 'min_samples_leaf': 3, 'max_depth': 5, 'loss': 'huber', 'learning_rate': 0.05}

2) XGBoost

In [115]:

grid_params = {
 'n_estimators': [ 500, 1000 ],
 "learning_rate"    : [ 0.02, 0.05 ] ,
 "max_depth"        : [ 3, 5 ],
 "min_child_weight" : [ 2, 4 ],
 "gamma"            : [ 0.05 ],
 'subsample' : [ 0.6, 0.8 ]}

In [116]:

gs = RandomizedSearchCV(
    xgb.XGBRegressor(),
    grid_params,
    verbose = 10,
    cv = 3,
    n_jobs = -1
    )

In [117]:

gs_results = gs.fit(X_train2, y_train2)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed: 12.4min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed: 14.3min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed: 21.8min
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed: 28.4min remaining:  3.2min
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 30.6min finished
[18:56:47] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

In [118]:

print("Best Parameter: {}".format(gs.best_params_))
Best Parameter: {'subsample': 0.8, 'n_estimators': 1000, 'min_child_weight': 4, 'max_depth': 5, 'learning_rate': 0.05, 'gamma': 0.05}

3) Ridge

In [119]:

#Validation function
n_folds = 5

def rmse_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(X_train2.values)
    rmse= np.sqrt(-cross_val_score(model, X_train2.values, y_train2, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

In [120]:

alphas = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 10, 15, 30]

for alpha in alphas: 
    ridge = make_pipeline(RobustScaler(), Ridge(alpha = alpha, random_state=1))
    score = rmse_cv(ridge)
    print("\n (alpha = {} ) Ridge score: {:.4f} ({:.4f})".format(alpha, score.mean(), score.std()))
 (alpha = 0.0001 ) Ridge score: 3.4377 (0.6355)

 (alpha = 0.0005 ) Ridge score: 3.4377 (0.6355)

 (alpha = 0.001 ) Ridge score: 3.4377 (0.6355)

 (alpha = 0.005 ) Ridge score: 3.4377 (0.6355)

 (alpha = 0.01 ) Ridge score: 3.4377 (0.6355)

 (alpha = 0.05 ) Ridge score: 3.4376 (0.6355)

 (alpha = 1 ) Ridge score: 3.4376 (0.6355)

 (alpha = 10 ) Ridge score: 3.4376 (0.6356)

 (alpha = 15 ) Ridge score: 3.4376 (0.6356)

 (alpha = 30 ) Ridge score: 3.4376 (0.6357)

4) LASSO Regression

In [121]:

alphas = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 10, 15, 30]

for alpha in alphas: 
    lasso = make_pipeline(RobustScaler(), Lasso(alpha = alpha, random_state=1))
    score = rmse_cv(lasso)
    print("\n (alpha = {} ) Lasso score: {:.4f} ({:.4f})".format(alpha, score.mean(), score.std()))
 (alpha = 0.0001 ) Lasso score: 3.4376 (0.6356)

 (alpha = 0.0005 ) Lasso score: 3.4375 (0.6358)

 (alpha = 0.001 ) Lasso score: 3.4375 (0.6361)

 (alpha = 0.005 ) Lasso score: 3.4375 (0.6382)

 (alpha = 0.01 ) Lasso score: 3.4381 (0.6406)

 (alpha = 0.05 ) Lasso score: 3.4498 (0.6494)

 (alpha = 1 ) Lasso score: 3.5948 (0.6997)

 (alpha = 10 ) Lasso score: 4.8555 (0.7025)

 (alpha = 15 ) Lasso score: 4.8555 (0.7025)

 (alpha = 30 ) Lasso score: 4.8555 (0.7025)

5) Elastic Net Regression

In [122]:

alphas = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 10, 15, 30]

for alpha in alphas: 
    ENet = make_pipeline(RobustScaler(), ElasticNet(alpha= alpha, l1_ratio=.7, random_state=3))
    score = rmse_cv(ENet)
    print("\n (alpha = {} ) Elastic Net score: {:.4f} ({:.4f})".format(alpha, score.mean(), score.std()))
 (alpha = 0.0001 ) Elastic Net score: 3.4376 (0.6356)

 (alpha = 0.0005 ) Elastic Net score: 3.4375 (0.6358)

 (alpha = 0.001 ) Elastic Net score: 3.4375 (0.6360)

 (alpha = 0.005 ) Elastic Net score: 3.4374 (0.6378)

 (alpha = 0.01 ) Elastic Net score: 3.4376 (0.6399)

 (alpha = 0.05 ) Elastic Net score: 3.4467 (0.6496)

 (alpha = 1 ) Elastic Net score: 3.5427 (0.6959)

 (alpha = 10 ) Elastic Net score: 4.8555 (0.7025)

 (alpha = 15 ) Elastic Net score: 4.8555 (0.7025)

 (alpha = 30 ) Elastic Net score: 4.8555 (0.7025)

2. Model Score

In [123]:

#Validation function
n_folds = 5

def rmse_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(X_train.values)
    rmse= np.sqrt(-cross_val_score(model, X_train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

lightGBM

In [124]:

lgbm = lgb.LGBMRegressor(learning_rate=0.05, min_child_samples=40, n_estimators=1000, num_leaves=5, subsample=0.6)

score = rmse_cv(lgbm)
print("lightGBM score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
lightGBM score: 2.6074 (0.1787)

XGBoost

In [126]:

# XGBoost 
XGBoost = xgb.XGBRegressor(gamma=0.05, learning_rate=0.05, max_depth=5, min_child_weight=4,
                           n_estimators=1000, subsample=0.8, verbose=True)

score = rmse_cv(XGBoost)
print("XGBoost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
[01:44:49] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[02:03:26] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[02:21:25] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[02:38:38] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[02:56:08] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
XGBoost score: 2.4738 (0.1866)

Random Forest (Dacon Score: 2.55016)

시간이 너무 오래 걸려서 K-Fold Cross Validation은 진행하지 않기로 함

In [134]:

RF = RandomForestRegressor(bootstrap=True, max_features='auto', n_estimators=500)

In [142]:

RF.fit(X_train.values, y_train)

Out[142]:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [145]:

RF_train_pred = RF.predict(X_train.values)
RF_pred = RF.predict(X_test.values)

In [146]:

sub = pd.read_csv('submission_sample.csv')
sub['18~20_ride'] = RF_pred
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # 승차인원이 (-)일 수는 없으므로, 0보다 작은 값은 0으로 채워준다 
sub.to_csv('submission.csv', index=False)

Ridge

In [127]:

ridge = make_pipeline(RobustScaler(), Ridge(alpha = 0.05, random_state=1))

score = rmse_cv(ridge)
print("\n Ridge score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
 (alpha = 30 ) Ridge score: 3.3025 (0.2512)

Lasso

In [128]:

lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.0005, random_state=1))

score = rmse_cv(lasso)
print("\n Lasso score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
 (alpha = 30 ) Lasso score: 3.3025 (0.2513)

Elastic Net

In [129]:

ENet = make_pipeline(RobustScaler(), ElasticNet(alpha= 0.005, l1_ratio=.7, random_state=3))
score = rmse_cv(ENet)
print("\n Elastic Net score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
 (alpha = 30 ) Elastic Net score: 3.3028 (0.2520)

3. Stacking models

1) Average Base Models: LGBM + XGBoost (Dacon Score: 2.5897)

In [130]:

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)

In [132]:

averaged_models = AveragingModels(models = (lgbm, XGBoost))

score = rmse_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
[12:10:07] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[12:27:44] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[12:45:27] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:02:38] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:19:55] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
 Averaged base models score: 2.5003 (0.1855)

In [135]:

averaged_models.fit(X_train.values, y_train)
averaged_train_pred = averaged_models.predict(X_train.values)
averaged_pred = averaged_models.predict(X_test.values)
[15:22:48] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

In [136]:

sub = pd.read_csv('submission_sample.csv')
sub['18~20_ride'] = averaged_pred
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # 승차인원이 (-)일 수는 없으므로, 0보다 작은 값은 0으로 채워준다 
sub.to_csv('submission.csv', index=False)

2. Average Models: Random Forest + (lightGBM + XGBoost)

In [148]:

first = pd.read_csv('submission_rf.csv')
second = pd.read_csv('submission_lgbm+xgb.csv')

In [149]:

target = '18~20_ride'

1) RF 0.34, LGBM 0.33, XGB 0.33 (Dacon Score: 2.54172)

In [150]:

w1 = 0.34
w2 = 0.66

In [151]:

W = w1*first[target] + w2*second[target]

In [153]:

sub = pd.read_csv('submission_sample.csv')
sub[target] = W
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # 승차인원이 (-)일 수는 없으므로, 0보다 작은 값은 0으로 채워준다 
sub.to_csv('submission_333.csv', index=False)

2) RF 0.5, LGBM 0.25, XGB 0.25 (Dacon Score: 2.53132)

In [154]:

w1 = 0.5
w2 = 0.5

In [155]:

W = w1*first[target] + w2*second[target]

In [156]:

sub = pd.read_csv('submission_sample.csv')
sub[target] = W
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # 승차인원이 (-)일 수는 없으므로, 0보다 작은 값은 0으로 채워준다 
sub.to_csv('submission_555.csv', index=False)

3) RF 0.7, LGBM 0.15, XGB 0.15 (Dacon Score: 2.52953)

In [157]:

w1 = 0.7
w2 = 0.3

In [158]:

W = w1*first[target] + w2*second[target]

In [159]:

sub = pd.read_csv('submission_sample.csv')
sub[target] = W
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # 승차인원이 (-)일 수는 없으므로, 0보다 작은 값은 0으로 채워준다 
sub.to_csv('submission_777.csv', index=False)

한계점...

  • 데이터를 조금 더 이해하고, 적절한 피쳐를 더 많이 만들어 봤으면 더 좋았을 것 같다 ex) 정류장 사이의 거리, 거리에 따른 이동 시간 등

  • 공휴일을 고려해 보지 못했다 (holiday 함수를 만들었는데 적용이 되지 않아 지쳐서 패스했는데, 다르게 적용해 볼 수 있는 방법을 나중에 깨달았다...)

  • 모델링 하고 적합 하는데에 시간이 너무 너무 너무 오래걸려서 다양한 시도를 진행해 보지 못했다 (Random Forest의 경우에도 CV 없이 1시간은 걸렸던 것 같고, XGBoost의 경우에는 colab에서도 20시간이 떴다...ㅎㅎㅎ) 나의 능력의 문제인지, 노트북 성능의 문제인지, 데이터의 크기가 커서 그런것인지 (41만개 * 36개) 아직 이유를 모르겠다

  • Random Forest의 경우 CV 진행하지 않았는데, 그 데이터로 스태킹 함수에 적용하는 방법을 알 수 없었다 (유나님이 앙상블 수업 때 올려주신 스태킹 코드로 시도해 봤는데, 오류나서 진행하지 않았다)

Last updated