Practical Machine Learning with Python (2)
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.rc('font', family='Malgun Gothic')
plt.rc('axes', unicode_minus=False)
import warnings
warnings.filterwarnings('ignore')
In [2]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import xgboost as xgb
import lightgbm as lgb
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
1. Data Description
Source: Dacon Competition No. 13, predicting Jeju evening rush-hour bus boardings
Evaluation metric: RMSE
In [3]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
bus = pd.read_csv("bus_bts.csv")
1. train.csv / test.csv
train: 2019-09-01 ~ 2019-09-30 / test: 2019-10-01 ~ 2019-10-16
id : unique ID within each dataset (no overlap between train and test)
date : date
bus_route_id : route ID
in_out : city bus vs. intercity bus
station_code : boarding/alighting stop ID
station_name : boarding/alighting stop name
latitude : latitude (can differ for stops with the same name depending on the direction of travel)
longitude : longitude (can differ for stops with the same name depending on the direction of travel)
h~h+1_ride : number of passengers boarding between hour h and h+1
h~h+1_takeoff : number of passengers alighting between hour h and h+1
18~20_ride : number of passengers boarding between 18:00 and 20:00 (target variable)
In [4]:
def display_data(data, num):
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # temporarily lift the display limits
        print('dataset shape is: {}'.format(data.shape))
        display(data.head(num).append(data.tail(num)))  # on pandas >= 2.0, DataFrame.append is gone; use pd.concat([data.head(num), data.tail(num)])
        print("Number of missing values:\n", data.isnull().sum(), "\n")
In [5]:
display_data(train, 5)
dataset shape is: (415423, 21)
Number of missing values:
id 0
date 0
bus_route_id 0
in_out 0
station_code 0
station_name 0
latitude 0
longitude 0
6~7_ride 0
7~8_ride 0
8~9_ride 0
9~10_ride 0
10~11_ride 0
11~12_ride 0
6~7_takeoff 0
7~8_takeoff 0
8~9_takeoff 0
9~10_takeoff 0
10~11_takeoff 0
11~12_takeoff 0
18~20_ride 0
dtype: int64
In [6]:
display_data(test, 5)
dataset shape is: (228170, 20)
Number of missing values:
id 0
date 0
bus_route_id 0
in_out 0
station_code 0
station_name 0
latitude 0
longitude 0
6~7_ride 0
7~8_ride 0
8~9_ride 0
9~10_ride 0
10~11_ride 0
11~12_ride 0
6~7_takeoff 0
7~8_takeoff 0
8~9_takeoff 0
9~10_takeoff 0
10~11_takeoff 0
11~12_takeoff 0
dtype: int64
In [8]:
figure, (ax1, ax2) = plt.subplots(ncols=2)
figure.set_size_inches(12,4)
sns.scatterplot( x = 'longitude', y = 'latitude', data = train, alpha = 0.1, ax=ax1)
sns.scatterplot( x = 'longitude', y = 'latitude', data = test, alpha = 0.1, ax=ax2)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x180c98fb128>
The point in the upper left looks like an outlier, but similar values also appear in the test data, so we decide not to remove it.
2. bus_bts.csv
2019-09-01 ~ 2019-10-16
user_card_id: passenger's bus card ID
bus_route_id: route ID
vhc_id: vehicle ID
geton_date: date the passenger boarded
geton_time: time the passenger boarded
geton_station_code: boarding stop ID
geton_station_name: boarding stop name
getoff_date: date the passenger alighted (NaN if the card was not tagged on alighting)
getoff_time: time the passenger alighted (NaN if not tagged)
getoff_station_code: alighting stop ID (NaN if not tagged)
getoff_station_name: alighting stop name (NaN if not tagged)
user_category: passenger type (01-adult, 02-child, 04-youth, 06-senior, 27-disabled, 28-disabled companion, 29-national merit, 30-national merit companion)
user_count: number of passengers paid for with this card (e.g. 3 means three fares were paid with a single card)
In [9]:
display_data(bus, 5)
dataset shape is: (2409414, 13)
Number of missing values:
user_card_id 0
bus_route_id 0
vhc_id 0
geton_date 0
geton_time 0
geton_station_code 0
geton_station_name 49
getoff_date 895736
getoff_time 895736
getoff_station_code 895736
getoff_station_name 895775
user_category 0
user_count 0
dtype: int64
The bus data covers both the train and test date ranges, so it has to be split by date (a date-based alternative to the index split is sketched below).
In [10]:
bus = bus.sort_values('geton_date').reset_index().drop(['index'], axis = 1)
In [11]:
display_data(bus.loc[bus["geton_date"] == "2019-10-01"], 1)
dataset shape is: (62682, 13)
Number of missing values:
user_card_id 0
bus_route_id 0
vhc_id 0
geton_date 0
geton_time 0
geton_station_code 0
geton_station_name 1
getoff_date 24861
getoff_time 24861
getoff_station_code 24861
getoff_station_name 24861
user_category 0
user_count 0
dtype: int64
In [12]:
bus_train = bus.loc[:1548758]   # rows up to the last 2019-09-30 record after sorting by geton_date
bus_test = bus.loc[1548759:]    # rows from 2019-10-01 onward
In [13]:
bus_train.shape, bus_test.shape
Out[13]:
((1548759, 13), (860655, 13))
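As a sketch, the same split can be derived from geton_date itself rather than a hard-coded row index, which avoids looking the boundary up by hand (assumes geton_date is still the raw 'YYYY-MM-DD' string at this point, so lexicographic comparison works):
# Sketch of a date-based split, equivalent in intent to the index split above
sept_mask = bus['geton_date'] < '2019-10-01'
bus_train = bus.loc[sept_mask].reset_index(drop=True)    # 2019-09-01 ~ 2019-09-30
bus_test = bus.loc[~sept_mask].reset_index(drop=True)    # 2019-10-01 ~ 2019-10-16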
3. target distribution
In [14]:
train['18~20_ride'].describe()
Out[14]:
count 415423.000000
mean 1.242095
std 4.722287
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 272.000000
Name: 18~20_ride, dtype: float64
In [15]:
figure, (ax1, ax2) = plt.subplots(ncols=2)
figure.set_size_inches(12,4)
sns.boxplot(train[["18~20_ride"]], ax=ax1)
sns.distplot(train[["18~20_ride"]],ax=ax2)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x180c9a3ccc0>
The target variable we need to predict is heavily concentrated at 0.
For a ridership prediction task like this, boosting models look more appropriate than linear regression, so we decide not to log-transform the target.
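As a quick check of how concentrated the target is (a sketch; this value was not recorded in the original run):
# Fraction of rows whose 18~20 boarding count is exactly zero
zero_ratio = (train['18~20_ride'] == 0).mean()
print('share of zero targets: {:.1%}'.format(zero_ratio))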
2. Data Preprocessing
1. Date
1) Convert to datetime
In [16]:
train['date'] = pd.to_datetime(train['date'])
test['date'] = pd.to_datetime(test['date'])
In [17]:
bus_train['geton_date'] = pd.to_datetime(bus_train['geton_date'])
bus_test['geton_date'] = pd.to_datetime(bus_test['geton_date'])
bus_train['getoff_date'] = pd.to_datetime(bus_train['getoff_date'])
bus_test['getoff_date'] = pd.to_datetime(bus_test['getoff_date'])
2) Create a date feature: day
In [18]:
train['day'] = pd.to_datetime(train['date']).dt.day
test['day'] = pd.to_datetime(test['date']).dt.day
3) Create a weekend feature: Mon-Fri 1, Sat-Sun 0
In [19]:
# datetime weekday: Monday 0 ~ Sunday 6
train['weekday'] = train['date'].dt.weekday
test['weekday'] = test['date'].dt.weekday
In [20]:
# weekend flag, intended as Mon-Fri 1, Sat-Sun 0
# note: x <= 5 also marks Saturday (weekday 5) as 1; x <= 4 would match the stated intent
train['weekend'] = train['weekday'].map(lambda x : 1 if x<=5 else 0)
test['weekend'] = test['weekday'].map(lambda x : 1 if x<=5 else 0)
In [25]:
figure, axes = plt.subplots(ncols=2)
figure.set_size_inches(12,5)
sns.countplot(train['weekend'], ax=axes[0])
sns.boxplot(x= train['weekend'], y=train["18~20_ride"], ax=axes[1])
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x180c999e4e0>
4) Create a holiday feature (did not take effect; a fix is sketched below)
In [21]:
def holiday(x):
    # this never fires: after pd.to_datetime, x is a Timestamp, so it is never equal to a plain 'YYYY-MM-DD' string
    if x in ['2019-09-12','2019-09-13','2019-10-03','2019-10-09']:
        return 1
    else:
        return 0
In [22]:
train['holiday'] = train['date'].apply(holiday)
test['holiday'] = test['date'].apply(holiday)
In [24]:
train['holiday'].value_counts()
Out[24]:
0 415423
Name: holiday, dtype: int64
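A sketch of one way to make the flag work (not run in the original notebook): compare formatted date strings instead of mixing Timestamps and strings. The four dates are the ones listed in the function above.
# Sketch: build the holiday flag from formatted date strings
holidays = ['2019-09-12', '2019-09-13', '2019-10-03', '2019-10-09']
train['holiday'] = train['date'].dt.strftime('%Y-%m-%d').isin(holidays).astype(int)
test['holiday'] = test['date'].dt.strftime('%Y-%m-%d').isin(holidays).astype(int)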
In [ ]:
def holiday9(x): # 2019-09-12, 2019-09-13
    if x in [12, 13]:
        return 1
    else:
        return 0
In [ ]:
def holiday10(x): # 2019-10-03, 2019-10-09
    if x in [3, 9]:
        return 1
    else:
        return 0
In [ ]:
train['holiday'] = train['day'].apply(holiday9)
test['holiday'] = test['day'].apply(holiday10)
2. Bus type
City / intercity bus: encode as a dummy variable
In [26]:
train['in_out'] = train['in_out'].map({'시내':0,'시외':1})
test['in_out'] = test['in_out'].map({'시내':0,'시외':1})
In [27]:
figure, axes = plt.subplots(ncols=2)
figure.set_size_inches(12,5)
sns.countplot(train['in_out'], ax=axes[0])
sns.boxplot(x= train['in_out'], y=train["18~20_ride"], ax=axes[1])
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x1808014abe0>
3. bus_route_id
The feature shared between train/test and the bus_bts data is bus_route_id (route ID). We build new features keyed on it and merge the bus data into train/test.
1) Check whether train/test and bus_bts have the same number of unique route IDs
In [28]:
df = pd.concat([train, test], sort=False)
len(df['bus_route_id'].unique())
Out[28]:
631
In [29]:
len(bus['bus_route_id'].unique())
Out[29]:
630
They differ by one, so we check which route IDs are missing from bus_bts.
In [30]:
no_train_route = []
for i in train['bus_route_id'].unique():
    if i not in bus_train['bus_route_id'].unique():
        no_train_route.append(i)
print(len(no_train_route), no_train_route)
1 [30950000]
In [31]:
no_test_route = []
for i in test['bus_route_id'].unique():
    if i not in bus_test['bus_route_id'].unique():
        no_test_route.append(i)
print(len(no_test_route), no_test_route)
1 [31120000]
In [33]:
display(test.loc[test['bus_route_id']==31120000])
Out[33]:
1 rows × 24 columns
2) Create new features from user_category
For each bus_route_id, count how many passengers of each type got on and off.
In [34]:
bus_cate_train = bus_train[['bus_route_id', 'geton_date', 'geton_station_code', 'user_category']]
bus_cate_train.head()
Out[34]:
In [35]:
bus_cate_train = pd.get_dummies(bus_train, columns=['user_category']) # create a dummy column for each user type (note: this uses the full bus_train, not the subset above, so ID columns such as user_card_id and vhc_id also get summed and merged below; they are dropped later)
In [36]:
bus_train_group = bus_cate_train.groupby(['bus_route_id']).sum().reset_index() # for each bus_route, sum how many passengers of each type boarded and alighted
In [37]:
bus_train_group.head()
Out[37]:
In [38]:
# apply the same to test
bus_cate_test = bus_test[['bus_route_id', 'geton_date', 'geton_station_code', 'user_category']]
In [39]:
bus_cate_test = pd.get_dummies(bus_test, columns=['user_category'])
In [40]:
bus_test_group = bus_cate_test.groupby(['bus_route_id']).sum().reset_index()
In [41]:
bus_test_group.head()
Out[41]:
In [42]:
bus_train_group.shape, bus_test_group.shape
Out[42]:
((612, 14), (600, 14))
In [43]:
train = pd.merge(train, bus_train_group, on = 'bus_route_id', how = 'left')
test = pd.merge(test, bus_test_group, on = 'bus_route_id', how='left')
In [44]:
train.shape, test.shape
Out[44]:
((415423, 38), (228170, 37))
In [45]:
display_data(train,5)
dataset shape is: (415423, 38)
Number of missing values:
id 0
date 0
bus_route_id 0
in_out 0
station_code 0
station_name 0
latitude 0
longitude 0
6~7_ride 0
7~8_ride 0
8~9_ride 0
9~10_ride 0
10~11_ride 0
11~12_ride 0
6~7_takeoff 0
7~8_takeoff 0
8~9_takeoff 0
9~10_takeoff 0
10~11_takeoff 0
11~12_takeoff 0
18~20_ride 0
day 0
weekday 0
weekend 0
holiday 0
user_card_id 2
vhc_id 2
geton_station_code 2
getoff_station_code 2
user_count 2
user_category_1 2
user_category_2 2
user_category_4 2
user_category_6 2
user_category_27 2
user_category_28 2
user_category_29 2
user_category_30 2
dtype: int64
In [74]:
# Missing values appear where a route ID in train/test does not exist in the bus data
# In those rows most ride/takeoff values are 0, so we simply fill with 0
train = train.fillna(0)
test = test.fillna(0)
4. bus_route_id / station_code / station_name
Create new features
In [47]:
print('bus_route_id unique : {}'.format(len(train['bus_route_id'].unique())))
print('station_code unique : {}'.format(len(train['station_code'].unique())))
print('station_name unique : {}'.format(len(train['station_name'].unique())))
bus_route_id unique : 613
station_code unique : 3563
station_name unique : 1961
1) bus_route_id
Group by bus_route_id (route ID), take the mean of 18~20_ride, and use it as a new feature.
In [48]:
train_bus_route = train[['18~20_ride','bus_route_id']].groupby('bus_route_id').mean().sort_values('18~20_ride').reset_index()
In [49]:
train = pd.merge(train, train_bus_route, on = 'bus_route_id', how = 'left')
test = pd.merge(test, train_bus_route, on = 'bus_route_id', how='left')
In [50]:
train.rename(columns = {'18~20_ride_x' : '18~20_ride', '18~20_ride_y' : 'bus_route_mean'}, inplace = True)
test.rename(columns = {'18~20_ride' : 'bus_route_mean'}, inplace = True)
In [51]:
# NaN values appear in test wherever a key does not exist in the train-derived table (the left join fills them with NaN).
# In those rows we fall back to the median of the train feature.
test['bus_route_mean'].fillna(train['bus_route_mean'].median(),inplace = True)
2) station_code
Group by station_code (stop ID), take the mean of 18~20_ride, and use it as a new feature.
In [52]:
train_st_code = train[['18~20_ride','station_code']].groupby('station_code').mean().sort_values('18~20_ride').reset_index()
In [53]:
train = pd.merge(train, train_st_code, on = 'station_code', how = 'left')
test = pd.merge(test, train_st_code, on = 'station_code', how='left')
In [54]:
train.rename(columns = {'18~20_ride_x' : '18~20_ride', '18~20_ride_y' : 'station_code_mean'}, inplace = True)
test.rename(columns = {'18~20_ride' : 'station_code_mean'}, inplace = True)
In [55]:
# As above, fall back to the median of the train feature.
test['station_code_mean'].fillna(train['station_code_mean'].median(),inplace = True)
3) station_name
Group by station_name (stop name), take the mean of 18~20_ride, and use it as a new feature. This will be similar to the station_code feature.
In [56]:
train_name_code = train[['18~20_ride','station_name']].groupby('station_name').mean().sort_values('18~20_ride').reset_index()
In [57]:
train = pd.merge(train, train_name_code, on = 'station_name', how = 'left')
test = pd.merge(test, train_name_code, on = 'station_name', how='left')
In [58]:
train.rename(columns = {'18~20_ride_x' : '18~20_ride', '18~20_ride_y' : 'station_name_mean'}, inplace = True)
test.rename(columns = {'18~20_ride' : 'station_name_mean'}, inplace = True)
In [59]:
# As above, fall back to the median of the train feature.
test['station_name_mean'].fillna(train['station_name_mean'].median(),inplace = True)
5. station_name
Stops expected to have high demand
1) Schools
In [62]:
g = df[df['station_name'].str.contains('고등학교')]   # '고등학교' = high school
highschool = list(g['station_name'].unique())
g = df[df['station_name'].str.contains('대학교')]     # '대학교' = university
university = list(g['station_name'].unique())
In [63]:
def school(x):
    if x in highschool:
        return 1
    elif x in university:
        return 1
    else:
        return 0
In [64]:
train['school'] = train['station_name'].apply(school)
test['school'] = test['station_name'].apply(school)
2) Airport, transfer point, terminal
In [65]:
g = df[df['station_name'].str.contains('환승')]     # '환승' = transfer
transfer = list(g['station_name'].unique())
g = df[df['station_name'].str.contains('공항')]     # '공항' = airport
airport = list(g['station_name'].unique())
g = df[df['station_name'].str.contains('터미널')]   # '터미널' = terminal
terminal = list(g['station_name'].unique())
In [66]:
def station(x):
    if x in transfer:
        return 1
    elif x in airport:
        return 1
    elif x in terminal:
        return 1
    else:
        return 0
In [67]:
train['station'] = train['station_name'].apply(station)
test['station'] = test['station_name'].apply(station)
In [75]:
display_data(train, 3)
dataset shape is: (415423, 43)
Number of missing values:
id 0
date 0
bus_route_id 0
in_out 0
station_code 0
station_name 0
latitude 0
longitude 0
6~7_ride 0
7~8_ride 0
8~9_ride 0
9~10_ride 0
10~11_ride 0
11~12_ride 0
6~7_takeoff 0
7~8_takeoff 0
8~9_takeoff 0
9~10_takeoff 0
10~11_takeoff 0
11~12_takeoff 0
18~20_ride 0
day 0
weekday 0
weekend 0
holiday 0
user_card_id 0
vhc_id 0
geton_station_code 0
getoff_station_code 0
user_count 0
user_category_1 0
user_category_2 0
user_category_4 0
user_category_6 0
user_category_27 0
user_category_28 0
user_category_29 0
user_category_30 0
bus_route_mean 0
station_code_mean 0
station_name_mean 0
school 0
station 0
dtype: int64
6. Final preprocessing
1) Categorical variables: label encoding
In [76]:
cols = ('bus_route_id','station_code', "geton_station_code", "getoff_station_code")
for col in cols:
    le = LabelEncoder()
    le.fit(list(train[col].values))
    train[col] = le.transform(list(train[col].values))
    le.fit(list(test[col].values))          # note: refitting on test can map the same ID to a different code than in train
    test[col] = le.transform(list(test[col].values))
print(train.shape, test.shape)
(415423, 43) (228170, 42)
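A sketch of a more consistent encoding (an alternative to the loop above, not something to run in addition to it): fitting each encoder on the union of train and test values keeps identical IDs on identical codes.
# Sketch: fit once on the combined values so train and test share one mapping
for col in cols:
    le = LabelEncoder()
    le.fit(pd.concat([train[col], test[col]]))
    train[col] = le.transform(train[col])
    test[col] = le.transform(test[col])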
2) Drop ID-type and unused features
In [80]:
dropfeatures = ["id", "date", "station_name", "user_card_id", "vhc_id",
"latitude", "longitude"]
In [81]:
train = train.drop(dropfeatures, axis=1)
test = test.drop(dropfeatures, axis=1)
In [82]:
display_data(train, 3)
dataset shape is: (415423, 36)
Number of missing values:
bus_route_id 0
in_out 0
station_code 0
6~7_ride 0
7~8_ride 0
8~9_ride 0
9~10_ride 0
10~11_ride 0
11~12_ride 0
6~7_takeoff 0
7~8_takeoff 0
8~9_takeoff 0
9~10_takeoff 0
10~11_takeoff 0
11~12_takeoff 0
18~20_ride 0
day 0
weekday 0
weekend 0
holiday 0
geton_station_code 0
getoff_station_code 0
user_count 0
user_category_1 0
user_category_2 0
user_category_4 0
user_category_6 0
user_category_27 0
user_category_28 0
user_category_29 0
user_category_30 0
bus_route_mean 0
station_code_mean 0
station_name_mean 0
school 0
station 0
dtype: int64
In [83]:
display_data(test, 3)
dataset shape is: (228170, 35)
Number of missing values:
bus_route_id 0
in_out 0
station_code 0
6~7_ride 0
7~8_ride 0
8~9_ride 0
9~10_ride 0
10~11_ride 0
11~12_ride 0
6~7_takeoff 0
7~8_takeoff 0
8~9_takeoff 0
9~10_takeoff 0
10~11_takeoff 0
11~12_takeoff 0
day 0
weekday 0
weekend 0
holiday 0
geton_station_code 0
getoff_station_code 0
user_count 0
user_category_1 0
user_category_2 0
user_category_4 0
user_category_6 0
user_category_27 0
user_category_28 0
user_category_29 0
user_category_30 0
bus_route_mean 0
station_code_mean 0
station_name_mean 0
school 0
station 0
dtype: int64
3. Modeling
0. LGBM (Dacon Score: 2.58261)
First, try fitting a boosting model as a baseline
In [84]:
X_train = train.drop(["18~20_ride"], axis=1)
X_test = test
y_train = train["18~20_ride"]
In [85]:
X_train.shape, X_test.shape, y_train.shape
Out[85]:
((415423, 35), (228170, 35), (415423,))
In [140]:
#Validation function
n_folds = 5
def rmse_cv(model):
kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(X_train.values)
rmse= np.sqrt(-cross_val_score(model, X_train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
return(rmse)
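For reference, a sketch of passing the KFold object itself so the shuffled folds take effect (an assumption about the intended behaviour; the scores below were computed with the version above):
# Sketch: pass the splitter object so shuffle/random_state are honoured
def rmse_cv_shuffled(model):
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    return np.sqrt(-cross_val_score(model, X_train.values, y_train,
                                    scoring="neg_mean_squared_error", cv=kf))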
In [92]:
grid_params = {
    'n_estimators': [200, 500, 1000],
    'learning_rate' : [0.025, 0.05],
    'num_leaves': [3, 4, 5],
    'min_child_samples' : [40, 60],
    'subsample' : [ 0.6, 0.8 ]
}
In [93]:
gs = GridSearchCV(
    lgb.LGBMRegressor(),
    grid_params,
    verbose = 10,
    cv = 3,
    n_jobs = -1
)
In [94]:
gs_results = gs.fit(X_train, y_train, verbose=True)
Fitting 3 folds for each of 72 candidates, totalling 216 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 45.5s
[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 1.1min
[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 1.9min
[Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 2.9min
[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 5.0min
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 8.0min
[Parallel(n_jobs=-1)]: Done 53 tasks | elapsed: 12.2min
[Parallel(n_jobs=-1)]: Done 64 tasks | elapsed: 13.3min
[Parallel(n_jobs=-1)]: Done 77 tasks | elapsed: 15.4min
[Parallel(n_jobs=-1)]: Done 90 tasks | elapsed: 18.3min
[Parallel(n_jobs=-1)]: Done 105 tasks | elapsed: 24.6min
[Parallel(n_jobs=-1)]: Done 120 tasks | elapsed: 26.7min
[Parallel(n_jobs=-1)]: Done 137 tasks | elapsed: 29.9min
[Parallel(n_jobs=-1)]: Done 154 tasks | elapsed: 36.6min
[Parallel(n_jobs=-1)]: Done 173 tasks | elapsed: 41.4min
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 44.4min
[Parallel(n_jobs=-1)]: Done 216 out of 216 | elapsed: 53.8min finished
In [95]:
print("Best Parameter: {}".format(gs.best_params_))
Best Parameter: {'learning_rate': 0.05, 'min_child_samples': 40, 'n_estimators': 1000, 'num_leaves': 5, 'subsample': 0.6}
In [97]:
model_lgb = lgb.LGBMRegressor(learning_rate=0.05, min_child_samples=40, n_estimators=1000, num_leaves=5, subsample=0.6)
In [98]:
# The model should have been evaluated on X_train only, but a train set that still contained the target was used, which is why this score looks so good...
# Rerunning it would take too long, so it is left as is; I will redo it if I get the chance.
score = rmse_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))
LGBM score: 0.4206 (0.1550)
In [100]:
model_lgb.fit(X_train.values, y_train)
lgb_train_pred = model_lgb.predict(X_train.values)
lgb_pred = model_lgb.predict(X_test.values)
In [102]:
sub = pd.read_csv('submission_sample.csv')
sub['18~20_ride'] = lgb_pred
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # boarding counts cannot be negative, so clip values below 0 to 0
sub.to_csv('submission.csv', index=False)
1. Grid Search
Grid search on the full data takes too long, so we randomly sample 20% of it to look for good hyperparameters.
We also check whether any model other than LGBM looks promising.
In [104]:
train2= train.sample(frac = 0.2, random_state=28)
test2 = test.sample(frac = 0.2, random_state=28)
train2.shape, test2.shape
Out[104]:
((83085, 36), (45634, 35))
In [105]:
X_train2 = train2.drop(["18~20_ride"], axis=1)
X_test2 = test2
y_train2 = train2["18~20_ride"]
In [106]:
X_train2.shape, X_test2.shape, y_train2.shape
Out[106]:
((83085, 35), (45634, 35), (83085,))
1) Gradient Boosting Regression
In [111]:
grid_params = {
    'loss': ['huber'], 'learning_rate': [0.02, 0.05], 'n_estimators': [500, 1000],
    'max_depth':[3, 5], 'min_samples_leaf':[2, 3], 'subsample' : [0.6, 0.8]
}
In [112]:
gs = RandomizedSearchCV(
    GradientBoostingRegressor(),
    grid_params,
    verbose = 10,
    cv = 3,
    n_jobs = -1
)
In [113]:
gs_results = gs.fit(X_train2, y_train2)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 9.7min
[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 20.7min
[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 31.3min
[Parallel(n_jobs=-1)]: Done 27 out of 30 | elapsed: 43.0min remaining: 4.8min
[Parallel(n_jobs=-1)]: Done 30 out of 30 | elapsed: 45.2min finished
In [114]:
print("Best Parameter: {}".format(gs.best_params_))
Best Parameter: {'subsample': 0.8, 'n_estimators': 1000, 'min_samples_leaf': 3, 'max_depth': 5, 'loss': 'huber', 'learning_rate': 0.05}
2) XGBoost
In [115]:
grid_params = {
    'n_estimators': [ 500, 1000 ],
    "learning_rate" : [ 0.02, 0.05 ] ,
    "max_depth" : [ 3, 5 ],
    "min_child_weight" : [ 2, 4 ],
    "gamma" : [ 0.05 ],
    'subsample' : [ 0.6, 0.8 ]}
In [116]:
gs = RandomizedSearchCV(
    xgb.XGBRegressor(),
    grid_params,
    verbose = 10,
    cv = 3,
    n_jobs = -1
)
In [117]:
gs_results = gs.fit(X_train2, y_train2)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 12.4min
[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 14.3min
[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 21.8min
[Parallel(n_jobs=-1)]: Done 27 out of 30 | elapsed: 28.4min remaining: 3.2min
[Parallel(n_jobs=-1)]: Done 30 out of 30 | elapsed: 30.6min finished
[18:56:47] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
In [118]:
print("Best Parameter: {}".format(gs.best_params_))
Best Parameter: {'subsample': 0.8, 'n_estimators': 1000, 'min_child_weight': 4, 'max_depth': 5, 'learning_rate': 0.05, 'gamma': 0.05}
3) Ridge
In [119]:
#Validation function
n_folds = 5
def rmse_cv(model):
    # same caveat as before: get_n_splits() returns an integer, so the folds are not actually shuffled
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(X_train2.values)
    rmse= np.sqrt(-cross_val_score(model, X_train2.values, y_train2, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)
In [120]:
alphas = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 10, 15, 30]
for alpha in alphas:
    ridge = make_pipeline(RobustScaler(), Ridge(alpha = alpha, random_state=1))
    score = rmse_cv(ridge)
    print("\n (alpha = {} ) Ridge score: {:.4f} ({:.4f})".format(alpha, score.mean(), score.std()))
(alpha = 0.0001 ) Ridge score: 3.4377 (0.6355)
(alpha = 0.0005 ) Ridge score: 3.4377 (0.6355)
(alpha = 0.001 ) Ridge score: 3.4377 (0.6355)
(alpha = 0.005 ) Ridge score: 3.4377 (0.6355)
(alpha = 0.01 ) Ridge score: 3.4377 (0.6355)
(alpha = 0.05 ) Ridge score: 3.4376 (0.6355)
(alpha = 1 ) Ridge score: 3.4376 (0.6355)
(alpha = 10 ) Ridge score: 3.4376 (0.6356)
(alpha = 15 ) Ridge score: 3.4376 (0.6356)
(alpha = 30 ) Ridge score: 3.4376 (0.6357)
4) LASSO Regression
In [121]:
alphas = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 10, 15, 30]
for alpha in alphas:
    lasso = make_pipeline(RobustScaler(), Lasso(alpha = alpha, random_state=1))
    score = rmse_cv(lasso)
    print("\n (alpha = {} ) Lasso score: {:.4f} ({:.4f})".format(alpha, score.mean(), score.std()))
(alpha = 0.0001 ) Lasso score: 3.4376 (0.6356)
(alpha = 0.0005 ) Lasso score: 3.4375 (0.6358)
(alpha = 0.001 ) Lasso score: 3.4375 (0.6361)
(alpha = 0.005 ) Lasso score: 3.4375 (0.6382)
(alpha = 0.01 ) Lasso score: 3.4381 (0.6406)
(alpha = 0.05 ) Lasso score: 3.4498 (0.6494)
(alpha = 1 ) Lasso score: 3.5948 (0.6997)
(alpha = 10 ) Lasso score: 4.8555 (0.7025)
(alpha = 15 ) Lasso score: 4.8555 (0.7025)
(alpha = 30 ) Lasso score: 4.8555 (0.7025)
5) Elastic Net Regression
In [122]:
alphas = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 10, 15, 30]
for alpha in alphas:
    ENet = make_pipeline(RobustScaler(), ElasticNet(alpha= alpha, l1_ratio=.7, random_state=3))
    score = rmse_cv(ENet)
    print("\n (alpha = {} ) Elastic Net score: {:.4f} ({:.4f})".format(alpha, score.mean(), score.std()))
(alpha = 0.0001 ) Elastic Net score: 3.4376 (0.6356)
(alpha = 0.0005 ) Elastic Net score: 3.4375 (0.6358)
(alpha = 0.001 ) Elastic Net score: 3.4375 (0.6360)
(alpha = 0.005 ) Elastic Net score: 3.4374 (0.6378)
(alpha = 0.01 ) Elastic Net score: 3.4376 (0.6399)
(alpha = 0.05 ) Elastic Net score: 3.4467 (0.6496)
(alpha = 1 ) Elastic Net score: 3.5427 (0.6959)
(alpha = 10 ) Elastic Net score: 4.8555 (0.7025)
(alpha = 15 ) Elastic Net score: 4.8555 (0.7025)
(alpha = 30 ) Elastic Net score: 4.8555 (0.7025)
2. Model Score
In [123]:
#Validation function
n_folds = 5
def rmse_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(X_train.values)
    rmse= np.sqrt(-cross_val_score(model, X_train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)
lightGBM
In [124]:
lgbm = lgb.LGBMRegressor(learning_rate=0.05, min_child_samples=40, n_estimators=1000, num_leaves=5, subsample=0.6)
score = rmse_cv(lgbm)
print("lightGBM score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
lightGBM score: 2.6074 (0.1787)
XGBoost
In [126]:
# XGBoost
XGBoost = xgb.XGBRegressor(gamma=0.05, learning_rate=0.05, max_depth=5, min_child_weight=4,
n_estimators=1000, subsample=0.8, verbose=True)
score = rmse_cv(XGBoost)
print("XGBoost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
[01:44:49] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[02:03:26] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[02:21:25] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[02:38:38] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[02:56:08] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
XGBoost score: 2.4738 (0.1866)
Random Forest (Dacon Score: 2.55016)
K-fold cross-validation takes too long here, so it is skipped (a cheaper hold-out check is sketched after the submission below)
In [134]:
RF = RandomForestRegressor(bootstrap=True, max_features='auto', n_estimators=500)
In [142]:
RF.fit(X_train.values, y_train)
Out[142]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
oob_score=False, random_state=None, verbose=0, warm_start=False)
In [145]:
RF_train_pred = RF.predict(X_train.values)
RF_pred = RF.predict(X_test.values)
In [146]:
sub = pd.read_csv('submission_sample.csv')
sub['18~20_ride'] = RF_pred
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # boarding counts cannot be negative, so clip values below 0 to 0
sub.to_csv('submission.csv', index=False)
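Since K-fold CV was skipped for the random forest, a single hold-out split would still give a rough RMSE estimate at a fraction of the cost (a sketch; the split size and seed are assumptions, and this score was not computed in the original run):
# Sketch: quick hold-out validation instead of K-fold CV
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
RF_holdout = RandomForestRegressor(n_estimators=500, n_jobs=-1)
RF_holdout.fit(X_tr, y_tr)
val_rmse = np.sqrt(mean_squared_error(y_val, RF_holdout.predict(X_val)))
print("Random Forest hold-out RMSE: {:.4f}".format(val_rmse))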
Ridge
In [127]:
ridge = make_pipeline(RobustScaler(), Ridge(alpha = 0.05, random_state=1))
score = rmse_cv(ridge)
print("\n Ridge score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
(alpha = 30 ) Ridge score: 3.3025 (0.2512)
Lasso
In [128]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.0005, random_state=1))
score = rmse_cv(lasso)
print("\n Lasso score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
(alpha = 30 ) Lasso score: 3.3025 (0.2513)
Elastic Net
In [129]:
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha= 0.005, l1_ratio=.7, random_state=3))
score = rmse_cv(ENet)
print("\n Elastic Net score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
(alpha = 30 ) Elastic Net score: 3.3028 (0.2520)
3. Stacking models
1) Average Base Models: LGBM + XGBoost (Dacon Score: 2.5897)
In [130]:
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self

    # Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
In [132]:
averaged_models = AveragingModels(models = (lgbm, XGBoost))
score = rmse_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
[12:10:07] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[12:27:44] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[12:45:27] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:02:38] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[13:19:55] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Averaged base models score: 2.5003 (0.1855)
In [135]:
averaged_models.fit(X_train.values, y_train)
averaged_train_pred = averaged_models.predict(X_train.values)
averaged_pred = averaged_models.predict(X_test.values)
[15:22:48] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
In [136]:
sub = pd.read_csv('submission_sample.csv')
sub['18~20_ride'] = averaged_pred
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # boarding counts cannot be negative, so clip values below 0 to 0
sub.to_csv('submission.csv', index=False)
2) Average Models: Random Forest + (lightGBM + XGBoost)
In [148]:
first = pd.read_csv('submission_rf.csv')
second = pd.read_csv('submission_lgbm+xgb.csv')
In [149]:
target = '18~20_ride'
1) RF 0.34, LGBM 0.33, XGB 0.33 (Dacon Score: 2.54172)
In [150]:
w1 = 0.34
w2 = 0.66
In [151]:
W = w1*first[target] + w2*second[target]
In [153]:
sub = pd.read_csv('submission_sample.csv')
sub[target] = W
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # boarding counts cannot be negative, so clip values below 0 to 0
sub.to_csv('submission_333.csv', index=False)
2) RF 0.5, LGBM 0.25, XGB 0.25 (Dacon Score: 2.53132)
In [154]:
w1 = 0.5
w2 = 0.5
In [155]:
W = w1*first[target] + w2*second[target]
In [156]:
sub = pd.read_csv('submission_sample.csv')
sub[target] = W
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # boarding counts cannot be negative, so clip values below 0 to 0
sub.to_csv('submission_555.csv', index=False)
3) RF 0.7, LGBM 0.15, XGB 0.15 (Dacon Score: 2.52953)
In [157]:
w1 = 0.7
w2 = 0.3
In [158]:
W = w1*first[target] + w2*second[target]
In [159]:
sub = pd.read_csv('submission_sample.csv')
sub[target] = W
sub.loc[sub['18~20_ride']<0, '18~20_ride'] = 0 # boarding counts cannot be negative, so clip values below 0 to 0
sub.to_csv('submission_777.csv', index=False)
Limitations...
With a better understanding of the data, more useful features could have been engineered, e.g. distance between stops or travel time as a function of distance.
Public holidays were never accounted for (the holiday function was written but did not take effect, so it was dropped; a different way to apply it only became clear later...).
Modeling and fitting took far too long, so not many experiments were possible (Random Forest took about an hour even without CV, and XGBoost showed an estimated 20 hours even on Colab). It is not yet clear whether this is down to my approach, the laptop's performance, or simply the size of the data (about 415k rows x 36 columns).
Random Forest was trained without CV, and I could not work out how to feed it into the stacking function (I tried the stacking code Yuna shared in the ensemble class, but it raised errors, so I did not pursue it). A rough alternative is sketched below.
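As one possible direction (a sketch, assuming scikit-learn >= 0.22, which provides StackingRegressor; this was not run and may be just as slow as the experiments above):
from sklearn.ensemble import StackingRegressor

# Sketch: stack RF, LGBM and XGBoost with a Ridge meta-model fitted on out-of-fold predictions
stack = StackingRegressor(
    estimators=[('rf', RF), ('lgbm', lgbm), ('xgb', XGBoost)],
    final_estimator=Ridge(alpha=1.0),
    cv=3, n_jobs=-1
)
stack.fit(X_train.values, y_train)
stack_pred = stack.predict(X_test.values)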