Datathon-2: Cancer Death Rate Prediction¶
Task 1¶
The first task consists of importing the libraries that will be used to load, visualize, prepare, and model the data.
Importing Libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Loading the data and displaying the first 5 rows.¶
cancer_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/cancer_death_rate/Training_set_label.csv")
cancer_data.head()
avgAnnCount | avgDeathsPerYear | incidenceRate | medIncome | popEst2015 | povertyPercent | studyPerCap | binnedInc | MedianAge | MedianAgeMale | MedianAgeFemale | Geography | AvgHouseholdSize | PercentMarried | PctNoHS18_24 | PctHS18_24 | PctSomeCol18_24 | PctBachDeg18_24 | PctHS25_Over | PctBachDeg25_Over | PctEmployed16_Over | PctUnemployed16_Over | PctPrivateCoverage | PctPrivateCoverageAlone | PctEmpPrivCoverage | PctPublicCoverage | PctPublicCoverageAlone | PctWhite | PctBlack | PctAsian | PctOtherRace | PctMarriedHouseholds | BirthRate | TARGET_deathRate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19.0 | 8 | 481.5 | 50038 | 2704 | 11.1 | 0.0 | (48021.6, 51046.4] | 48.4 | 49.6 | 46.4 | Hettinger County, North Dakota | 2.25 | 65.9 | 10.8 | 25.0 | 57.4 | 6.8 | 37.1 | 12.2 | 57.4 | 1.1 | 81.2 | 56.0 | 35.7 | 34.7 | 9.9 | 96.032049 | 0.724914 | 0.000000 | 0.000000 | 62.511457 | 15.157116 | 160.3 |
1 | 88.0 | 34 | 486.0 | 59399 | 14844 | 9.7 | 0.0 | (54545.6, 61494.5] | 41.9 | 41.3 | 43.2 | Mills County, Iowa | 2.63 | 58.6 | 22.3 | 29.1 | NaN | 1.1 | 35.9 | 16.0 | 60.4 | 3.8 | 76.7 | NaN | 50.8 | 32.1 | 12.8 | 97.537344 | 0.719957 | 0.080743 | 0.040371 | 61.641045 | 3.293510 | 194.9 |
2 | 195.0 | 83 | 475.7 | 39721 | 25164 | 18.5 | 0.0 | (37413.8, 40362.7] | 48.9 | 47.9 | 49.9 | Gladwin County, Michigan | 2.30 | 57.2 | 24.9 | 36.2 | NaN | 3.5 | 40.2 | 7.6 | 41.2 | 11.0 | 61.6 | NaN | 32.1 | 49.8 | 21.6 | 97.576566 | 0.360770 | 0.411749 | 0.082350 | 53.978102 | 6.390328 | 196.5 |
3 | 116.0 | 55 | 496.6 | 30299 | 17917 | 28.1 | 0.0 | [22640, 34218.1] | 44.2 | 42.7 | 45.2 | Fentress County, Tennessee | 2.43 | 53.0 | 10.9 | 51.8 | NaN | 5.3 | 44.2 | 7.0 | 41.6 | 10.4 | 45.2 | NaN | 24.2 | 53.2 | 33.0 | 97.908650 | 0.161731 | 0.306731 | 0.340193 | 51.013143 | 5.124836 | 230.9 |
4 | 80.0 | 35 | 372.0 | 39625 | 14058 | 17.4 | 0.0 | (37413.8, 40362.7] | 45.0 | 42.2 | 48.0 | Las Animas County, Colorado | 2.36 | 52.2 | 12.6 | 31.2 | NaN | 0.2 | 28.3 | 10.5 | 49.3 | 9.2 | 56.6 | NaN | 33.4 | 43.0 | 24.7 | 82.672551 | 1.834103 | 0.682617 | 8.253465 | 50.566426 | 3.897033 | 162.2 |
Exploratory Data Analysis¶
The info method can be used as a first step in the Exploratory Data Analysis (EDA) to get a first glance at the dataset.
cancer_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3051 entries, 0 to 3050
Data columns (total 34 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   avgAnnCount              3051 non-null   float64
 1   avgDeathsPerYear         3051 non-null   int64
 2   incidenceRate            3051 non-null   float64
 3   medIncome                3051 non-null   int64
 4   popEst2015               3051 non-null   int64
 5   povertyPercent           3051 non-null   float64
 6   studyPerCap              3051 non-null   float64
 7   binnedInc                3051 non-null   object
 8   MedianAge                3051 non-null   float64
 9   MedianAgeMale            3051 non-null   float64
 10  MedianAgeFemale          3051 non-null   float64
 11  Geography                3051 non-null   object
 12  AvgHouseholdSize         3051 non-null   float64
 13  PercentMarried           3051 non-null   float64
 14  PctNoHS18_24             3051 non-null   float64
 15  PctHS18_24               3051 non-null   float64
 16  PctSomeCol18_24          785 non-null    float64
 17  PctBachDeg18_24          3051 non-null   float64
 18  PctHS25_Over             3051 non-null   float64
 19  PctBachDeg25_Over        3051 non-null   float64
 20  PctEmployed16_Over       2899 non-null   float64
 21  PctUnemployed16_Over     3051 non-null   float64
 22  PctPrivateCoverage       3051 non-null   float64
 23  PctPrivateCoverageAlone  2447 non-null   float64
 24  PctEmpPrivCoverage       3051 non-null   float64
 25  PctPublicCoverage        3051 non-null   float64
 26  PctPublicCoverageAlone   3051 non-null   float64
 27  PctWhite                 3051 non-null   float64
 28  PctBlack                 3051 non-null   float64
 29  PctAsian                 3051 non-null   float64
 30  PctOtherRace             3051 non-null   float64
 31  PctMarriedHouseholds     3051 non-null   float64
 32  BirthRate                3051 non-null   float64
 33  TARGET_deathRate         3051 non-null   float64
dtypes: float64(29), int64(3), object(2)
memory usage: 810.5+ KB
It is possible to see that there are 34 columns in total (33 input features plus the target), out of which only two contain non-numerical values. The describe method can be used to view the main summary statistics for the numerical columns.
cancer_data.describe()
avgAnnCount | avgDeathsPerYear | incidenceRate | medIncome | popEst2015 | povertyPercent | studyPerCap | MedianAge | MedianAgeMale | MedianAgeFemale | AvgHouseholdSize | PercentMarried | PctNoHS18_24 | PctHS18_24 | PctSomeCol18_24 | PctBachDeg18_24 | PctHS25_Over | PctBachDeg25_Over | PctEmployed16_Over | PctUnemployed16_Over | PctPrivateCoverage | PctPrivateCoverageAlone | PctEmpPrivCoverage | PctPublicCoverage | PctPublicCoverageAlone | PctWhite | PctBlack | PctAsian | PctOtherRace | PctMarriedHouseholds | BirthRate | TARGET_deathRate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3.051000e+03 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 785.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 2899.000000 | 3051.000000 | 3051.000000 | 2447.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 | 3051.000000 |
mean | 570.668154 | 176.000983 | 449.007042 | 46902.917076 | 9.495799e+04 | 16.880367 | 158.695936 | 44.303540 | 39.600885 | 42.234579 | 2.473048 | 51.885480 | 18.225139 | 34.920190 | 41.247898 | 6.131957 | 34.909440 | 13.212750 | 54.115626 | 7.841069 | 64.519338 | 48.573314 | 41.342347 | 36.287545 | 19.198820 | 84.005243 | 9.166570 | 1.198561 | 1.864829 | 51.355837 | 5.608851 | 178.909767 |
std | 1250.546532 | 445.042777 | 52.886386 | 11902.460659 | 2.761007e+05 | 6.340462 | 544.035590 | 38.704107 | 5.177629 | 5.253474 | 0.429926 | 6.812846 | 8.105799 | 9.021475 | 11.107006 | 4.552222 | 7.015728 | 5.360342 | 8.267656 | 3.454863 | 10.511932 | 10.011218 | 9.327793 | 7.748442 | 6.023164 | 16.126982 | 14.676772 | 2.369931 | 3.235204 | 6.524964 | 1.955201 | 27.570075 |
min | 6.000000 | 3.000000 | 211.100000 | 22640.000000 | 8.270000e+02 | 3.200000 | 0.000000 | 22.300000 | 22.400000 | 22.300000 | 0.022100 | 25.100000 | 0.000000 | 0.000000 | 7.100000 | 0.000000 | 7.500000 | 3.200000 | 17.600000 | 0.400000 | 23.400000 | 16.800000 | 14.300000 | 11.800000 | 2.600000 | 11.008762 | 0.000000 | 0.000000 | 0.000000 | 23.915652 | 0.000000 | 66.300000 |
25% | 80.000000 | 29.000000 | 421.800000 | 38752.000000 | 1.236850e+04 | 12.200000 | 0.000000 | 37.900000 | 36.400000 | 39.200000 | 2.370000 | 47.800000 | 12.800000 | 29.300000 | 34.000000 | 3.100000 | 30.650000 | 9.300000 | 48.600000 | 5.500000 | 57.500000 | 41.300000 | 34.700000 | 31.000000 | 14.900000 | 78.012571 | 0.616576 | 0.261748 | 0.282825 | 47.736828 | 4.499936 | 161.400000 |
50% | 171.000000 | 62.000000 | 453.549422 | 45098.000000 | 2.677700e+04 | 15.900000 | 0.000000 | 41.000000 | 39.500000 | 42.400000 | 2.500000 | 52.500000 | 17.200000 | 34.700000 | 41.000000 | 5.300000 | 35.400000 | 12.300000 | 54.400000 | 7.600000 | 65.300000 | 48.700000 | 41.300000 | 36.300000 | 18.800000 | 90.318790 | 2.276756 | 0.557031 | 0.791571 | 51.757925 | 5.384471 | 178.300000 |
75% | 508.000000 | 148.000000 | 481.300000 | 52410.500000 | 6.853600e+04 | 20.400000 | 86.581336 | 43.900000 | 42.500000 | 45.300000 | 2.630000 | 56.500000 | 22.600000 | 40.700000 | 46.900000 | 8.100000 | 39.700000 | 16.100000 | 60.300000 | 9.700000 | 72.200000 | 55.700000 | 47.700000 | 41.400000 | 23.000000 | 95.577396 | 10.326954 | 1.189955 | 2.080241 | 55.465803 | 6.473896 | 195.500000 |
max | 24965.000000 | 9445.000000 | 1206.900000 | 125635.000000 | 5.238216e+06 | 47.000000 | 9762.308998 | 525.600000 | 64.700000 | 65.700000 | 3.930000 | 72.500000 | 64.100000 | 72.500000 | 79.000000 | 51.800000 | 54.800000 | 40.400000 | 76.500000 | 29.400000 | 92.300000 | 78.900000 | 70.700000 | 65.100000 | 46.600000 | 100.000000 | 84.866024 | 35.640183 | 38.743747 | 71.703057 | 21.326165 | 362.800000 |
The target feature TARGET_deathRate can be visualized with a boxplot, in order to identify if there are any outliers in the dataset.
sns.boxplot(data=cancer_data, x='TARGET_deathRate')
<matplotlib.axes._subplots.AxesSubplot at 0x7f5a7127ee48>
It seems that there is an extreme outlier beyond the right whisker, as well as several milder outliers on both sides. The best option to avoid overfitting would be to remove the most extreme outliers from the dataset, while still keeping the outliers that are not significantly far from the whisker ends. This process will be carried out in the data preparation section below.
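As a rough, optional check, the number of points beyond the usual 1.5 × IQR whiskers can be compared with the wider 2.5 × IQR cutoff applied later in the data preparation section. The snippet below is only a sketch and is not part of the original pipeline.

# Optional sketch: compare how many TARGET_deathRate values fall beyond the
# standard 1.5*IQR whiskers versus the wider 2.5*IQR cutoff used later on.
q1 = cancer_data['TARGET_deathRate'].quantile(0.25)
q3 = cancer_data['TARGET_deathRate'].quantile(0.75)
iqr = q3 - q1
beyond_15 = ((cancer_data['TARGET_deathRate'] < q1 - 1.5*iqr) | (cancer_data['TARGET_deathRate'] > q3 + 1.5*iqr)).sum()
beyond_25 = ((cancer_data['TARGET_deathRate'] < q1 - 2.5*iqr) | (cancer_data['TARGET_deathRate'] > q3 + 2.5*iqr)).sum()
print('Beyond 1.5*IQR:', beyond_15, '- beyond 2.5*IQR:', beyond_25)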
The next step in the EDA will be to handle the missing values.
cancer_data.isnull().sum().sort_values(ascending=False).head(10)
PctSomeCol18_24            2266
PctPrivateCoverageAlone     604
PctEmployed16_Over          152
PctHS18_24                    0
PercentMarried                0
AvgHouseholdSize              0
Geography                     0
MedianAgeFemale               0
MedianAgeMale                 0
MedianAge                     0
dtype: int64
columns = cancer_data.columns
We will collect the three columns that have missing values in a single list, in order to visualize whether or not they are correlated with the target variable. If there were no relationship between these features and the target variable, it would be possible to safely drop the columns entirely instead of imputing their values.
null_columns = [column for column in columns if cancer_data.isnull().sum()[column] != 0]
sns.heatmap(cancer_data[[*null_columns, 'TARGET_deathRate']].corr(), annot=True);
cancer_data.corr()['TARGET_deathRate'].sort_values()
PctBachDeg25_Over         -0.480939
medIncome                 -0.433311
PctEmployed16_Over        -0.414695
PctPrivateCoverage        -0.382859
PctPrivateCoverageAlone   -0.364386
PctMarriedHouseholds      -0.298414
PctBachDeg18_24           -0.292220
PercentMarried            -0.262946
PctEmpPrivCoverage        -0.257117
PctSomeCol18_24           -0.206337
PctAsian                  -0.202352
PctOtherRace              -0.189210
PctWhite                  -0.173265
avgAnnCount               -0.130745
popEst2015                -0.111470
BirthRate                 -0.088322
avgDeathsPerYear          -0.074008
AvgHouseholdSize          -0.030288
studyPerCap               -0.023890
MedianAgeMale             -0.002006
MedianAge                  0.002772
MedianAgeFemale            0.034693
PctNoHS18_24               0.075815
PctBlack                   0.250954
PctHS18_24                 0.284328
PctUnemployed16_Over       0.379085
PctHS25_Over               0.403449
PctPublicCoverage          0.422291
povertyPercent             0.427118
PctPublicCoverageAlone     0.456804
incidenceRate              0.467683
TARGET_deathRate           1.000000
Name: TARGET_deathRate, dtype: float64
Task 2¶
Data Preparation¶
cancer_data.isnull().sum().sort_values(ascending=False).head()
PctSomeCol18_24            2266
PctPrivateCoverageAlone     604
PctEmployed16_Over          152
PctHS18_24                    0
PercentMarried                0
dtype: int64
There is a significant number of missing values in the PctSomeCol18_24 column; therefore, it is better to drop the column. For the other two columns, the mean of each column will be used to fill the missing values.
cancer_data.drop(columns='PctSomeCol18_24', inplace=True)
cancer_data.fillna(value=cancer_data.mean(), inplace=True)
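A quick, optional check can confirm that no missing values remain after dropping the column and filling with the mean.

# Optional check: the total count of remaining missing values should be zero.
cancer_data.isnull().sum().sum()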
As mentioned previously, there are only two non-numerical columns. One of them corresponds to the binned income values for each county, which show a lower and an upper bound for the income.
cancer_data.binnedInc
0       (48021.6, 51046.4]
1       (54545.6, 61494.5]
2       (37413.8, 40362.7]
3         [22640, 34218.1]
4       (37413.8, 40362.7]
               ...
3046    (34218.1, 37413.8]
3047    (37413.8, 40362.7]
3048    (51046.4, 54545.6]
3049    (40362.7, 42724.4]
3050      [22640, 34218.1]
Name: binnedInc, Length: 3051, dtype: object
It is not clear whether these values are significant, which makes it necessary to extract them and incorporate them into the main DataFrame.
The process used to extract valuable information from the binnedInc column consists of stripping the parentheses and square brackets, splitting the two values, and calculating the mean of the two boundaries. Finally, the mean value for each row in the column is added to the middle_income list, which is then incorporated into the cancer_data DataFrame.
income_limits = cancer_data['binnedInc'].str.strip('([]')
middle_income = []
for interval in income_limits.str.split(','):
lower, upper = float(interval[0]), float(interval[1])
middle_income.append(np.mean(np.array([lower, upper])))
cancer_data['middle_income'] = middle_income
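For reference, the same middle-income values could also be obtained without an explicit loop; the following is just an equivalent, vectorized sketch of the step above.

# Equivalent vectorized version: strip the brackets, split on the comma,
# convert both bounds to float, and average them row-wise.
bounds = cancer_data['binnedInc'].str.strip('([]').str.split(',', expand=True).astype(float)
cancer_data['middle_income'] = bounds.mean(axis=1)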
Now the new middle_income column can be compared to the existing medIncome column from the original dataset.
cancer_data[['medIncome', 'middle_income']]
medIncome | middle_income | |
---|---|---|
0 | 50038 | 49534.00 |
1 | 59399 | 58020.05 |
2 | 39721 | 38888.25 |
3 | 30299 | 28429.05 |
4 | 39625 | 38888.25 |
... | ... | ... |
3046 | 34597 | 35815.95 |
3047 | 40002 | 38888.25 |
3048 | 51923 | 52796.00 |
3049 | 40788 | 41543.55 |
3050 | 29415 | 28429.05 |
3051 rows × 2 columns
Given that it does not add any valuable information that is not already represented by the medIncome column, it is a good idea to drop the middle_income column, together with the original binnedInc column, to reduce the number of features in the dataset.
cancer_data.drop(columns=['middle_income', 'binnedInc'], inplace=True)
The other non-numerical column is the Geography column. In this case, the column is not very useful, given that each location should only appear once. If the same location appears more than once, it is an indication of a duplicate record.
cancer_data['Geography'].value_counts()
Worth County, Missouri         2
Woodford County, Kentucky      2
Marion County, Texas           2
Boone County, Missouri         2
Adams County, North Dakota     2
                              ..
Ray County, Missouri           1
Monroe County, Illinois        1
Johnson County, Tennessee      1
Ohio County, West Virginia     1
Carroll County, Mississippi    1
Name: Geography, Length: 2285, dtype: int64
cancer_data.duplicated().value_counts()
False 2285 True 766 dtype: int64
cancer_data.drop_duplicates(inplace=True)
cancer_data['Geography'].value_counts()
Wilkinson County, Mississippi    1
Norfolk County, Massachusetts    1
Boone County, Iowa               1
Elmore County, Idaho             1
Butler County, Kansas            1
                                ..
Dickinson County, Iowa           1
Marion County, Georgia           1
Coosa County, Alabama            1
Dent County, Missouri            1
Carroll County, Mississippi      1
Name: Geography, Length: 2285, dtype: int64
cancer_data.drop(columns='Geography', inplace=True)
As was observed previously, there are some outliers in the target feature. Only the most extreme values will be removed, given that the points that lie beyond the whisker ends but not significantly far from them can help reduce the overfitting of the model.
q1 = cancer_data['TARGET_deathRate'].quantile(0.25)
q3 = cancer_data['TARGET_deathRate'].quantile(0.75)
iqr = q3 - q1
no_outlier = cancer_data.drop(index=(cancer_data[(cancer_data['TARGET_deathRate'] < q1 - 2.5*iqr) | (cancer_data['TARGET_deathRate'] > q3 + 2.5*iqr)].index))
sns.boxplot(data=no_outlier, x='TARGET_deathRate');
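An optional comparison of the row counts shows how many records the 2.5 × IQR filter actually removed.

# Optional check: number of rows dropped by the outlier filter.
print(len(cancer_data) - len(no_outlier), 'rows removed,', len(no_outlier), 'rows remaining')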
Separating the Input and Target Features of the data¶
After removing the outliers, it is possible to proceed to splitting the data into the input and target features.
from sklearn.model_selection import train_test_split
input_features = no_outlier.drop(columns='TARGET_deathRate')
output_features = no_outlier['TARGET_deathRate']
Splitting the data into Train and Test Sets¶
The input and output features are further split into train and test sets. This train set will be used to fit the model, while the test set will serve as a validation set, before predicting the values for the real test data.
The size for this validation set is selected to be 10% of the full training dataset, and the random_state is set to ensure a consistent and reproducible split.
X_train, X_test, y_train, y_test = train_test_split(input_features, output_features, test_size=0.1, random_state=1)
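An optional look at the resulting shapes confirms the 90/10 split.

# Optional check: sizes of the train and validation splits.
X_train.shape, X_test.shape, y_train.shape, y_test.shape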
Task 3¶
Building an initial Machine Learning Model¶
The first model that will be used is the Random Forest Regressor from the ensemble module of scikit-learn.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
rf_validation_pred = rf.predict(X_test)
Evaluating the model with various Evaluation Metrics¶
In order to evaluate the performance of the model, the Mean Absolute Error (MAE) and Mean Squared Error (MSE) metrics will be used.
from sklearn.metrics import mean_absolute_error, mean_squared_error
mean_absolute_error(y_test, rf_validation_pred)
12.390152838427944
mean_squared_error(y_test, rf_validation_pred)
262.0410070174672
The error from the model is not terrible, but it should be possible to get a better result with a different model, or different parameters.
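Since the same two metrics are reported for each model below, an optional convenience wrapper can bundle them; the report_errors name is made up here and is not part of the original notebook.

def report_errors(y_true, y_pred):
    # Convenience wrapper around the two metrics used throughout this notebook.
    print('MAE:', mean_absolute_error(y_true, y_pred))
    print('MSE:', mean_squared_error(y_true, y_pred))

report_errors(y_test, rf_validation_pred)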
Trying a second Machine Learning Model and Evaluating it¶
The next model that will be used is the Gradient Boosting Regressor, which is also part of the ensemble module of scikit-learn.
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_iter_no_change=None, presort='deprecated', random_state=None, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False)
gbr_validation_pred = gbr.predict(X_test)
mean_squared_error(y_test, gbr_validation_pred)
240.95406717649632
The MSE from the Gradient Boosting model is slightly lower than the error obtained previously with the Random Forest. Therefore, the Gradient Boosting Regressor will be used for the hyperparameter tuning and feature selection.
Task 4¶
Hyperparameter Tuning with Randomized Search Cross Validation¶
Randomized Search Cross Validation will be used to identify the best combination out of several candidate values for the n_estimators, max_depth, and learning_rate parameters.
from sklearn.model_selection import RandomizedSearchCV
parameters = {
'n_estimators': [80, 90, 100, 125, 150, 200, 500, 900],
'max_depth': [2,3,4,5,8,16,None],
'learning_rate': [0.03, 0.1, 0.3, 0.5]
}
cv = RandomizedSearchCV(gbr, parameters, cv=5,n_iter=20)
cv.fit(X_train, y_train)
RandomizedSearchCV(cv=5, error_score=nan, estimator=GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_... subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False), iid='deprecated', n_iter=20, n_jobs=None, param_distributions={'learning_rate': [0.03, 0.1, 0.3, 0.5], 'max_depth': [2, 3, 4, 5, 8, 16, None], 'n_estimators': [80, 90, 100, 125, 150, 200, 500, 900]}, pre_dispatch='2*n_jobs', random_state=None, refit=True, return_train_score=False, scoring=None, verbose=0)
cv.best_params_
{'learning_rate': 0.3, 'max_depth': 3, 'n_estimators': 125}
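Because RandomizedSearchCV refits the estimator on the training split by default (refit=True), the tuned model is already available through the search object; an equivalent regressor could also be rebuilt by hand from the best parameters. This is only an optional illustration.

# The search object already holds a model refit with the best parameters found above...
best_model = cv.best_estimator_
# ...which is equivalent to constructing a new regressor from those same parameters.
manual_model = GradientBoostingRegressor(**cv.best_params_)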
Evaluating the model¶
mean_squared_error(y_test, cv.predict(X_test))
225.2199933222498
The MSE obtained with the best parameters found through the cross validation is the lowest obtained thus far. However, it is possible that removing unnecessary features from the dataset will improve the performance of the model even further.
Task 5¶
Feature Selection with Boruta¶
The Boruta algorithm will be used to identify the relevant features of the dataset, in order to remove the unnecessary ones and improve the model.
!pip install boruta
Collecting boruta
  Downloading https://files.pythonhosted.org/packages/b2/11/583f4eac99d802c79af9217e1eff56027742a69e6c866b295cce6a5a8fc2/Boruta-0.3-py3-none-any.whl (56kB)
Requirement already satisfied: numpy>=1.10.4 in /usr/local/lib/python3.6/dist-packages (from boruta) (1.19.4)
Requirement already satisfied: scikit-learn>=0.17.1 in /usr/local/lib/python3.6/dist-packages (from boruta) (0.22.2.post1)
Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.6/dist-packages (from boruta) (1.4.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.17.1->boruta) (1.0.0)
Installing collected packages: boruta
Successfully installed boruta-0.3
from boruta import BorutaPy
The Boruta selector will be fit using a Gradient Boosting Regressor.
gbr_cv = GradientBoostingRegressor(learning_rate=0.1, max_depth=4, n_estimators=900)
boruta_selector = BorutaPy(gbr_cv, n_estimators='auto', verbose=2, random_state=1)
boruta_selector.fit(np.array(X_train), np.array(y_train))
Iteration: 1 / 100     Confirmed: 0     Tentative: 30    Rejected: 0
  ... (iterations 2-7 identical)
Iteration: 8 / 100     Confirmed: 18    Tentative: 4     Rejected: 8
  ... (iterations 9-12 identical)
Iteration: 13 / 100    Confirmed: 18    Tentative: 3     Rejected: 9
  ... (iterations 14-92 identical)
Iteration: 93 / 100    Confirmed: 18    Tentative: 2     Rejected: 10
  ... (iterations 94-99 identical)

BorutaPy finished running.

Iteration: 100 / 100   Confirmed: 18    Tentative: 1     Rejected: 10
BorutaPy(alpha=0.05, estimator=GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=4, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=158, n_iter_no_change=None, presort='deprecated', random_state=RandomState(MT19937) at 0x7F13DE3E6EB8, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False), max_iter=100, n_estimators='auto', perc=100, random_state=RandomState(MT19937) at 0x7F13DE3E6EB8, two_step=True, verbose=2)
Of the 30 columns that were inspected, 18 were confirmed as important and the remaining ones were rejected or left as tentative. The features that were identified as important can be visualized using the ranking_ attribute of the Boruta selector.
pd.DataFrame(data={'Feature' : X_train.columns, 'Ranking' : boruta_selector.ranking_}).sort_values(by='Ranking')
Feature | Ranking | |
---|---|---|
0 | avgAnnCount | 1 |
18 | PctUnemployed16_Over | 1 |
17 | PctEmployed16_Over | 1 |
16 | PctBachDeg25_Over | 1 |
15 | PctHS25_Over | 1 |
28 | PctMarriedHouseholds | 1 |
13 | PctHS18_24 | 1 |
19 | PctPrivateCoverage | 1 |
9 | MedianAgeFemale | 1 |
10 | AvgHouseholdSize | 1 |
27 | PctOtherRace | 1 |
5 | povertyPercent | 1 |
4 | popEst2015 | 1 |
3 | medIncome | 1 |
2 | incidenceRate | 1 |
1 | avgDeathsPerYear | 1 |
25 | PctBlack | 1 |
23 | PctPublicCoverageAlone | 1 |
26 | PctAsian | 2 |
11 | PercentMarried | 3 |
24 | PctWhite | 4 |
14 | PctBachDeg18_24 | 5 |
12 | PctNoHS18_24 | 6 |
29 | BirthRate | 6 |
22 | PctPublicCoverage | 8 |
21 | PctEmpPrivCoverage | 9 |
6 | studyPerCap | 10 |
8 | MedianAgeMale | 11 |
20 | PctPrivateCoverageAlone | 12 |
7 | MedianAge | 12 |
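The confirmed features can also be listed by name through the selector's support_ attribute, as a quick, optional cross-check of the ranking above.

# Optional: names of the features that Boruta confirmed as important.
confirmed_features = X_train.columns[boruta_selector.support_]
list(confirmed_features)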
The train and test input features are transformed so that only the features selected by the Boruta selector are kept.
X_important_train = boruta_selector.transform(np.array(X_train))
X_important_test = boruta_selector.transform(np.array(X_test))
Now it is possible to train a Gradient Boosting Regressor using only the relevant features identified by the Boruta selector.
gbr_important = GradientBoostingRegressor(learning_rate=0.1, max_depth=3, n_estimators=900)
gbr_important.fit(X_important_train, y_train)
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.1, loss='ls', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=900, n_iter_no_change=None, presort='deprecated', random_state=None, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False)
mean_squared_error(y_test, gbr_important.predict(X_important_test))
171.78772711841128
The MSE obtained is the lowest of all the models that have been evaluated. Now it is possible to perform predictions on the real test data.
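Before moving on, the validation MSE values reported above can be collected in a small table for easier comparison; the numbers are simply copied from the outputs earlier in this notebook.

# Recap of the validation MSE values reported in the previous sections.
pd.DataFrame({
    'model': ['Random Forest', 'Gradient Boosting', 'GB + RandomizedSearchCV', 'GB + tuning + Boruta'],
    'validation_MSE': [262.04, 240.95, 225.22, 171.79]
})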
Task 6¶
Final prediction for submission¶
The test data is loaded, and the same data preparation steps that were performed on the training data (dropping the unused columns, filling the missing values, and applying the Boruta selector transformation) are applied to it.
After preparing the test data, the Gradient Boosting Regressor is applied to it, in order to obtain the final predictions.
test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/cancer_death_rate/Testing_set_label.csv')
test_data.drop(columns=['binnedInc','PctSomeCol18_24','Geography'],inplace=True)
test_data.fillna(value=test_data.mean(), inplace=True)
test_data.duplicated().value_counts()
False 762 dtype: int64
important_test_data = boruta_selector.transform(np.array(test_data))
gbr_important_pred = gbr_important.predict(important_test_data)
submission = pd.DataFrame(gbr_important_pred, columns=['prediction'])
submission.to_csv('gbr_hyper.csv', index=False)
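As a final, optional sanity check, the saved file can be read back to confirm that it contains one prediction per test record.

# Optional check: the submission should have one row per record in the test data.
pd.read_csv('gbr_hyper.csv').shape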