The Santander Customer Competition
Essay by vvghelu • July 22, 2016 • Research Paper • 1,459 Words (6 Pages)
Santander Customer Satisfaction
March 28, 2016
1 Table of Contents
Intro
Results
Appendix
- Data
- Principal Component Analysis
- Feature selection
- Fitting the model
  - NearestNeighbour
  - LogisticClassifier
  - XGBoost
  - RandomForestClassifier
  - ExtraTreesClassifier
Evan's suggestions
Model tuning
Kaggle submission
Resources
Authors
2 Intro
The Santander Customer Competition on Kaggle provides a synthetic data set with 370 numerical
variables. Using those variables, the task is to predict whether a customer is satisfied or not. The
evaluation metric is ROC AUC (area under the ROC curve).
3 Results
The data set contains synthetic, anonymized data. It included duplicate columns and columns with
zero variance (standard deviation of zero), which we removed. This step reduced the number of
independent variables from 370 to 308.
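The competition files are not reproduced here, so the cleaning step can only be sketched on a hypothetical toy frame (column names are made up for illustration):

```python
import pandas as pd

def drop_redundant_columns(df):
    """Drop zero-variance columns, then columns that duplicate an earlier one."""
    df = df.loc[:, df.std() != 0]   # constant columns carry no information
    df = df.T.drop_duplicates().T   # duplicate columns become duplicate rows after transpose
    return df

# hypothetical toy frame: 'b' is constant, 'c' duplicates 'a'
toy = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7], "c": [1, 2, 3]})
cleaned = drop_redundant_columns(toy)
print(list(cleaned.columns))  # ['a']
```

Transposing to drop duplicate columns is convenient for a small frame; on the full 76020-row training set a hash-per-column comparison would be cheaper.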
Using principal component analysis we can see in the plot (see appendix) that the two clusters of
customers, satisfied and not, overlap considerably, which makes it harder for the classifiers to perform
well. Five principal components explain 96% of the variance (see appendix). At this point we train the
classifiers on the original data after performing feature selection.
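The explained-variance computation behind that 96% figure can be sketched with a plain SVD; the data below is a synthetic stand-in for the real training matrix, which is not reproduced here:

```python
import numpy as np

def explained_variance_ratio(X, k):
    """Fraction of total variance captured by the first k principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2                             # proportional to per-component variances
    return var[:k].sum() / var.sum()

# synthetic stand-in: one strongly redundant feature, so few components dominate
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = 3 * X[:, 0]
ratio = explained_variance_ratio(X, 5)
```

With all components included the ratio is exactly 1, which is a useful sanity check on the implementation.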
For feature selection we opted for a randomized trees classifier, aka extra trees, which computes
importance coefficients for the features; these are then used to select features. We end up with 36
important features, reducing the number of independent variables from 308 to 36. We use these 36
features to train our classifiers.
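A minimal sketch of this selection step uses scikit-learn's ExtraTreesClassifier on synthetic data; the mean-importance cutoff below is our assumption, not necessarily the exact threshold used:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# synthetic stand-in: only the first two of ten features drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# keep features whose importance exceeds the mean importance (assumed cutoff)
keep = np.flatnonzero(forest.feature_importances_ > forest.feature_importances_.mean())
```

The informative features should end up in `keep` while most noise features fall below the cutoff; scikit-learn's `SelectFromModel` wraps the same threshold logic.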
We tried different models to see which performs best, with the intent to concentrate on the promising
candidates and improve them by tuning parameters through statistical analysis. We expected that gradient
tree boosting, implemented in the xgboost Python module, would be the best model since it is an ensemble
method. In the end the XGBoost classifier performed best, with an ROC AUC of 0.838. For comparison,
the best score on the Kaggle leaderboard is 0.842 as of 3/14/2016. Unfortunately, our score puts us
in the top 800, so there is room for improvement.
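The train-then-score pattern can be sketched as follows. Since xgboost may not be installed in every environment, this sketch substitutes scikit-learn's GradientBoostingClassifier, which implements the same gradient tree boosting idea; the data and the resulting score are synthetic, not the competition's:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the reduced 36-feature training set
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# score with ROC AUC, the competition metric (needs probabilities, not labels)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Note that ROC AUC is computed from predicted probabilities, not hard class labels; passing `model.predict(...)` instead would understate the score.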
After tuning our XGBoost model and running it on three datasets with randomly sampled variables, we
were able to improve our score on Kaggle to 0.83905, which put us in the top 400 as of 3/10/2016. The
competition is open until May 2nd and we are working on improving our model. At this point we present
our best solution so far. Going forward we will use more sophisticated sampling methods and, in addition
to XGBoost, a mix of classifiers such as AdaBoost and ExtraTreesClassifier to make sure there is no
overfitting on the training set. We are exploring the use of neural nets as well; however, to achieve
significant improvements in the ROC AUC score we believe we need to focus on feature engineering.
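Blending models trained on randomly sampled variable subsets can be sketched as follows; the subset size, model choice, and data are illustrative assumptions, not the submitted configuration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# synthetic stand-in data; drawing 8 of 12 columns per model is illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

preds = []
for seed in range(3):
    # each model sees its own random subset of the columns
    cols = np.random.default_rng(seed).choice(X.shape[1], size=8, replace=False)
    model = GradientBoostingClassifier(random_state=seed).fit(X[:, cols], y)
    preds.append(model.predict_proba(X[:, cols])[:, 1])

# average the three models' predicted probabilities into one score per row
blend = np.mean(preds, axis=0)
```

Averaging probabilities from models that see different feature subsets decorrelates their errors, which is the same intuition behind the planned AdaBoost/ExtraTreesClassifier mix.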
4 Reproducibility
All code used for this project is provided in the appendix. Additionally, you can find the IPython notebook
version of this write-up at https://goo.gl/654D3o . The online version has helpful links.
5 Appendix
5.1 Data
First, we read in the data.
In [ ]: import pandas as pd
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
In [ ]: train.iloc[:,0:5].head()
In [3]: train.iloc[:,0:5].describe()
Out[3]:               ID           var3         var15  imp_ent_var16_ult1  imp_op_var39_comer_ult1
count   76020.000000   76020.000000  76020.000000        76020.000000             76020.000000
mean    75964.050723   -1523.199277     33.212865           86.208265                72.363067
std     43781.947379   39033.462364     12.956486         1614.757313               339.315831
min         1.000000 -999999.000000      5.000000            0.000000                 0.000000
25%     38104.750000       2.000000     23.000000            0.000000                 0.000000
50%     76043.000000       2.000000     28.000000            0.000000                 0.000000
75%    113748.750000       2.000000     40.000000            0.000000                 0.000000
max    151838.000000     238.000000    105.000000       210000.000000             12888.030000
In [121]: # Number of rows and columns including dependent variable TARGET
train.shape
Out[121]:
...
...