
Kaggle Competition | Titanic Machine Learning from Disaster – Part II

 


This is the second post on this topic. I wasn’t able to improve on the result of the previous post (0.81340), http://elenacuoco.altervista.org/blog/archives/847, but I decided to write a post in the IPython notebook style. I found a lot of posts and IPython notebooks about the same competition, and I took useful suggestions from some of them for this notebook.

This is the starting description of the competition:

Kaggle Competition | Titanic Machine Learning from Disaster

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this contest, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

This Kaggle Getting Started Competition provides an ideal starting place for people who may not have a lot of experience in data science and machine learning. (From the competition homepage.)


Now, let’s get to the code! I started with pandas, patsy and DataFrame containers.
#import library to read and plot the data

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from patsy import dmatrices,dmatrix
 
 
#read data using pandas library
df=pd.read_csv('./data/train.csv') 
df
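 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.0750 | NaN | S
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | NaN | S
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | NaN | C
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4 | 1 | 1 | PP 9549 | 16.7000 | G6 | S
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58 | 0 | 0 | 113783 | 26.5500 | C103 | S
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39 | 1 | 5 | 347082 | 31.2750 | NaN | S
14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14 | 0 | 0 | 350406 | 7.8542 | NaN | S
15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55 | 0 | 0 | 248706 | 16.0000 | NaN | S
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2 | 4 | 1 | 382652 | 29.1250 | NaN | Q
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S
18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande… | female | 31 | 1 | 0 | 345763 | 18.0000 | NaN | S
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C
20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35 | 0 | 0 | 239865 | 26.0000 | NaN | S
21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34 | 0 | 0 | 248698 | 13.0000 | D56 | S
22 | 23 | 1 | 3 | McGowan, Miss. Anna “Annie” | female | 15 | 0 | 0 | 330923 | 8.0292 | NaN | Q
23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28 | 0 | 0 | 113788 | 35.5000 | A6 | S
24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8 | 3 | 1 | 349909 | 21.0750 | NaN | S
25 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia… | female | 38 | 1 | 5 | 347077 | 31.3875 | NaN | S
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C
27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S
28 | 29 | 1 | 3 | O’Dwyer, Miss. Ellen “Nellie” | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q
29 | 30 | 0 | 3 | Todoroff, Mr. Lalio | male | NaN | 0 | 0 | 349216 | 7.8958 | NaN | S
...
861 | 862 | 0 | 2 | Giles, Mr. Frederick Edward | male | 21 | 1 | 0 | 28134 | 11.5000 | NaN | S
862 | 863 | 1 | 1 | Swift, Mrs. Frederick Joel (Margaret Welles Ba… | female | 48 | 0 | 0 | 17466 | 25.9292 | D17 | S
863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith “Dolly” | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S
864 | 865 | 0 | 2 | Gill, Mr. John William | male | 24 | 0 | 0 | 233866 | 13.0000 | NaN | S
865 | 866 | 1 | 2 | Bystrom, Mrs. (Karolina) | female | 42 | 0 | 0 | 236852 | 13.0000 | NaN | S
866 | 867 | 1 | 2 | Duran y More, Miss. Asuncion | female | 27 | 1 | 0 | SC/PARIS 2149 | 13.8583 | NaN | C
867 | 868 | 0 | 1 | Roebling, Mr. Washington Augustus II | male | 31 | 0 | 0 | PC 17590 | 50.4958 | A24 | S
868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S
869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4 | 1 | 1 | 347742 | 11.1333 | NaN | S
870 | 871 | 0 | 3 | Balkic, Mr. Cerin | male | 26 | 0 | 0 | 349248 | 7.8958 | NaN | S
871 | 872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47 | 1 | 1 | 11751 | 52.5542 | D35 | S
872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S
873 | 874 | 0 | 3 | Vander Cruyssen, Mr. Victor | male | 47 | 0 | 0 | 345765 | 9.0000 | NaN | S
874 | 875 | 1 | 2 | Abelson, Mrs. Samuel (Hannah Wizosky) | female | 28 | 1 | 0 | P/PP 3381 | 24.0000 | NaN | C
875 | 876 | 1 | 3 | Najib, Miss. Adele Kiamie “Jane” | female | 15 | 0 | 0 | 2667 | 7.2250 | NaN | C
876 | 877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20 | 0 | 0 | 7534 | 9.8458 | NaN | S
877 | 878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19 | 0 | 0 | 349212 | 7.8958 | NaN | S
878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S
879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56 | 0 | 1 | 11767 | 83.1583 | C50 | C
880 | 881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25 | 0 | 1 | 230433 | 26.0000 | NaN | S
881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33 | 0 | 0 | 349257 | 7.8958 | NaN | S
882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22 | 0 | 0 | 7552 | 10.5167 | NaN | S
883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S
884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39 | 0 | 5 | 382652 | 29.1250 | NaN | Q
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27 | 0 | 0 | 211536 | 13.0000 | NaN | S
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19 | 0 | 0 | 112053 | 30.0000 | B42 | S
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen “Carrie” | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26 | 0 | 0 | 111369 | 30.0000 | C148 | C
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32 | 0 | 0 | 370376 | 7.7500 | NaN | Q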

891 rows × 12 columns

df.describe()
 | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200
Reading the info of the DataFrame, it is evident that the Cabin feature has a lot of null values, so it will not be useful for our prediction and it is better to drop it.
The Age feature also has a large number of null values (177 passengers have no Age), but we suppose that a passenger's age can be very relevant to his or her fate in this disaster. To make use of the full set for this feature, it is useful to impute the missing values.
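The null counts mentioned above can be obtained by summing the null mask per column. This is only a small sketch: the `toy` frame and its values are made up for illustration and are not taken from train.csv. On the real data the same call on `df` should report the missing values for Age and Cabin.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values in each column.
missing = toy.isnull().sum()
print(missing)

# Share of rows that are missing, per column.
missing_share = toy.isnull().mean()
print(missing_share)
```

A column with a very high share of missing values (Cabin here) is a candidate for dropping, while one with a moderate share (Age) is a candidate for imputation.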
Now let’s take a deeper look at the data, to understand which features could be the most important for our model, starting with a quick graphical overview:
fig = plt.figure(figsize=(20,6), dpi=1600 )
a=0.8
ax1 = fig.add_subplot(3,2,1) 
female = df.Survived[df.Sex == 'female'][df['Age'].isnull()].value_counts()
female.plot(kind='bar', label='female, age is null',color='red', alpha=a)
ax1.set_xlim(-1, len(female))
plt.legend(loc='best')
ax2 = fig.add_subplot(3,2,2) 
male= df.Survived[df.Sex == 'male'][df['Age'].isnull()].value_counts()
male.plot(kind='bar', label='male,age is null', alpha=a, color='green')
ax2.set_xlim(-1, len(male))
plt.legend(loc='best')
 
ax3 = fig.add_subplot(3,2,3) 
df.Age[df.Sex == 'female'][df.Survived==1] .dropna().hist(bins=16, range=(0,80), alpha = .5)
ax3.set_title('female age dist,survived')
 
ax4 = fig.add_subplot(3,2,4)
df.Age[df.Sex == 'male'][df.Survived==1] .dropna().hist(bins=16, range=(0,80), alpha = .5)
ax4.set_title('male age dist,survived') 
 
plt.subplots_adjust(bottom=0.1, right=0.8, top=2)
ax5 = fig.add_subplot(3,2,5) 
df.Age[df.Sex == 'female'][df.Survived==0] .dropna().hist(bins=16, range=(0,80), alpha = .5)
ax5.set_title('female age dist,died')
 
ax6 = fig.add_subplot(3,2,6) 
df.Age[df.Sex == 'male'][df.Survived==0] .dropna().hist(bins=16, range=(0,80), alpha = .5)
ax6.set_title('male age dist,died')
plt.show()
 
 
fig = plt.figure(figsize=(18,12), dpi=1600)
a=0.8
 
##gender and class
ax3 = fig.add_subplot(545)
female_highclass = df.Survived[df.Sex == 'female'][df.Pclass != 3].value_counts()
female_highclass.plot(kind='bar', label='female highclass', color='pink', alpha=a)
 
ax3.set_xlim(-1, len(female_highclass))
plt.legend(loc='best')
ax4 = fig.add_subplot(546)
female_lowclass = df.Survived[df.Sex == 'female'][df.Pclass == 3].value_counts()
female_lowclass.plot(kind='bar', label='female, low class', color='pink', alpha=a)
ax4.set_xlim(-1, len(female_lowclass))
plt.legend(loc='best')

ax5 = fig.add_subplot(5,4,7) 
male_lowclass = df.Survived[df.Sex == 'male'][df.Pclass == 3].value_counts()
male_lowclass.plot(kind='bar', label='male, low class',color='lightblue', alpha=a)
ax5.set_xlim(-1, len(male_lowclass))
plt.legend(loc='best')
ax6 = fig.add_subplot(5,4,8) 
male_highclass = df.Survived[df.Sex == 'male'][df.Pclass != 3].value_counts()
male_highclass.plot(kind='bar', label='male highclass', alpha=a, color='lightblue')
 
ax6.set_xlim(-1, len(male_highclass))
plt.legend(loc='best')
 
##gender and age
#female
ax7 = fig.add_subplot(5,4,9)
female_aged = df.Survived[df.Sex == 'female'][df.Age >= 60].value_counts()
female_aged.plot(kind='bar', label='female, aged', color='pink', alpha=a)
 
ax7.set_xlim(-1, len(female_aged))
plt.legend(loc='best')
ax8 = fig.add_subplot(5,4,10)
female_child = df.Survived[df.Sex == 'female'][df.Age <= 10].value_counts()
female_child.plot(kind='bar', label='female, children', color='pink', alpha=a)
 
ax8.set_xlim(-1, len(female_child))
plt.legend(loc='best')
ax9 = fig.add_subplot(5,4,11)
female_middleage = df.Survived[df.Sex == 'female'][df.Age>10][df.Age<=30].value_counts()
female_middleage.plot(kind='bar', label='female, middle age(10-30)', color='pink', alpha=a)
 
ax9.set_xlim(-1, len(female_middleage))
plt.legend(loc='best')
ax9 = fig.add_subplot(5,4,12)
female_middleage = df.Survived[df.Sex == 'female'][df.Age>30][df.Age<60].value_counts()
female_middleage.plot(kind='bar', label='female, middle age (30-60)', color='pink', alpha=a)
 
ax9.set_xlim(-1, len(female_middleage))
plt.legend(loc='best')
 
#male
 
ax10 = fig.add_subplot(5,4,13)
male_aged = df.Survived[df.Sex == 'male'][df.Age >= 60].value_counts()
male_aged.plot(kind='bar', label='male, aged', color='blue', alpha=a)
 
ax10.set_xlim(-1, len(male_aged))
plt.legend(loc='best')
ax11 = fig.add_subplot(5,4,14)
male_child = df.Survived[df.Sex == 'male'][df.Age <= 10].value_counts()
male_child.plot(kind='bar', label='male, children', color='blue', alpha=a)
 
ax11.set_xlim(-1, len(male_child))
plt.legend(loc='best')
ax12 = fig.add_subplot(5,4,15)
male_middleage = df.Survived[df.Sex == 'male'][df.Age>10][df.Age<=30].value_counts()
male_middleage.plot(kind='bar', label='male, middle age (10-30)', color='blue', alpha=a)
 
ax12.set_xlim(-1, len(male_middleage))
plt.legend(loc='best') 
ax12 = fig.add_subplot(5,4,16)
male_middleage = df.Survived[df.Sex == 'male'][df.Age>30][df.Age<60].value_counts()
male_middleage.plot(kind='bar', label='male, middle age (30-60)', color='blue', alpha=a)
 
ax12.set_xlim(-1, len(male_middleage))
plt.legend(loc='best')

Very interesting! In this sample, if you are female and aged (60 or over) your probability of surviving is 1. If you are high class (first or second class), male or female, you have a high probability of surviving. If you are a child, you have more or less the same probability of surviving, independent of gender.
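The bar charts show counts. The survival fractions themselves can be read off with a groupby, as in the minimal sketch below. The `toy` rows are made up for illustration; on the real data the same expression can be applied to the `df` loaded from train.csv.

```python
import pandas as pd

# Toy data (made up); on the real data use the df loaded from train.csv.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 1, 3, 1],
    "Survived": [1, 0, 0, 1, 0, 1],
})

# Survived is 0/1, so its mean within a group is that group's survival rate.
rate = toy.groupby(["Sex", "Pclass"])["Survived"].mean()
print(rate)
```

Adding a binned age column to the groupby keys would give the same kind of table for the age groups plotted above.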

fig = plt.figure(figsize=(18,6), dpi=1600)
ax1 = fig.add_subplot(121)
df.Survived[df.Pclass == 3].value_counts().plot(kind='bar',label='LowClass',color='red')
ax1.set_xlim(-1, 2)
plt.legend(loc='best')
ax2 = fig.add_subplot(122) 
df.Survived[df.Pclass != 3].value_counts().plot(kind='bar',label='HighClass',color='green')
ax2.set_xlim(-1, 2)
plt.legend(loc='best')
fig = plt.figure(figsize=(18,6), dpi=1600)
ax1 = fig.add_subplot(221)
df.Survived[df.Fare <=300].value_counts().plot(kind='bar',label='Fare<=300',color='red')
ax1.set_xlim(-1, 2)
plt.legend(loc='best')
ax2 = fig.add_subplot(222) 
df.Survived[df.Fare >300].value_counts().plot(kind='bar',label='Fare>300',color='green')
ax2.set_xlim(-1, 2)
plt.legend(loc='best')
So, as we could have imagined, being a woman and being high-class are good passports for survival.
for i in range(1,4):
    print('Male:'), i, len(df[ (df['Sex'] == 'male') & (df['Pclass'] == i) ])
    print('Female:'), i, len(df[ (df['Sex'] == 'female') & (df['Pclass'] == i) ])
Male, low class -> high probability of dying.

Data cleaning:

Now we can clean, impute and encode the data to get a better model for our prediction. The important step is the imputation of age. We have seen that, in this sample, if you are aged and female you survive, and that high-class and aged passengers are more likely to survive. We are now going to translate these conclusions into the model.
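The training code below fills missing ages with the mean age of passengers who share the same title (Miss, Mrs, Mr, Master), using one explicit assignment per title. A more compact way to express the same idea is a groupby with transform. This is only a sketch: the `toy` frame and its values are made up, and it assumes a Title column has already been extracted from Name.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Miss", "Miss", "Mr", "Miss"],
    "Age": [40.0, np.nan, 10.0, 20.0, 30.0, np.nan],
})

# Mean age of each title group, broadcast back to the rows of that group.
group_mean = toy.groupby("Title")["Age"].transform("mean")

# Keep known ages and fill only the missing ones with the group mean.
toy["AgeFill"] = toy["Age"].fillna(group_mean)
print(toy)
```

Known ages are left untouched; only the rows where Age is null receive the mean of their title group.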
After reading the Best Practices https://www.kaggle.com/wiki/ModelSubmissionBestPractices , I reorganized the code following those suggestions.

This is my Training code:

#Titanic competition using pandas and scikit-learn libraries
import numpy as np
import pandas as pd
from pandas import  DataFrame
from patsy import dmatrices
import string
from operator import itemgetter
#json library for settings file
import json
# import the machine learning library that holds the randomforest
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split,StratifiedShuffleSplit,StratifiedKFold
from sklearn import preprocessing
from sklearn.metrics import classification_report
#joblib library for serialization
from sklearn.externals import joblib

##Read configuration parameters

file_dir = './titanic-SETTINGS.json'

config = json.loads(open(file_dir).read())
train_file=config["HOME"]+config["TRAIN_DATA_PATH"]+'train.csv'
MODEL_PATH=config["HOME"]+config["MODEL_PATH"]
test_file=config["HOME"]+config["TEST_DATA_PATH"]+'test.csv'
SUBMISSION_PATH=config["HOME"]+config["SUBMISSION_PATH"]
seed= int(config["SEED"])

print train_file,seed

# Utility function to report best scores
def report(grid_scores, n_top=3):
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")

###utility to clean and munge data
def substrings_in_string(big_string, substrings):
    for substring in substrings:
        if string.find(big_string, substring) != -1:
            return substring
    print big_string
    return np.nan

le = preprocessing.LabelEncoder()
enc=preprocessing.OneHotEncoder()

def clean_and_munge_data(df):
    #setting silly values to nan
    df.Fare = df.Fare.map(lambda x: np.nan if x==0 else x)
    #creating a title column from name
    title_list=['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
                'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess',
                'Don', 'Jonkheer']
    df['Title']=df['Name'].map(lambda x: substrings_in_string(x, title_list))

    #replacing all titles with mr, mrs, miss, master
    def replace_titles(x):
        title=x['Title']
        if title in ['Mr','Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
            return 'Mr'
        elif title in ['Master']:
            return 'Master'
        elif title in ['Countess', 'Mme','Mrs']:
            return 'Mrs'
        elif title in ['Mlle', 'Ms','Miss']:
            return 'Miss'
        elif title =='Dr':
            #Sex values in the data are lowercase ('male'/'female')
            if x['Sex']=='male':
                return 'Mr'
            else:
                return 'Mrs'
        elif title =='':
            if x['Sex']=='male':
                return 'Master'
            else:
                return 'Miss'
        else:
            return title

    df['Title']=df.apply(replace_titles, axis=1)

    #Creating new family_size column
    df['Family_Size']=df['SibSp']+df['Parch']
    df['Family']=df['SibSp']*df['Parch']


    #imputing nan values
    df.loc[ (df.Fare.isnull())&(df.Pclass==1),'Fare'] =np.median(df[df['Pclass'] == 1]['Fare'].dropna())
    df.loc[ (df.Fare.isnull())&(df.Pclass==2),'Fare'] =np.median( df[df['Pclass'] == 2]['Fare'].dropna())
    df.loc[ (df.Fare.isnull())&(df.Pclass==3),'Fare'] = np.median(df[df['Pclass'] == 3]['Fare'].dropna())

    df['Gender'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

    df['AgeFill']=df['Age']
    mean_ages = np.zeros(4)
    mean_ages[0]=np.average(df[df['Title'] == 'Miss']['Age'].dropna())
    mean_ages[1]=np.average(df[df['Title'] == 'Mrs']['Age'].dropna())
    mean_ages[2]=np.average(df[df['Title'] == 'Mr']['Age'].dropna())
    mean_ages[3]=np.average(df[df['Title'] == 'Master']['Age'].dropna())
    df.loc[ (df.Age.isnull()) & (df.Title == 'Miss') ,'AgeFill'] = mean_ages[0]
    df.loc[ (df.Age.isnull()) & (df.Title == 'Mrs') ,'AgeFill'] = mean_ages[1]
    df.loc[ (df.Age.isnull()) & (df.Title == 'Mr') ,'AgeFill'] = mean_ages[2]
    df.loc[ (df.Age.isnull()) & (df.Title == 'Master') ,'AgeFill'] = mean_ages[3]

    df['AgeCat']=df['AgeFill']
    df.loc[ (df.AgeFill<=10) ,'AgeCat'] = 'child'
    df.loc[ (df.AgeFill>60),'AgeCat'] = 'aged'
    df.loc[ (df.AgeFill>10) & (df.AgeFill <=30) ,'AgeCat'] = 'adult'
    df.loc[ (df.AgeFill>30) & (df.AgeFill <=60) ,'AgeCat'] = 'senior'

    df.Embarked = df.Embarked.fillna('S')


    #Special case for cabins as nan may be signal
    #compute the mask once: after the first assignment no value is null any more
    cabin_missing = df.Cabin.isnull()
    df.loc[cabin_missing,'Cabin'] = 0.5
    df.loc[~cabin_missing,'Cabin'] = 1.5
    #Fare per person

    df['Fare_Per_Person']=df['Fare']/(df['Family_Size']+1)

    #Age times class

    df['AgeClass']=df['AgeFill']*df['Pclass']
    df['ClassFare']=df['Pclass']*df['Fare_Per_Person']


    df['HighLow']=df['Pclass']
    df.loc[ (df.Fare_Per_Person<8) ,'HighLow'] = 'Low'
    df.loc[ (df.Fare_Per_Person>=8) ,'HighLow'] = 'High'



    le.fit(df['Sex'] )
    x_sex=le.transform(df['Sex'])
    df['Sex']=x_sex.astype(np.float)

    le.fit( df['Ticket'])
    x_Ticket=le.transform( df['Ticket'])
    df['Ticket']=x_Ticket.astype(np.float)

    le.fit(df['Title'])
    x_title=le.transform(df['Title'])
    df['Title'] =x_title.astype(np.float)

    le.fit(df['HighLow'])
    x_hl=le.transform(df['HighLow'])
    df['HighLow']=x_hl.astype(np.float)


    le.fit(df['AgeCat'])
    x_age=le.transform(df['AgeCat'])
    df['AgeCat'] =x_age.astype(np.float)

    le.fit(df['Embarked'])
    x_emb=le.transform(df['Embarked'])
    df['Embarked']=x_emb.astype(np.float)

    df = df.drop(['PassengerId','Name','Age','Cabin'], axis=1) #remove PassengerId, Name, Age and Cabin


    return df

#read data
traindf=pd.read_csv(train_file)
##clean data
df=clean_and_munge_data(traindf)
########################################formula################################
 
formula_ml='Survived~Pclass+C(Title)+Sex+C(AgeCat)+Fare_Per_Person+Fare+Family_Size' 

y_train, x_train = dmatrices(formula_ml, data=df, return_type='dataframe')
y_train = np.asarray(y_train).ravel()
print y_train.shape,x_train.shape

##select a train and test set
X_train, X_test, Y_train, Y_test = train_test_split(x_train, y_train, test_size=0.2,random_state=seed)
#instantiate and fit our model

clf=RandomForestClassifier(n_estimators=500, criterion='entropy', max_depth=5, min_samples_split=1,
  min_samples_leaf=1, max_features='auto',    bootstrap=False, oob_score=False, n_jobs=1, random_state=seed,
  verbose=0, min_density=None, compute_importances=None)

###compute grid search to find best paramters for pipeline
param_grid = dict( )
##classify pipeline
pipeline=Pipeline([ ('clf',clf) ])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=3,scoring='accuracy',
cv=StratifiedShuffleSplit(Y_train, n_iter=10, test_size=0.2, train_size=None, indices=None, 
random_state=seed, n_iterations=None)).fit(X_train, Y_train)
# Score the results
###print result
print("Best score: %0.3f" % grid_search.best_score_)
print(grid_search.best_estimator_)
report(grid_search.grid_scores_)
 
print('-----grid search end------------')
print ('on all train set')
scores = cross_val_score(grid_search.best_estimator_, x_train, y_train,cv=3,scoring='accuracy')
print scores.mean(),scores
print ('on test set')
scores = cross_val_score(grid_search.best_estimator_, X_test, Y_test,cv=3,scoring='accuracy')
print scores.mean(),scores

# Score the results

print(classification_report(Y_train, grid_search.best_estimator_.predict(X_train) ))
print('test data')
print(classification_report(Y_test, grid_search.best_estimator_.predict(X_test) ))

#serialize training
model_file=MODEL_PATH+'model-rf.pkl'
joblib.dump(grid_search.best_estimator_, model_file)
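The script above targets the scikit-learn release that was current when it was written (`sklearn.cross_validation`, `sklearn.grid_search`). As a rough sketch, and assuming a recent scikit-learn where these tools live in `sklearn.model_selection`, the same stratified shuffle-split validation could look like the code below. The data comes from `make_classification` as a synthetic stand-in for the patsy design matrix, and the smaller `n_estimators` is my choice to keep the example fast; neither is part of the original script.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    StratifiedShuffleSplit, cross_val_score, train_test_split)

seed = 0  # assumption: any fixed integer works here

# Synthetic stand-in for x_train / y_train.
X, y = make_classification(n_samples=300, n_features=8, random_state=seed)

# Hold out 20% of the data, as in the script above.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.2, random_state=seed)

clf = RandomForestClassifier(
    n_estimators=50, criterion="entropy", max_depth=5,
    bootstrap=False, random_state=seed)

# 10 stratified random splits, each holding out 20% of the training part.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=seed)
scores = cross_val_score(clf, X_train, Y_train, cv=cv, scoring="accuracy")
print("mean accuracy: %0.3f" % scores.mean())

# Fit on the training part and score the held-out 20%.
clf.fit(X_train, Y_train)
test_accuracy = clf.score(X_test, Y_test)
print("held-out accuracy: %0.3f" % test_accuracy)
```

In the newer API the splitter no longer receives the labels in its constructor; it gets them when `cross_val_score` calls it.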

This is my Prediction code:

#Titanic competition using pandas and scikit-learn libraries
import numpy as np
import pandas as pd
from pandas import  DataFrame
from patsy import dmatrices
import string
from operator import itemgetter
#json library for settings file
import json
# import the machine learning library that holds the randomforest
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split,StratifiedShuffleSplit,StratifiedKFold
from sklearn import preprocessing
#joblib library for serialization
from sklearn.externals import joblib

#Read data and configuration parameters#

file_dir = './titanic-SETTINGS.json'

config = json.loads(open(file_dir).read())
train_file=config["HOME"]+config["TRAIN_DATA_PATH"]+'train.csv'
MODEL_PATH=config["HOME"]+config["MODEL_PATH"]
test_file=config["HOME"]+config["TEST_DATA_PATH"]+'test.csv'
SUBMISSION_PATH=config["HOME"]+config["SUBMISSION_PATH"]
seed= int(config["SEED"])


print test_file, seed
# Utility function to report best scores
def report(grid_scores, n_top=3):
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")

###utility to clean and munge data
def substrings_in_string(big_string, substrings):
    for substring in substrings:
        if string.find(big_string, substring) != -1:
            return substring
    print big_string
    return np.nan

le = preprocessing.LabelEncoder()
enc=preprocessing.OneHotEncoder()

def clean_and_munge_data(df):
    #setting silly values to nan
    df.Fare = df.Fare.map(lambda x: np.nan if x==0 else x)
    #creating a title column from name
    title_list=['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
                'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess',
                'Don', 'Jonkheer']
    df['Title']=df['Name'].map(lambda x: substrings_in_string(x, title_list))

    #replacing all titles with mr, mrs, miss, master
    def replace_titles(x):
        title=x['Title']
        if title in ['Mr','Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
            return 'Mr'
        elif title in ['Master']:
            return 'Master'
        elif title in ['Countess', 'Mme','Mrs']:
            return 'Mrs'
        elif title in ['Mlle', 'Ms','Miss']:
            return 'Miss'
        elif title =='Dr':
            #Sex values in the data are lowercase ('male'/'female')
            if x['Sex']=='male':
                return 'Mr'
            else:
                return 'Mrs'
        elif title =='':
            if x['Sex']=='male':
                return 'Master'
            else:
                return 'Miss'
        else:
            return title

    df['Title']=df.apply(replace_titles, axis=1)

    #Creating new family_size column
    df['Family_Size']=df['SibSp']+df['Parch']
    df['Family']=df['SibSp']*df['Parch']


    #imputing nan values
    df.loc[ (df.Fare.isnull())&(df.Pclass==1),'Fare'] =np.median(df[df['Pclass'] == 1]['Fare'].dropna())
    df.loc[ (df.Fare.isnull())&(df.Pclass==2),'Fare'] =np.median( df[df['Pclass'] == 2]['Fare'].dropna())
    df.loc[ (df.Fare.isnull())&(df.Pclass==3),'Fare'] = np.median(df[df['Pclass'] == 3]['Fare'].dropna())

    df['Gender'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

    df['AgeFill']=df['Age']
    mean_ages = np.zeros(4)
    mean_ages[0]=np.average(df[df['Title'] == 'Miss']['Age'].dropna())
    mean_ages[1]=np.average(df[df['Title'] == 'Mrs']['Age'].dropna())
    mean_ages[2]=np.average(df[df['Title'] == 'Mr']['Age'].dropna())
    mean_ages[3]=np.average(df[df['Title'] == 'Master']['Age'].dropna())
    df.loc[ (df.Age.isnull()) & (df.Title == 'Miss') ,'AgeFill'] = mean_ages[0]
    df.loc[ (df.Age.isnull()) & (df.Title == 'Mrs') ,'AgeFill'] = mean_ages[1]
    df.loc[ (df.Age.isnull()) & (df.Title == 'Mr') ,'AgeFill'] = mean_ages[2]
    df.loc[ (df.Age.isnull()) & (df.Title == 'Master') ,'AgeFill'] = mean_ages[3]

    df['AgeCat']=df['AgeFill']
    df.loc[ (df.AgeFill<=10) ,'AgeCat'] = 'child'
    df.loc[ (df.AgeFill>60),'AgeCat'] = 'aged'
    df.loc[ (df.AgeFill>10) & (df.AgeFill <=30) ,'AgeCat'] = 'adult'
    df.loc[ (df.AgeFill>30) & (df.AgeFill <=60) ,'AgeCat'] = 'senior'

    df.Embarked = df.Embarked.fillna('S')


    #Special case for cabins as nan may be signal
    #compute the mask once: after the first assignment no value is null any more
    cabin_missing = df.Cabin.isnull()
    df.loc[cabin_missing,'Cabin'] = 0.5
    df.loc[~cabin_missing,'Cabin'] = 1.5
    #Fare per person

    df['Fare_Per_Person']=df['Fare']/(df['Family_Size']+1)

    #Age times class

    df['AgeClass']=df['AgeFill']*df['Pclass']
    df['ClassFare']=df['Pclass']*df['Fare_Per_Person']


    df['HighLow']=df['Pclass']
    df.loc[ (df.Fare_Per_Person<8) ,'HighLow'] = 'Low'
    df.loc[ (df.Fare_Per_Person>=8) ,'HighLow'] = 'High'



    le.fit(df['Sex'] )
    x_sex=le.transform(df['Sex'])
    df['Sex']=x_sex.astype(np.float)

    le.fit( df['Ticket'])
    x_Ticket=le.transform( df['Ticket'])
    df['Ticket']=x_Ticket.astype(np.float)

    le.fit(df['Title'])
    x_title=le.transform(df['Title'])
    df['Title'] =x_title.astype(np.float)

    le.fit(df['HighLow'])
    x_hl=le.transform(df['HighLow'])
    df['HighLow']=x_hl.astype(np.float)


    le.fit(df['AgeCat'])
    x_age=le.transform(df['AgeCat'])
    df['AgeCat'] =x_age.astype(np.float)

    le.fit(df['Embarked'])
    x_emb=le.transform(df['Embarked'])
    df['Embarked']=x_emb.astype(np.float)

    df = df.drop(['PassengerId','Name','Age','Cabin'], axis=1) #remove PassengerId, Name, Age and Cabin


    return df

#read data

testdf=pd.read_csv(test_file)

ID=testdf['PassengerId']
##clean data
df_test=clean_and_munge_data(testdf)
df_test['Survived'] =  [0 for x in range(len(df_test))]

print df_test.shape
########################################formula################################
 
formula_ml='Survived~Pclass+C(Title)+Sex+C(AgeCat)+Fare_Per_Person+Fare+Family_Size' 

y_p,x_test = dmatrices(formula_ml, data=df_test, return_type='dataframe')
y_p = np.asarray(y_p).ravel()
print  y_p.shape,x_test.shape
#load the serialized model
model_file=MODEL_PATH+'model-rf.pkl'
clf = joblib.load(model_file)
####estimate prediction on test data set
y_p=clf.predict(x_test).astype(int)
print y_p.shape

outfile=SUBMISSION_PATH+'prediction-BS.csv'
dfjo = DataFrame(dict(Survived=y_p,PassengerId=ID), columns=['Survived','PassengerId'])
dfjo.to_csv(outfile,index=False)
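The helper `substrings_in_string` used in both scripts relies on `string.find`, which exists only in Python 2. A possible Python 3 version, sketched here with the `in` operator and without the debugging print, keeps the same behaviour of returning the first matching title or NaN.

```python
import numpy as np

# Same title list as in the scripts above. Because 'Mrs' comes before 'Mr',
# a name containing "Mrs." is matched as 'Mrs' and not as 'Mr'.
title_list = ['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
              'Dr', 'Ms', 'Mlle', 'Col', 'Capt', 'Mme', 'Countess',
              'Don', 'Jonkheer']

def substrings_in_string(big_string, substrings):
    """Return the first entry of substrings found in big_string, else NaN."""
    for substring in substrings:
        if substring in big_string:
            return substring
    return np.nan

print(substrings_in_string('Braund, Mr. Owen Harris', title_list))
print(substrings_in_string('Heikkinen, Miss. Laina', title_list))
```

The order of `title_list` matters, since the search stops at the first hit.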



A few considerations

Remember to fix a seed so that you can reproduce your results!
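As a small illustration of why the seed matters, the sketch below trains the same random forest twice with the same `random_state` and checks that the outputs coincide. The data is synthetic (`make_classification`) and the helper `fit_predict` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to make the example self-contained.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

def fit_predict(seed):
    """Train a small forest with the given seed and return class probabilities."""
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    return model.fit(X, y).predict_proba(X)

same_a = fit_predict(42)
same_b = fit_predict(42)

# Identical seeds build identical forests, hence identical outputs.
print(np.array_equal(same_a, same_b))
```

The same applies to the splits: `train_test_split` and the cross-validation splitter also take the seed, so that the whole run can be repeated.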

Play a bit with the parameters in the grid search to get better results.
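In the training code `param_grid` is empty, so the grid search evaluates a single configuration. A minimal sketch of a non-empty grid for the same kind of pipeline follows, assuming a recent scikit-learn. The candidate values are arbitrary choices for illustration and the data is synthetic. The `clf__` prefix routes each parameter to the pipeline step named 'clf'.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

seed = 0  # assumption: any fixed integer works here

# Synthetic stand-in for the training design matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=seed)

pipeline = Pipeline([
    ("clf", RandomForestClassifier(n_estimators=30, random_state=seed)),
])

# Two values per parameter, so four candidate configurations in total.
param_grid = {
    "clf__max_depth": [3, 5],
    "clf__min_samples_leaf": [1, 5],
}

grid_search = GridSearchCV(pipeline, param_grid=param_grid,
                           scoring="accuracy", cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
print("Best score: %0.3f" % grid_search.best_score_)
```

The number of fits grows with the product of the list lengths times the number of folds, so it is worth starting with a coarse grid.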

With the above configuration I reached 0.81340, but I never obtained a better result on the leaderboard with these algorithms and this feature selection, even with very high accuracy in my own tests!

I have to thank all the people who wrote tutorials in the Kaggle forum, and the various posts on different blogs that helped me with the Python notebook, the pandas library, matplotlib and the beautiful scikit-learn package (http://scikit-learn.org/stable/).




 

You can find the ipython notebook on my github repo at the following link:
https://github.com/elenacuoco/kaggle-competitions/blob/master/Titanic-For_Blog.ipynb

 

 

 


2 Comments on "Kaggle Competition | Titanic Machine Learning from Disaster – Part II"

sputknic
Guest

Thank you for this. I’ve been using pandas for a project for the past 6 months, that was only numbers. Started on my first project with strings, and your implementation of substring_in_string answered a question I didn’t know how to ask. Keep up the good work, best of luck to you!