I discovered the Kaggle community after following the Caltech Machine Learning course on edX.
I started playing with its datasets using Python and the scikit-learn library (http://scikit-learn.org/stable/).
Scikit-learn is a very well written and well documented Python library that covers almost everything concerning Machine Learning.
After obtaining good results in my first competition, Data Science London, I decided to move on to real data to better understand how Machine Learning works for classification.
The data from the Titanic disaster are interesting because they made me realize that, before you can hope to produce a good prediction, you have to understand the data you have in your hands.
In this competition you have a set of training data and a set of test data on which you have to make your predictions.
The training data has 891 entries, each with a ‘PassengerId’, a label ‘Survived’ and the following 10 features:
[‘Pclass’, ‘Name’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Ticket’, ‘Fare’, ‘Cabin’, ‘Embarked’]
pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Some of these features have missing values and you have to decide what to do with them: you can throw those rows away, build a simple model to impute them, or build a more sophisticated model.
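To make the first two options concrete, here is a minimal sketch. It uses a tiny made-up table (not the real Kaggle file) that only borrows the Titanic column names, and it assumes a recent scikit-learn where the simple imputer is called SimpleImputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up sample that only borrows the Titanic column names.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00],
})

# First, count how many values are missing in each column.
missing_per_column = df.isnull().sum()

# Option 1: throw away every row that has a missing value.
dropped = df.dropna()

# Option 2: fill the gaps with a simple statistic, here the column median.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(missing_per_column)
print(len(df), "rows before dropping,", len(dropped), "rows after")
print(filled)
```

Dropping rows is the easiest choice but it costs training data; imputing keeps every passenger at the price of inventing some values.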
Moreover, several features are strings, such as the passenger name, ‘Sex’ and ‘Embarked’, and you have to translate them into numerical labels if you want to feed them to a classifier.
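As an illustration of this translation, here is a small sketch with scikit-learn's LabelEncoder. The two short lists are made-up values, not rows taken from the competition file.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values in the style of the 'Sex' and 'Embarked' columns.
sex = ["male", "female", "female", "male"]
embarked = ["S", "C", "Q", "S"]

# One encoder per column: each one learns its own string-to-integer mapping.
sex_encoder = LabelEncoder()
sex_codes = sex_encoder.fit_transform(sex)

embarked_encoder = LabelEncoder()
embarked_codes = embarked_encoder.fit_transform(embarked)

# classes_ holds the original strings, indexed by their integer code.
print(list(sex_encoder.classes_), list(sex_codes))
print(list(embarked_encoder.classes_), list(embarked_codes))
```

The integers follow the sorted order of the strings, so they carry no real meaning as quantities. Tree-based classifiers usually cope with that, while linear models may be misled by the artificial ordering.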
In my first attempts I decided to work with a very simple model without thinking much about feature selection, leaving that job to an algorithm instead: selection of the best features (such as SelectKBest), feature reduction with PCA, or a more sophisticated algorithm such as RFECV
(http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html). However, the results I obtained were not satisfactory, and my prediction score stayed below 0.79.
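For readers who have not met these three tools, here is a rough sketch of how each one is called. It runs on synthetic data from make_classification rather than on the Titanic features, and the choice of 5 features and of LogisticRegression as the estimator inside RFECV are my own assumptions for the example, not the settings I used in the competition.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 samples, 10 features, only 3 truly informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

# Keep the 5 features with the highest ANOVA F-score against the label.
X_kbest = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Project the data onto its first 5 principal components (ignores the label).
X_pca = PCA(n_components=5).fit_transform(X)

# Recursively drop the weakest feature, using cross-validation to decide
# how many features to keep in the end.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=5)
rfecv.fit(X, y)

print(X_kbest.shape, X_pca.shape)
print("RFECV kept", rfecv.n_features_, "features:", rfecv.support_)
```

Note the difference between the three: SelectKBest and RFECV keep a subset of the original columns, while PCA builds new columns that mix all of them.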
So I realized I needed a better model and some data preprocessing!
I used LabelEncoder to translate the strings into numbers. At first I also used Imputer to fill in the missing values, but it did not give the results I wanted, so I decided to build an ad hoc model that imputes the missing values with a mixed criterion.
It is important to have a good imputer for the missing ages!
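To give an idea of what an ad hoc imputer for the ages can look like, here is one possible sketch. The rule it applies (median age of passengers with the same ‘Pclass’ and ‘Sex’, with the overall median as a fallback) is an assumption chosen for illustration and is not necessarily the mixed criterion I ended up with; the table is made up.

```python
import numpy as np
import pandas as pd

# Made-up rows that only borrow the Titanic column names.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, 24.0, np.nan],
})

# Fallback value, used only if a whole group has no known age.
overall_median = df["Age"].median()

# Median age of each (Pclass, Sex) group, aligned row by row with df.
group_median = df.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Fill each missing age with its group median, then with the fallback.
df["Age"] = df["Age"].fillna(group_median).fillna(overall_median)

print(df)
```

The idea is simply that a first-class woman and a third-class man are unlikely to share the same typical age, so a single global mean or median throws away information that the other columns already contain.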
I used RandomForestClassifier to make my predictions.
The scores I found with 3-fold cross-validation were very different from the ones I obtained on submission.
I obtained better results with 10-fold cross-validation.
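The comparison between the two settings can be sketched as follows. The data are synthetic (make_classification), and n_estimators=100 and random_state=0 are arbitrary values picked for the example, so the numbers printed here say nothing about the Titanic leaderboard.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training features and labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Each call returns one accuracy value per fold.
scores_3 = cross_val_score(clf, X, y, cv=3)
scores_10 = cross_val_score(clf, X, y, cv=10)

print("3-fold : mean %.3f, std %.3f" % (scores_3.mean(), scores_3.std()))
print("10-fold: mean %.3f, std %.3f" % (scores_10.mean(), scores_10.std()))
```

With 10 folds each model is trained on 90% of the data instead of about 67%, which is closer to what happens when you train on the full set before submitting.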
I am now in 324th place out of 2652, with a score of 0.81340
…but I’m still trying to reach a better result!