How I reached the first position in the leaderboard.


 

This was my first competition on Kaggle:
http://www.kaggle.com/

“Data Science London is hosting a meetup on Scikit-learn. This competition is a practice ground for trying, sharing, and creating examples of sklearn’s classification abilities (if this turns in to something useful, we can follow it up with regression, or more complex classification problems).

Scikit-learn (sklearn) is an established, open-source machine learning library, written in Python with the help of NumPy, SciPy and Cython.

Scikit-learn is very user friendly, has a consistent API, and provides extensive documentation. Its implementation is high quality due to strict coding standards and high test coverage. Behind sklearn is a very active community, which is steadily improving the library.”

You can read the full description on the Kaggle website.

 

It helped me a lot in understanding how these kinds of problems and competitions work.

I started using scikit-learn and quickly became hooked on this library! It is very well written and documented, with a lot of examples that helped me understand classification problems.

I started playing with the data, while also reading the various entries in the forum.

My first attempts were with PCA, followed by SVM. I used GridSearch to find the best parameter values.

I got good results, reaching more than 90% accuracy.
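
For readers who want to try something similar, here is a minimal sketch of PCA followed by an SVM, tuned with a grid search. It is an illustration only, not the code used for the submissions: the synthetic data, the parameter grid, and the number of folds are all assumptions, and it uses `GridSearchCV` from the current `sklearn.model_selection` module.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumption: synthetic stand-in data, not the competition data set.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=10, random_state=0
)

# PCA reduces the dimensionality, then an RBF-kernel SVM classifies.
pipeline = Pipeline([
    ("pca", PCA()),
    ("svm", SVC(kernel="rbf")),
])

# Assumption: an arbitrary small grid, chosen only to keep the example fast.
param_grid = {
    "pca__n_components": [8, 12, 16],
    "svm__C": [1, 10],
    "svm__gamma": ["scale", 0.01],
}

# Try every parameter combination with 3-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Putting PCA inside the pipeline means it is refitted on each training fold, so the cross-validated score is not inflated by information from the held-out fold.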

Then I used the QDA classifier in a pipeline, preceded by a kernel approximation step. This gave better results, though still short of 99% accuracy.
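
A pipeline of this kind could look roughly like the sketch below. The post does not say which kernel approximation was used, so `Nystroem` with an RBF kernel is an assumption here, as are the scaling step, the `reg_param` value, and the synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data, not the competition data set.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=10, random_state=0
)

model = make_pipeline(
    StandardScaler(),
    # Assumption: Nystroem is one of several kernel approximations in sklearn.
    Nystroem(kernel="rbf", n_components=30, random_state=0),
    # A little regularization keeps the per-class covariance estimates stable.
    QuadraticDiscriminantAnalysis(reg_param=0.1),
)

# 5-fold cross-validated accuracy of the whole pipeline.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Mean accuracy:", scores.mean())
```

`RBFSampler` from the same module would be a drop-in alternative for the kernel approximation step.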

I got the best results by preprocessing the data with the GMM algorithm, as was also suggested in the Kaggle forum.

That kept me in the top 10 positions for a long time.
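
One way such GMM preprocessing might work is to fit a Gaussian mixture on the features without using the labels, and then replace each sample with its posterior probabilities over the mixture components. This is a sketch under that assumption: the post does not give the details, and the number of components, the covariance type, and the synthetic data are illustrative choices. It uses `GaussianMixture`, the current scikit-learn class for this model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data, not the competition data set.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the mixture on the training features only; the labels are not used.
# Assumption: 4 components with full covariance matrices.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(X_train)

# New features: the probability that each sample belongs to each component.
X_train_gmm = gmm.predict_proba(X_train)
X_test_gmm = gmm.predict_proba(X_test)

print(X_train_gmm.shape, X_test_gmm.shape)
```

The transformed arrays can then be passed to any classifier in place of the raw features.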

In the meantime I also started working on the Titanic competition, where I learned much more about other classifiers, such as Random Forest. I also read the Best Practices description and started organizing my Titanic code following those practices.

Yesterday I realized that the London scikit-learn competition was ending, so I decided to make a last submission.

I decided to try RF (which I had used in the Titanic competition) as the classifier and, surprisingly, reached first position on the leaderboard with 0.99255 accuracy!

I have not yet had time to organize a Python notebook with my latest code. I will do so in the next few days and share it in the forum, if I am allowed to.
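
For reference, a Random Forest classifier can be evaluated in a few lines. This is only a sketch: the number of trees, the cross-validation setup, and the synthetic data are assumptions, so it will not reproduce the leaderboard score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: synthetic stand-in data, not the competition data set.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=10, random_state=0
)

# Assumption: 200 trees; random_state makes the run repeatable.
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validated accuracy.
scores = cross_val_score(forest, X, y, cv=5, scoring="accuracy")
print("Mean accuracy:", scores.mean())
```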

If you want to try my method, good luck!

…and Happy 2015!