Apache Spark for Kaggle competitions


I competed in the Kaggle Bosch competition, whose goal was to predict failures on production lines. As described in another post, I decided to approach the competition with Apache Spark so that I could handle the size of the data.

It was my first experience with Spark, so I spent a lot of time setting up the environment, tuning the configuration, and learning how to read and write CSV files and how to work with DataFrames.

I started with the Spark machine learning library through the Python API, then moved to Scala to integrate the xgboost4j package for Spark. Like many others in the competition, I only started getting good scores after the leak, or “data property”, was made public by the Kaggler Faron.

Thanks also to the Kaggler CPMP for his notebook on selecting the best threshold.
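For readers new to the idea, here is a minimal sketch of what threshold selection means. It is not CPMP's notebook code. It assumes the score being optimised is the Matthews correlation coefficient (MCC), the metric used in the Bosch competition, and the function names `mcc` and `best_threshold` and the toy data are made up for illustration. The idea is to sort the predicted probabilities, sweep the cut-off from high to low, and keep the cut-off with the best score.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any margin of the confusion matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(y_true, y_prob):
    """Return (threshold, score): predict 1 when prob >= threshold, maximising MCC."""
    pairs = sorted(zip(y_prob, y_true), reverse=True)  # highest probability first
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = fp = 0
    best_t, best_score = 1.0, 0.0
    for i, (prob, label) in enumerate(pairs):
        # Lowering the cut-off to `prob` turns this row into a positive prediction.
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Rows sharing the same probability must be cut together.
        if i + 1 < len(pairs) and pairs[i + 1][0] == prob:
            continue
        score = mcc(tp, fp, n_neg - fp, n_pos - tp)
        if score > best_score:
            best_t, best_score = prob, score
    return best_t, best_score


# Toy data, invented for this example only.
labels = [0, 0, 1, 0, 1, 1]
probs = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
threshold, score = best_threshold(labels, probs)
print(threshold, round(score, 4))
```

In practice the threshold should be chosen on out-of-fold predictions rather than on the data the model was trained on, otherwise the selected cut-off tends to be optimistic.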

The code

Since I had no time to work on the leak and on feature engineering, I used only the plain features, or a selection of them plus the magic features.

I finished the competition in 114th position on the private leaderboard, moving up 2 positions (at least I did not overfit 🙂).

Anyhow, I want to share the code I used, both the Python and the Scala xgboost versions, which may be of interest to anyone who wants to get started with Apache Spark.

You can find the code in my GitHub repository: