Apache Spark for Kaggle competitions
I competed in the Kaggle Bosch competition, whose goal was to predict failures on production lines. As described in another post, I decided to approach this competition with Apache Spark so that I could handle the large volume of data.
It was my first experience with Spark, so I spent a lot of time setting up the environment, tuning the configuration, and learning how to read and write CSV files and how to work with DataFrames.
I started with the Spark Machine Learning library through the Python API, then moved to Scala to integrate the xgboost4j package for Spark. Like many others in the competition, I started getting good scores after the Kaggler Faron made the leak, or “data property”, public.
Thanks also to the Kaggler CPMP for his notebook on selecting the best threshold.
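For readers new to the idea, here is a minimal sketch of threshold selection. It is not CPMP's notebook code, just an illustration under my own assumptions: the competition was scored with the Matthews correlation coefficient (MCC), and the helper names `mcc` and `best_threshold` are hypothetical. The model outputs a failure probability per part, and we look for the cut-off that gives the highest MCC on labelled data.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(y_true, y_prob):
    """Return (threshold, score) maximizing MCC for the rule prob >= threshold.

    Hypothetical helper: scans every distinct predicted probability,
    from highest to lowest, as a candidate threshold.
    """
    pairs = sorted(zip(y_prob, y_true), reverse=True)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = fp = 0
    # Fallback: an infinite threshold predicts no failures at all (MCC 0).
    best_t, best_score = float("inf"), 0.0
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        # Move every sample with this probability to the positive side.
        while i < len(pairs) and pairs[i][0] == t:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        score = mcc(tp, n_neg - fp, fp, n_pos - tp)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


if __name__ == "__main__":
    # Toy labels and probabilities, made up for illustration only.
    labels = [0, 0, 1, 1]
    probs = [0.1, 0.2, 0.8, 0.9]
    print(best_threshold(labels, probs))
```

In practice the threshold should be chosen on out-of-fold predictions rather than on the data the model was trained on, otherwise the selected cut-off tends to be too optimistic.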
Since I had no time to work on the leak and on feature engineering, I used only the plain features, or a selection of them plus the magic features.
I finished the competition in 114th place on the private leaderboard, moving up 2 positions (at least I did not overfit 🙂 ).
Anyhow, I want to share the code I used, both the Python and the Scala xgboost versions, which may be of interest to anyone who wants to get started with Apache Spark.
You can find the code in my GitHub repository: