
Kaggle Bosch competition using Apache Spark

Apache Spark for Kaggle competitions


I competed in the Kaggle Bosch competition, whose goal was to predict failures on production lines. As described in another post, I decided to approach this competition with Apache Spark so that I could handle the size of the data.

This was my first experience with Spark, so I spent a lot of time setting up the environment, tuning the configuration, learning how to read and write CSV files, and getting used to DataFrame containers.

I started with the Spark machine learning library through the Python API, then moved to Scala to integrate the xgboost4j package for Spark. Like many others in the competition, I started getting good scores only after the leak, or “data property”, was made public by the Kaggler Faron.

Thanks also to the Kaggler CPMP for his notebook on best threshold selection.
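For readers new to the idea, here is a minimal sketch of what threshold selection can look like. It is not CPMP's notebook and not the code from my repository. It assumes the predictions are scored with the Matthews correlation coefficient (MCC), and the function names `mcc` and `best_threshold` are made up for this illustration. The labels are expected to be 0 or 1, and the input must not be empty.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(y_true, y_prob):
    """Return (threshold, score) maximizing MCC.

    A sample is predicted positive when its probability is >= threshold.
    Hypothetical helper written for illustration only.
    """
    # Walk through the samples from the highest probability to the lowest,
    # so that each step lowers the threshold and adds one predicted positive.
    pairs = sorted(zip(y_prob, y_true), key=lambda p: -p[0])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos

    tp = fp = 0
    best_t, best_score = None, float("-inf")
    for i, (prob, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Samples with the same probability fall on the same side of any
        # threshold, so only score once the whole tie group is counted.
        if i + 1 < len(pairs) and pairs[i + 1][0] == prob:
            continue
        score = mcc(tp, n_neg - fp, fp, n_pos - tp)
        if score > best_score:
            best_t, best_score = prob, score
    return best_t, best_score


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    probabilities = [0.1, 0.2, 0.7, 0.9]
    print(best_threshold(labels, probabilities))
```

The threshold should be chosen on out-of-fold or validation predictions and then applied to the test predictions. Sorting once and updating the counts incrementally keeps the search at O(n log n), which matters when the training set has more than a million rows.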

The code

Since I had no time to work on the leakage or on feature engineering, I used only the plain features, or a selection of them plus the magic features.

I finished the competition in 114th position on the private leaderboard, moving up 2 positions (at least I did not overfit 🙂).

Anyhow, I want to share the code I used, both the Python and the Scala-xgboost versions, which may be of interest to anyone who wants to get started with Apache Spark.

You can find the code in my GitHub repository:




Hi, I am interested in your contribution. I am trying to run your code, but I could not find two files, magic.csv and happy.csv. Could you tell me how to get them? Thank you!

Pavan A

Nice to see someone trying to solve Kaggle problems in Spark. Were you able to upload the code to Kaggle? The solutions I see submitted are in either Python or R. Hardly anyone attempts them in Spark, I guess.