Kaggle Bosch competition using Apache Spark

Apache Spark for Kaggle competitions

I competed in the Kaggle Bosch competition, whose goal was to predict failures on the production line. As described in another post, I decided to approach this competition with Apache Spark so that I could handle the sheer size of the data.

It was my first experience with Spark, so I spent a lot of time setting up the environment, tuning the configuration, and understanding how to read and write CSV files and how to use DataFrame containers.

I started with the Spark Machine Learning library through the Python API, then moved to Scala to integrate the xgboost4j package for Spark. Like many others in the competition, I started getting good scores after the leak, or “data property”, was made public by the Kaggler Faron.

Thanks also to the Kaggler CPMP for his notebook on best threshold selection.
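To illustrate what threshold selection means here: the competition was scored with the Matthews correlation coefficient (MCC), so predicted probabilities have to be cut at some threshold before scoring. The following is only a minimal sketch in plain Python, not CPMP's notebook code; the function names `mcc` and `best_mcc_threshold` are my own, and I assume binary labels (0/1) with one predicted probability per row.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when the coefficient is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_mcc_threshold(y_true, y_prob):
    """Return (threshold, score) maximizing MCC for the rule prob >= threshold.

    Rows are sorted once by decreasing probability, then the confusion
    matrix is updated incrementally while the threshold is lowered.
    """
    pairs = sorted(zip(y_prob, y_true), reverse=True)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = fp = 0
    # Start from "predict no positives", which has MCC 0 by the convention above.
    best_thr, best_score = float("inf"), 0.0
    for i, (prob, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Only score once all rows sharing this probability are counted.
        if i + 1 < len(pairs) and pairs[i + 1][0] == prob:
            continue
        score = mcc(tp, n_neg - fp, fp, n_pos - tp)
        if score > best_score:
            best_thr, best_score = prob, score
    return best_thr, best_score
```

In practice the threshold would be chosen on out-of-fold predictions and then applied to the test predictions; for example, `best_mcc_threshold(labels, oof_probs)` with hypothetical lists `labels` and `oof_probs`.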

The code

Since I had no time to work on the leakage and on feature engineering, I used only the plain features, or a selection of them plus the magic features.

So I finished the competition in 114th position on the private leaderboard, moving up 2 positions (at least I did not overfit 🙂).

Anyhow, I want to share the code I used, both the Python and the Scala-xgboost versions, which may be of interest to anyone who wants to get started with Apache Spark.

You can find the code in my Github repository:

Bosch Kaggle competition: Reduce manufacturing failures (https://www.kaggle.com/c/bosch-production-line-performance)
https://github.com/elenacuoco/bosch-kaggle-competition-spark

3 Comments

Ivan

May 6, 2017 5:32 pm

Hi, I am interested in your contribution. I am trying to run your code, but I could not find two files, magic.csv and happy.csv. Could you tell me how to get them? Thank you!

Pavan A

April 1, 2017 3:42 am

Hi,
Nice to see someone trying to solve Kaggle problems in Spark. Were you able to upload the code to Kaggle? The solutions I see submitted are in either Python or R. Hardly anyone attempts them in Spark, I guess.

Author

Elena Cuoco

April 1, 2017 6:10 am

Reply to Pavan A

Hi,
I was only able to post the link to the code in the Kaggle forum. There is no way to run the Spark code in the script tool.
