Apache Spark for Kaggle competitions
I competed in the Kaggle Bosch competition, whose goal was to predict failures on production lines. As described in another post, I decided to approach this competition with Apache Spark so that I could handle the large volume of data.
It was my first experience with Spark, so I spent a lot of time setting up the environment, tuning the configuration, and understanding how to read and write CSV files and how to use DataFrame containers.
I started with the Spark machine learning library through the Python API, then moved to Scala to integrate the xgboost4j package for Spark. Like many others in the competition, I started getting good scores after the leak, or “data property”, was made public by the Kaggler Faron.
Thanks also to the Kaggler CPMP for his notebook on best threshold selection.
The code
Since I had no time to work on the leakage and feature engineering, I used only the plain features, or a selection of them plus the magic features.
I finished the competition in 114th position on the private leaderboard, moving up 2 positions (at least I did not overfit 🙂).
Anyhow, I want to share the code I used, both the Python and the Scala-xgboost versions, which may be of interest to anyone who wants to get started with Apache Spark.
You can find the code in my GitHub repository:
Hi, I am interested in your contribution. I am trying to run your code, but I could not find two files, magic.csv and happy.csv. Could you tell me how to get them? Thank you!
Hi,
Nice to see someone trying to solve Kaggle problems in Spark. Were you able to upload the code to Kaggle? The solutions I see submitted are in either Python or R. Hardly anyone attempts them in Spark, I guess.
Hi,
I was only able to put a link to the code in the Kaggle forum. There is no way to run Spark code in the script tool.