Spark and XGBoost using Scala
I wanted to run XGBoost on a large data set. Unfortunately, the integration of XGBoost and PySpark has not been released yet, so I had to do this integration in Scala.
In this post I just report the Scala code that is useful for running Spark and XGBoost together.
In a later post I’m going to show the software setup and the integration of this project in the IntelliJ IDEA Community Edition IDE.
Now, let’s code!
First, we need to open a Spark session.
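A minimal sketch of opening the session. The application name and the local master are placeholders chosen for illustration; on a cluster the master is normally supplied by spark-submit instead of being hard-coded.

```scala
import org.apache.spark.sql.SparkSession

// Build a new session, or reuse one that is already running.
// "spark-xgboost-bosch" and "local[*]" are illustrative values.
val spark: SparkSession = SparkSession.builder()
  .appName("spark-xgboost-bosch")
  .master("local[*]")
  .getOrCreate()
```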
We need to prepare the directory where we will write our results, which will be saved in Parquet format.
I found a way to create the new results directory automatically.
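One possible way to do this is to derive the directory name from a timestamp, so that each run gets its own folder. The helper name `newResultsDir`, the `run_` prefix and the timestamp pattern are assumptions made for this sketch, and it only covers a local filesystem (not HDFS).

```scala
import java.nio.file.{Files, Path, Paths}
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Create <baseDir>/run_<yyyyMMdd_HHmmss> (including missing parents)
// and return its path. `now` is a parameter so the name is reproducible.
def newResultsDir(baseDir: String, now: LocalDateTime = LocalDateTime.now()): Path = {
  val stamp = now.format(DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss"))
  Files.createDirectories(Paths.get(baseDir, s"run_$stamp"))
}
```

Calling `newResultsDir("results")` then returns a path that can be passed on to the Parquet writers.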
The data I’m using for testing come from the Kaggle Bosch competition.
So, we will read the data into Spark DataFrames.
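A sketch of the reading step, assuming the competition files are CSV files with a header row. The helper name `readCsv` is made up for this example, and the path is left to the caller because the file locations are not given here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Read a CSV file with a header row into a DataFrame.
// inferSchema makes Spark scan the file to guess column types,
// which costs an extra pass over the data.
def readCsv(spark: SparkSession, path: String): DataFrame =
  spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path)
```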
Now we can fill the NaN values and, if we want, sample the training data.
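Both operations are available directly on a DataFrame. In the sketch below the helper name, the fill value, the sampling fraction and the seed are all assumptions to be chosen by the caller; `na.fill` with a numeric value replaces null and NaN entries in numeric columns only.

```scala
import org.apache.spark.sql.DataFrame

// Replace null/NaN in numeric columns with `fillValue`, then keep
// roughly `fraction` of the rows (sampling without replacement).
def fillAndSample(df: DataFrame, fillValue: Double, fraction: Double, seed: Long): DataFrame =
  df.na.fill(fillValue)
    .sample(withReplacement = false, fraction = fraction, seed = seed)
```

The sample size is only approximate, since every row is kept independently with probability `fraction`.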
Finally, we are ready to assemble the features and prepare the train and test sets for XGBoost.
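These two steps can be sketched with Spark ML's `VectorAssembler` and `randomSplit`. The helper names, the output column name `features`, and the idea of holding out a validation part of the labelled data are assumptions of this example rather than a record of the original code.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Pack every column except the excluded ones (e.g. id and label)
// into a single vector column called "features".
def assembleFeatures(df: DataFrame, excluded: Set[String]): DataFrame = {
  val featureCols = df.columns.filterNot(excluded.contains)
  new VectorAssembler()
    .setInputCols(featureCols)
    .setOutputCol("features")
    .transform(df)
}

// Randomly split labelled data into (train, validation).
def trainValidationSplit(df: DataFrame, trainFraction: Double, seed: Long): (DataFrame, DataFrame) = {
  val parts = df.randomSplit(Array(trainFraction, 1.0 - trainFraction), seed)
  (parts(0), parts(1))
}
```

By default `VectorAssembler` fails on null inputs, which is one reason to fill missing values before this step.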
Now, we are ready to prepare the parameter map for XGBoost and run it on our data sets:
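A hedged sketch of this step, assuming an XGBoost4J-Spark release that provides `XGBoostClassifier`; older releases expose a different training entry point, so the calls may need adapting. The parameter values are placeholders rather than tuned settings, and the column names `features` and `Response` are assumptions about how the data was prepared.

```scala
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}
import org.apache.spark.sql.DataFrame

// Illustrative parameter map; every value here is a placeholder.
val xgbParams: Map[String, Any] = Map(
  "objective"   -> "binary:logistic", // binary classification
  "eta"         -> 0.1,               // learning rate
  "max_depth"   -> 6,                 // maximum tree depth
  "num_round"   -> 100,               // number of boosting rounds
  "num_workers" -> 2                  // parallel XGBoost workers
)

// Fit a classifier on a DataFrame that has a "features" vector column
// and a numeric "Response" label column (assumed names).
def trainXgb(train: DataFrame, params: Map[String, Any]): XGBoostClassificationModel =
  new XGBoostClassifier(params)
    .setFeaturesCol("features")
    .setLabelCol("Response")
    .fit(train)
```

The returned model is a regular Spark ML transformer, so `model.transform(df)` adds the prediction columns to a DataFrame.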
We store the predictions on the validation set, since we will use them later to optimise for this competition’s metric.
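One way to store them, sketched under the assumption that the scored DataFrame has a `probability` vector column, as Spark ML classifiers produce. The helper name `saveScores` and the output column `score` are made up for this example. Extracting the positive-class probability as a plain double keeps the Parquet file easy to read outside Spark.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{col, udf}

// Pull P(class = 1) out of the probability vector.
val positiveProb = udf((v: Vector) => v(1))

// Keep the columns in `keep` plus a double "score" column and
// write the result as Parquet, replacing any previous output.
def saveScores(predictions: DataFrame, keep: Seq[String], path: String): Unit =
  predictions
    .withColumn("score", positiveProb(col("probability")))
    .select((keep :+ "score").map(col): _*)
    .write.mode(SaveMode.Overwrite).parquet(path)
```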
Finally, we are ready to compute our predictions on the test data set and save them to disk.
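A minimal sketch of scoring and saving, written against the generic Spark ML `Transformer` interface so that it works with any fitted model. The helper name and the column names `Id` and `prediction` are assumptions.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.{DataFrame, SaveMode}

// Score the test set, keep the id and the predicted class,
// write them as Parquet and return the scored DataFrame.
def predictAndSave(model: Transformer, test: DataFrame, path: String): DataFrame = {
  val scored = model.transform(test).select("Id", "prediction")
  scored.write.mode(SaveMode.Overwrite).parquet(path)
  scored
}
```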
Later, offline, we can read the data from disk, convert it to Pandas format (I used a Python script to do this), and prepare the submission file.
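For readers who prefer to stay in Scala, the same step can be sketched with Spark itself; this is an alternative to the Python and Pandas route, not a reproduction of it. It assumes the saved predictions have `Id` and `prediction` columns and that the submission expects the columns `Id` and `Response`. The helper name is hypothetical.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

// Read the saved predictions and write them as one CSV part file
// with a header. Note that Spark writes a directory, not a file.
def writeSubmission(spark: SparkSession, predictionsPath: String, outDir: String): Unit =
  spark.read.parquet(predictionsPath)
    .select(col("Id"), col("prediction").cast("int").as("Response"))
    .coalesce(1)
    .write.mode(SaveMode.Overwrite)
    .option("header", "true")
    .csv(outDir)
```

The actual CSV is the single `part-*` file inside `outDir`, which can be renamed before uploading.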