Using PySpark for RedHat Kaggle competition
The RedHat Kaggle competition is not especially demanding in terms of computation or data management.
First of all, merging several DataFrames in PySpark is not as efficient as in pandas, and I do not fully trust the code I wrote to merge two DataFrames that share column names.
So a first implementation was to merge the data frames with the utilities of pandas, write the result to disk, and then analyse the merged CSV file in Spark.
This was the code; note the highlighted lines from 45 to 51.
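Since the original snippet is not reproduced here, the following is a minimal sketch of the pandas-side merge. The column names (`people_id`, `char_1`, `outcome`) and the idea of joining an activities table onto a people table follow the competition's data layout, but the inline frames below are illustrative assumptions, not the real files:

```python
import pandas as pd

# Tiny stand-ins for the competition's people and activities tables.
people = pd.DataFrame({
    "people_id": ["p1", "p2"],
    "char_1": ["type 1", "type 2"],
})
activities = pd.DataFrame({
    "people_id": ["p1", "p1", "p2"],
    "char_1": ["a", "b", "c"],
    "outcome": [0, 1, 0],
})

# Merge on the shared key; suffixes disambiguate columns that exist in
# both frames (e.g. char_1), which is exactly the case that made the
# PySpark join awkward.
merged = activities.merge(people, on="people_id", suffixes=("_act", "_ppl"))

# Write the merged frame to disk so Spark can read it back afterwards.
merged.to_csv("merged.csv", index=False)
```

Spark then only has to read one flat CSV file instead of performing the join itself.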
Since I had many difficulties understanding how to prepare the data in the required format when dealing with non-numeric features, I want to show how I solved the problem: I prepared a new schema for the DataFrame, declaring as StringType the features I had selected in a list.
This is necessary before using the hashing feature extraction of Spark's ML library.
This code achieved a score of ~0.95 on the public leaderboard and is definitely not optimized, so there may still be room for a better score.