Recently, the XGBoost project released a package on GitHub that includes interfaces for Scala, Java, and Spark (more info at this link).
I wanted to run XGBoost on a large data set. Unfortunately, the integration of XGBoost with PySpark has not been released yet, so I had to do this integration in Scala.
In this post I report the Scala code that can be useful to run XGBoost on Spark.
First, we need to prepare the directory where we will write our results, which will be saved in Parquet format. The following snippet automatically creates a time-stamped directory name for the results:
// build a time-stamped output directory name, e.g. ./results/2017-03-10-15-42/
import java.util.Calendar

val now = Calendar.getInstance()
val date = java.time.LocalDate.now
val currentHour = now.get(Calendar.HOUR_OF_DAY)
val currentMinute = now.get(Calendar.MINUTE)
val direct = "./results/" + date + "-" + currentHour + "-" + currentMinute + "/"
println(direct)
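As an aside, the same time-stamped name can be built with java.time alone; a minimal sketch (the stamp pattern and the directAlt name are my own choices, not from the original code):
// equivalent directory name using java.time only
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val stamp = LocalDateTime.now.format(DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm"))
val directAlt = "./results/" + stamp + "/"
println(directAlt)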
Next, we read the data into Spark DataFrames:
// read the data from disk
val dataset = spark.read.option("header", "true").option("inferSchema", true).csv(inputPath + "/input/train_numeric.csv")
val datatest = spark.read.option("header", "true").option("inferSchema", true).csv(inputPath + "/input/test_numeric.csv")
dataset.cache()
datatest.cache()
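Before moving on, it can be worth a quick sanity check that the CSVs were parsed as expected; a small optional snippet, not in the original post:
// optional sanity check: schema and row counts
dataset.printSchema()
println("train rows: " + dataset.count() + ", test rows: " + datatest.count())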
Now we can fill the missing values with zero and, if we want, subsample the training data:
// fill NA with 0 and subsample the training set
// sample(withReplacement = true, fraction = 0.7, seed = 10)
val df = dataset.na.fill(0).sample(true, 0.7, 10)
val df_test = datatest.na.fill(0)
We are now ready to assemble the features:
// prepare the data for ML: every column except Id and the Response target is a feature
import org.apache.spark.ml.feature.VectorAssembler

val header = df.columns.filter(!_.contains("Id")).filter(!_.contains("Response"))
val assembler = new VectorAssembler()
  .setInputCols(header)
  .setOutputCol("features")
val train_DF0 = assembler.transform(df)
val test_DF0 = assembler.transform(df_test)
println("VectorAssembler Done!")
and prepare the train and test sets for XGBoost:
import org.apache.spark.sql.functions.lit

// cast the Response target to double for training; the test set gets a dummy label
val train = train_DF0.withColumn("label", df("Response").cast("double")).select("label", "features")
val test = test_DF0.withColumn("label", lit(1.0)).withColumnRenamed("Id", "id").select("id", "label", "features")
// split the data into training and validation sets (30% held out for validation)
val Array(trainingData, testData) = train.randomSplit(Array(0.7, 0.3), seed = 0)
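Since the positive class in this data set is rare (which is why the parameter map below uses a small base_score), it can help to inspect the class balance of the training split; an optional check, not part of the original code:
// optional: count rows per label to see the class imbalance
trainingData.groupBy("label").count().show()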
Now we are ready to prepare the parameter map for XGBoost and train it on our data sets:
// number of boosting iterations and number of Spark workers
val numRound = 10
val numWorkers = 4
// training parameters
val paramMap = List(
  "eta" -> 0.023f,
  "max_depth" -> 10,
  "min_child_weight" -> 3.0,
  "subsample" -> 1.0,
  "colsample_bytree" -> 0.82,
  "colsample_bylevel" -> 0.9,
  "base_score" -> 0.005,
  "eval_metric" -> "auc",
  "seed" -> 49,
  "silent" -> 1,
  "objective" -> "binary:logistic").toMap
println("Starting Xgboost ")
val xgBoostModelWithDF = XGBoost.trainWithDataFrame(trainingData, paramMap,round = numRound, nWorkers = numWorkers, useExternalMemory = true)
val predictions = xgBoostModelWithDF.setExternalMemory(true).transform(testData).select("label", "probabilities")
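Since eval_metric is set to AUC, we can also compute the validation AUC directly in Spark; a minimal sketch assuming the standard BinaryClassificationEvaluator, which accepts a probability vector as the raw-prediction column (this step is not in the original post):
// optional: score the validation predictions
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("probabilities")
  .setMetricName("areaUnderROC")
println("Validation AUC: " + evaluator.evaluate(predictions))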
We store the predictions on the validation set, since we will use them to tune the metric for this competition.
// DataFrames can be saved as Parquet files, maintaining the schema information
predictions.write.save(direct + "preds.parquet")
Finally, we are ready to compute our predictions on the test set and save them to disk:
// predictions on the test set for the submission file
val submission = xgBoostModelWithDF.setExternalMemory(true).transform(test).select("id", "probabilities")
submission.show(10)
submission.write.save(direct + "submission.parquet")
spark.stop()
Later, offline, we can read the results back from disk and convert them to pandas format (I used a Python script to do this).