Build and install Spark on a Linux platform
Here is a short guide to building, installing, and configuring Apache Spark on a Linux platform.
You can decide which Spark release to install in your environment. I started using Spark from version 2.0, and rather than using the pre-compiled releases, I compiled and configured it myself.
I also preferred to use the version from the GitHub master development branch, but you can use any branch from GitHub.
So, choose the main path where you want to set your SPARK_HOME and clone your preferred release there.
Check out the master branch from Git:

# Master development branch
git clone git://github.com/apache/spark.git

or

# 2.0 maintenance branch with stability fixes on top of Spark 2.0.1
git clone git://github.com/apache/spark.git -b branch-2.0
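For example, a minimal sketch assuming you want /home/spark as your installation path (the same path used for SPARK_HOME later in this guide) and that you have write access to /home:

cd /home
# Clone the master branch into /home/spark
git clone git://github.com/apache/spark.git spark
cd spark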
All the information needed to build Spark from scratch is available at this link.
Once you have cloned your Spark release, go into your Spark directory and use the following line to compile and install it:
./build/mvn -DskipTests clean package install
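If you need Hive support or want to build against a specific Hadoop version, the Spark build exposes Maven profiles for that. A sketch, where the Hadoop version 2.7.3 is just an example and should be adapted to your cluster:

# Build with Hive and YARN support against Hadoop 2.7.3 (example version)
./build/mvn -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests clean package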
First of all, you can configure your environment variables in your .bashrc so that they point to the Java, Scala, and Python APIs of Spark.
##################Env for Spark######################
export SPARK_HOME="/home/spark"
export PATH=${SPARK_HOME}/bin:$PATH

# Add Maven support
if [ -d "/home/spark/build/apache-maven-3.3.9" ] ; then
  export M2_HOME=/home/spark/build/apache-maven-3.3.9
  export M2=$M2_HOME/bin
  export MAVEN_OPTS="-Xmx8G -XX:MaxPermSize=1G"
  export PATH=$M2:$PATH
fi

export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
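After editing your .bashrc, reload it and check that the Spark binaries are picked up, for example:

# Reload the shell configuration
source ~/.bashrc
# Should print /home/spark
echo $SPARK_HOME
# Prints the version of the Spark build you just compiled
spark-submit --version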
Specific configuration for the Spark environment
All the Spark configuration can be set up under the spark/conf directory of your installation.
You can adapt these two files, spark-defaults.conf and spark-env.sh, starting from the .template versions you will find in that directory.
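For example, from your installation directory:

cd $SPARK_HOME/conf
# Create editable copies of the two template files
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh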
For example, you can put in spark-defaults.conf some of the configuration you will use in your Spark session.
# Example:
spark.eventLog.enabled   true
spark.eventLog.dir       /home/spark-log
spark.local.dir          /home/spark-tmp
spark.executor.memory    4G
spark.driver.memory      4G
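Spark will complain at startup if the event log directory does not exist, so it is worth creating these directories up front (paths taken from the example above):

mkdir -p /home/spark-log /home/spark-tmp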
In spark-env.sh, instead, you can set up all the environment variables you need for your specific installation and runtime parameters.
If needed, do not forget to source this file.
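As a sketch of what spark-env.sh might contain (the JAVA_HOME path and the worker sizes are assumptions, adapt them to your machine):

# Hypothetical example, adjust paths and sizes to your system
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# Memory and cores available to Spark workers on this node
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=4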
If you want to try my XGBoost and Spark code, refer to the xgboost installation guide.
Happy sparking!