Build and install Spark on a Linux platform

Here is a short guide to building, installing and configuring Apache Spark on a Linux platform.

You can decide which Spark release to install in your environment. I started using Spark from version 2.0, and instead of the pre-compiled releases I compiled and configured it myself.

Besides that, I preferred to use the version from the GitHub master development branch, but you can use any branch from GitHub.

So, choose the main path where you want your SPARK_HOME to live and clone your preferred release there:

# Master development branch
git clone https://github.com/apache/spark.git

or

# 2.0 maintenance branch with stability fixes on top of Spark 2.0.1
git clone https://github.com/apache/spark.git -b branch-2.0

All the information needed to build Spark from scratch is available in the official documentation at https://spark.apache.org/docs/latest/building-spark.html

Once you have cloned your Spark release, go into the Spark directory and use the following command to compile and install it:

./build/mvn -DskipTests clean package install
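
If you need YARN or Hive support, you can enable the corresponding Maven profiles at build time. A sketch, assuming a Spark 2.x tree built against Hadoop 2.7 (adapt the versions to your cluster):

# Build with YARN, Hive and the JDBC Thrift server enabled
./build/mvn -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests clean package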

Next, you can configure the environment variables in your .bashrc so that your shell can find Spark and its Java, Scala and Python APIs.

################## Env for Spark ######################
export SPARK_HOME="/home/spark"
export PATH=${SPARK_HOME}/bin:$PATH

# Add Maven support
if [ -d "/home/spark/build/apache-maven-3.3.9" ] ; then
 export M2_HOME=/home/spark/build/apache-maven-3.3.9
 export M2=$M2_HOME/bin
 # Note: -XX:MaxPermSize is honored only up to Java 7; Java 8+ ignores it
 export MAVEN_OPTS="-Xmx8G -XX:MaxPermSize=1G"
 export PATH=$M2:$PATH
fi
# Make the PySpark and Py4J sources importable from plain Python
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
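
After editing .bashrc, reload it and run a quick sanity check; spark-submit prints a version banner if everything is wired up (assuming the build completed successfully):

source ~/.bashrc
echo $SPARK_HOME
spark-submit --version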

Specific configuration for the Spark environment

All the Spark configuration can be set up under the conf directory of your installation ($SPARK_HOME/conf).

You can adapt two files, spark-defaults.conf and spark-env.sh, starting from the .template versions you will find in that directory, as shown below.
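
A minimal way to create working copies from the shipped templates (assuming SPARK_HOME points at your installation):

cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh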

For example, you can put in spark-defaults.conf some of the configuration you will use in your Spark sessions.

# Example:

spark.eventLog.enabled    true
spark.eventLog.dir        /home/spark-log
spark.local.dir           /home/spark-tmp
spark.executor.memory     4G
spark.driver.memory       4G
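
Note that the directories referenced above must exist before you launch an application, and any property can still be overridden per job on the command line with --conf. A quick sketch:

mkdir -p /home/spark-log /home/spark-tmp
spark-shell --conf spark.eventLog.enabled=false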

In spark-env.sh, instead, you can set up all the environment variables you need for your specific installation and its parameters.

If needed, do not forget to source this file.
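
A few typical entries, as a sketch; the paths and sizes here are assumptions to adapt to your machine:

# JVM used by Spark (hypothetical path, adjust to your system)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# Python interpreter used by PySpark
export PYSPARK_PYTHON=python3
# Resources offered by a standalone worker
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=4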

If you want to try my XGBoost and Spark code, refer to the xgboost installation guide.

Happy sparking!