Apache Spark on HPC

Article author
Jillian Rowe

There is no ready-made solution for running Spark on HPC, but you can use Spark's Standalone Cluster mode to create your own cluster by submitting one job for the scheduler (master) and additional jobs for the workers.

Install Spark

First, bootstrap a conda env with Python, R, PySpark, PyArrow, and any other packages you need for your analysis.

# The Miniconda module name may differ on your cluster
module load Miniconda3
# Add r-base, pyarrow, and any other analysis packages you need
conda create -n spark -c conda-forge pyspark
source activate spark
echo $PATH

Download Spark

Now download Spark from the downloads page, extract the tarball, find the scripts start-master.sh and start-slave.sh (renamed start-worker.sh in Spark 3.1+) in its sbin/ directory, and add them to your conda env's $PATH.

# For Spark 3.0.1
wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
tar -xvf spark-3.0.1-bin-hadoop2.7.tgz
cp -rf spark-3.0.1-bin-hadoop2.7/sbin/* /path/to/conda/env/bin

Depending on what you're trying to do, you may need more of the Spark distribution than just the sbin scripts!

Submit the Scheduler

Apache Spark, Dask, and a few other similar technologies work by having a scheduler (master) process and worker (slave) processes. There are plenty of managed solutions, but on HPC you’ll have to roll your own using the steps described in the Spark Standalone Cluster documentation.

./start-master.sh

You’ll probably want to submit the master (scheduler) as a SLURM job.

#!/usr/bin/env bash
# submit-spark-master.sh

#SBATCH --time=1:00:00
#SBATCH --mem=2gb
#SBATCH --cpus-per-task=2

source activate spark
# If these ports are taken you may have to try others
start-master.sh --host 0.0.0.0 --port 3001 --webui-port 3002
# start-master.sh launches the master in the background and returns,
# so keep the batch job alive so SLURM does not clean the master up
sleep infinity

Submit

sbatch submit-spark-master.sh

Now get the hostname of the node the Spark master is running on.

squeue -u $USER

Wait until your submitted job reaches the 'R' (running) state, then note the node name in the NODELIST column.

Submit the Workers

You can submit a job array (#SBATCH --array=0-10) to start many workers in one go.

#!/usr/bin/env bash
# submit-spark-workers.sh

#SBATCH --time=1:00:00
#SBATCH --mem=2gb
#SBATCH --cpus-per-task=2
#SBATCH --array=0-10 

source activate spark

# Make sure to change these to the values you used!
export SPARK_MASTER_HOST="HOST_NAME_FROM_SQUEUE"
export SPARK_MASTER_PORT="3001"
export SPARK_MASTER_WEBUI_PORT="3002"

# start-slave.sh (start-worker.sh in Spark 3.1+) needs the master's spark:// URL as its first argument
start-slave.sh "spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT}"
# Keep the batch job alive so SLURM does not clean the worker up when the script exits
sleep infinity

Then you can run your Spark calculations by creating a Spark session (or context) that points at your scheduler (master).
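
Below is a minimal PySpark sketch of what that looks like, reusing the HOST_NAME_FROM_SQUEUE placeholder and port 3001 from above; the file name spark_example.py and the app name are just illustrative.

# spark_example.py
from pyspark.sql import SparkSession

# Point the session at the standalone master started above
spark = (
    SparkSession.builder
    .master("spark://HOST_NAME_FROM_SQUEUE:3001")
    .appName("hpc-spark-example")
    .getOrCreate()
)

# Quick sanity check: count a distributed range of numbers
print(spark.range(1000).count())

spark.stop()

You can run this from a login node, another SLURM job, or a Jupyter notebook, as long as that machine can reach the master's host and port.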

View the Apache Spark Web UI

Please see the documentation on Proxying Web Services in Jupyterhub in order to view the Spark Web UI (it will be listening on the web UI port you chose above, 3002 in this example).
