There are no ready-made solutions for running Spark on HPC, but you can use Spark’s Standalone mode to create your own cluster by submitting jobs for the scheduler and the workers.
First, bootstrap a conda env with Python, R, PySpark, PyArrow, and any other packages you need for your analysis.
```bash
module load Miniconda3
conda create -n spark -c conda-forge pyspark
source activate spark
echo $PATH
```
Now download Spark from the downloads page, extract it, and find the scripts start-master.sh and start-slave.sh (renamed start-worker.sh in Spark 3.1+) in the sbin directory. Add those to your conda env $PATH.
```bash
# For Spark 3.0.1
wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
tar -xvf spark-3.0.1-bin-hadoop2.7.tgz
cp -rf spark-3.0.1-bin-hadoop2.7/sbin/* /path/to/conda/env/bin
```
Depending on what you’re trying to do, you may need more of the distribution’s files!
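If you’d rather not copy scripts into the env, a minimal alternative is to put the Spark directories on your $PATH directly. A sketch, assuming you extracted the tarball into your home directory:

```bash
# Adjust SPARK_HOME to wherever you extracted the tarball
export SPARK_HOME="$HOME/spark-3.0.1-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"
```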
Submit the Scheduler
Apache Spark, Dask, and a few other similar technologies work by having a scheduler (master) process and worker (slave) processes. There are plenty of managed solutions, but for HPC you’ll have to roll your own using the steps described in Spark Standalone Cluster.
You’ll probably want to submit a SLURM job that runs the master (scheduler).
```bash
#!/usr/bin/env bash
# submit-spark-master.sh
#SBATCH --time=1:00:00
#SBATCH --mem=2gb
#SBATCH --cpus-per-task=2

source activate spark

# Keep the master in the foreground so the SLURM job doesn't exit immediately
export SPARK_NO_DAEMONIZE=true

# If this port is taken you may have to try others
start-master.sh --host 0.0.0.0 --port 3001 --webui-port 3002
```
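Submit it with `sbatch`:

```bash
sbatch submit-spark-master.sh
```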
OK, now get the hostname that the Spark master is running on, and wait until your submitted job is in the ‘R’ (running) state.
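For example, with `squeue` (the ST column shows the job state and NODELIST the hostname):

```bash
# Wait for ST to read "R"; NODELIST shows the node the master landed on
squeue -u $USER
```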
Submit the Workers
You can submit a job array, e.g. `#SBATCH --array=0-10`, to get many workers in one go.
```bash
#!/usr/bin/env bash
# submit-spark-workers.sh
#SBATCH --time=1:00:00
#SBATCH --mem=2gb
#SBATCH --cpus-per-task=2
#SBATCH --array=0-10

source activate spark

# Make sure to change these to the values you used!
export SPARK_MASTER_HOST="HOST_NAME_FROM_SQUEUE"
export SPARK_MASTER_PORT="3001"
export SPARK_MASTER_WEBUI_PORT="3002"

# Keep the worker in the foreground so the SLURM job doesn't exit immediately
export SPARK_NO_DAEMONIZE=true

# start-slave.sh takes the master URL as its argument
start-slave.sh "spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT}"
```
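Submit the array job; each array task starts one worker, and the workers should appear in the master’s web UI once they register:

```bash
sbatch submit-spark-workers.sh
```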
Then, you can submit jobs to do your Spark calculations by creating a Spark context that points to your scheduler.
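Here’s a minimal PySpark sketch; the hostname node123 is a placeholder, so substitute the one from `squeue`, along with the port you chose above:

```python
from pyspark.sql import SparkSession

# "node123" is a placeholder; use the hostname from squeue
spark = (
    SparkSession.builder
    .master("spark://node123:3001")
    .appName("my-analysis")
    .getOrCreate()
)

# Quick sanity check that the cluster accepts work
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
print(df.count())

spark.stop()
```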
View the Apache Spark Web UI
Please see the documentation on Proxying Web Services in JupyterHub in order to view the Spark Web UI.
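If a proxy isn’t available on your cluster, an SSH tunnel to the master’s node is a common alternative. A sketch, where node123 and cluster.example.edu are placeholders for the master node and your login node:

```bash
# Forward the web UI port (3002 above) to your local machine,
# then open http://localhost:3002 in a browser
ssh -L 3002:node123:3002 user@cluster.example.edu
```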