Spark and Avro – in a Docker container

These are really cliff notes for the next person, but quite useful.

I was working with Spark in a local Docker container using the very useful jupyter/docker-stacks images. These are a really nice way to get a fully working Spark installation on your Mac without piling lots of packages onto your local machine. I’m now a big convert to working from Docker, since I don’t have to keep installing things over and over.

Basic Jupyter functionality —

docker run -d \
       -v /Users/koblas/spark_demo:/home/jovyan/work \
       -p 8888:8888 -p 4040:4040 \
       jupyter/all-spark-notebook

Quick explanation (for those not in the know):

  • -d == detached, run in the background
  • -v … == mount the local Mac path as the work directory inside the container (/home/jovyan/work)
  • -p 8888:8888 == the Notebook UI port
  • -p 4040:4040 == the Spark UI
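
Once the container is up, a quick smoke test from a Python notebook confirms Spark is wired up. This is only a minimal sketch: it creates a local SparkContext by hand and runs a trivial job, using nothing beyond the pyspark package the image ships with.

import pyspark

# Attach to the local Spark bundled with the image
sc = pyspark.SparkContext("local[*]")

# Trivial job: sum the numbers 0..99 to prove the workers respond
print(sc.parallelize(list(range(100))).sum())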

The nice part is that this one command gives you a fully functional environment to work from, with one exception: you can’t load an Avro format file into PySpark. The README at https://github.com/databricks/spark-avro quickly shows you how to pull in the package when you control the spark-shell or spark-submit command line. Just one problem: it’s not at all clear what environment variable you need when the notebook launches Spark for you.

docker run -d \
       -v /Users/koblas/spark_demo:/home/jovyan/work \
       -e PYSPARK_SUBMIT_ARGS='--packages com.databricks:spark-avro_2.10:2.0.1 pyspark-shell' \
       -p 8888:8888 -p 4040:4040 \
       jupyter/all-spark-notebook

What’s needed is the PYSPARK_SUBMIT_ARGS magic, which makes PySpark launch the shell with the additional package on the classpath. This works nicely, though it would be great if the jupyter documentation got updated to mention it.
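
For reference, here is roughly what the Avro read looks like from a notebook cell once the container above is running. This is a sketch: the file path under work/ is just an example, and the format name com.databricks.spark.avro comes from the spark-avro README for the Spark 1.x line that matches the 2.0.1 package.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]")
sqlContext = SQLContext(sc)

# Read an Avro file mounted under the work directory (example path)
df = sqlContext.read.format("com.databricks.spark.avro") \
    .load("/home/jovyan/work/episodes.avro")
df.show()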