
I'm a beginner with Docker and Spark with Python, and I'm trying out some Spark examples, extracting data from a local PostgreSQL database. I'm experimenting locally on a Windows 10 machine running Ubuntu 20.04 LTS, with docker-compose version 1.28. I keep running into the same issue, however: how do I add such-and-such a driver to my Docker images? In this case, it's the PostgreSQL JDBC driver. My question is very similar to this question, but I'm using docker-compose instead of plain docker.

Here is the docker-compose section for the all-spark-notebook image:

services:
  spark:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    working_dir: /home/$USER/work
    volumes:
      - $PWD/work:/home/$USER/work
    environment:
      PYSPARK_SUBMIT_ARGS: --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 --jars /usr/share/java/postgresql.jar pyspark-shell

The --packages entry is necessary to get my Kafka integration to work in Jupyter (and it does). The --jars entry is my attempt to reference the PostgreSQL JDBC driver, which I installed from the Ubuntu LTS terminal using:

sudo apt-get install libpostgresql-jdbc-java libpostgresql-jdbc-java-doc

In Python, I've tried this:

import findspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Point Spark at the driver jar installed above
conf = SparkConf()
conf.set("spark.jars", "/usr/share/java/postgresql.jar")

findspark.init()

spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("My App") \
    .getOrCreate()

# Read a table from the local PostgreSQL instance over JDBC
dataframe = spark.read.format('jdbc').options(
        url="jdbc:postgresql://host.docker.internal:5432/postgres?user=user&password=***",
        database='postgres',
        dbtable='cloud.some-table'
    ).load()

dataframe.show()

But I get the following error message:

java.sql.SQLException: No suitable driver

just like the poster in the question I referenced.

Any ideas? This should be easy, but I'm struggling.

Buck8pe

1 Answer


OK, since nobody has come back with an answer, I'll post what worked for me (in the end). I'm not claiming this is the correct way to do it, and I'm happy for someone to post a better answer, but it may get someone out of trouble.

Since different configurations (and versions!) require different solutions, I'll define my setup first. I'm using Docker Desktop for Windows 10 with Docker Engine v20.10.5, and I'm managing my containers with docker-compose version 1.29.0. I'm using the latest all-spark-notebook image (whatever version that is) and the postgresql-42.2.19 JDBC driver.

I'll also say that this is running on my local Windows machine with Ubuntu LTS installed and is for experimentation only.

The trick that worked for me was: a) use a --packages entry for the JDBC driver, so that Spark installs the package from Maven at runtime (when you create the Spark instance within Jupyter). The relevant part of my docker-compose file:

volumes:
  - $PWD/work:/home/$USER/work
environment:
  PYSPARK_SUBMIT_ARGS: --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,org.postgresql:postgresql:42.2.19 --driver-class-path /home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.19.jar pyspark-shell
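As a quick sanity check (optional, but it saves some head-scratching), you can confirm from a notebook cell that the variable actually reaches the container before you create the Spark session:

import os

# Should print the --packages/--driver-class-path string from the compose file;
# if it prints None, the environment block isn't making it into the container.
print(os.environ.get("PYSPARK_SUBMIT_ARGS"))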

b) Understand where the package jars are unpacked and use that directory to tell Spark where to find them. In my case, I used this to start Spark within the Jupyter notebook:

spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", "/home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.19.jar") \
    .appName("My App") \
    .getOrCreate()
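With the driver on the class path, the JDBC read from the question should then go through. Roughly like this (the host, table and credentials are just the placeholders from the question, so swap in your own):

# Read from PostgreSQL over JDBC; connection details are placeholders.
dataframe = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://host.docker.internal:5432/postgres") \
    .option("driver", "org.postgresql.Driver") \
    .option("user", "user") \
    .option("password", "***") \
    .option("dbtable", "cloud.some-table") \
    .load()

dataframe.show()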

One other thing to note: this can be a bit flaky. If Spark decides it needs to re-pull the files from Maven (which it will do the first time around, obviously), the library isn't picked up and the connection fails. However, running docker-compose stop and docker-compose up -d to recycle the containers and re-running the Python script makes the connection happy. I don't pretend to know why, but my suspicion is that, the way I have things set up, there's some dependency there.
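If you hit the flaky case, one way to check whether Maven/Ivy has actually finished pulling the driver is to look for the jar in the cache before building the session (the path is the same one used in --driver-class-path above):

import os

# If this prints False, Spark hasn't downloaded the package yet and the
# JDBC connection will most likely fail with "No suitable driver".
jar_path = "/home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.19.jar"
print(os.path.exists(jar_path))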

Buck8pe