I'm a beginner with Docker and Spark with Python, and I'm trying out some Spark examples that extract data from a local PostgreSQL database. I'm experimenting locally on a Windows 10 machine running Ubuntu 20.04 LTS. My docker-compose version is 1.28. I keep running into the same issue, however: how do I add a given driver to my Docker images? In this case, it's the PostgreSQL JDBC driver. My question is very similar to this question, but I'm using docker-compose instead of plain docker.
Here is the docker-compose section for the all-spark-notebook image:
services:
  spark:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    working_dir: /home/$USER/work
    volumes:
      - $PWD/work:/home/$USER/work
    environment:
      PYSPARK_SUBMIT_ARGS: --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 --jars /usr/share/java/postgresql.jar pyspark-shell
The --packages entry is necessary to get my Kafka integration working in Jupyter (and it does work). The --jars entry is my attempt to reference the PostgreSQL JDBC driver, which I installed from the Ubuntu terminal using:
sudo apt-get install libpostgresql-jdbc-java libpostgresql-jdbc-java-doc
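One thing I'm unsure about: apt installs that jar on the host side, and the container has its own filesystem, so I don't know whether /usr/share/java/postgresql.jar even exists inside the container. If I need to make it visible, I assume a bind mount along these lines would do it (untested; the target path inside the container is my guess):

services:
  spark:
    # ... rest of the service definition as above ...
    volumes:
      - $PWD/work:/home/$USER/work
      # hypothetical: mount the host's apt-installed driver into the container
      # so that /usr/share/java/postgresql.jar actually exists there
      - /usr/share/java/postgresql.jar:/usr/share/java/postgresql.jar:ro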
In Python, I've tried this:
import findspark
findspark.init()  # locate the Spark installation before importing pyspark

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.jars", "/usr/share/java/postgresql.jar")

spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("My App") \
    .getOrCreate()

dataframe = spark.read.format("jdbc").options(
    url="jdbc:postgresql://host.docker.internal:5432/postgres?user=user&password=***",
    database="postgres",
    dbtable="cloud.some-table"
).load()
dataframe.show()
But I get the following error message:
java.sql.SQLException: No suitable driver
just like the poster of the question I referenced.
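From what I've read, "No suitable driver" usually means Spark couldn't find (or wasn't told about) the JDBC driver class. I've seen examples that name the class explicitly via the driver option, roughly like this (I haven't gotten this to work either, presumably because the jar still isn't on the container's classpath):

# hypothetical variant: name the driver class explicitly
dataframe = spark.read.format("jdbc").options(
    url="jdbc:postgresql://host.docker.internal:5432/postgres?user=user&password=***",
    driver="org.postgresql.Driver",  # class provided by the PostgreSQL JDBC jar
    dbtable="cloud.some-table"
).load()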
Any ideas? This should be easy, but I'm struggling.