When using a `SparkJDBCDataSet` to load a table over a JDBC connection, I keep running into an error that Spark cannot find my driver. The driver definitely exists on the machine, and its path is specified inside the `spark.yml` file under `config/base`.
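For context, my `spark.yml` looks roughly like this (the jar path is a placeholder, not my real path):

```yaml
# config/base/spark.yml -- sketch; the driver jar path is a placeholder
spark.jars: /path/to/postgresql-42.2.5.jar
spark.driver.extraClassPath: /path/to/postgresql-42.2.5.jar
```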
I've also followed the instructions and added an `init_spark_session` method to `src/project_name/run.py`, sketched below.
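It follows the pattern from the Kedro docs and looks roughly like this (simplified; exact import paths may differ between Kedro versions):

```python
# src/project_name/run.py -- simplified sketch of what I added
from kedro.context import KedroContext
from pyspark import SparkConf
from pyspark.sql import SparkSession


class ProjectContext(KedroContext):
    def init_spark_session(self) -> None:
        """Build a SparkSession from the parameters defined in spark.yml."""
        parameters = self.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())
        (
            SparkSession.builder.appName("project_name")
            .enableHiveSupport()
            .config(conf=spark_conf)
            .getOrCreate()
        )
```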
I'm suspicious, though, that the `SparkSession` defined there is not being picked up by the `SparkJDBCDataSet` class. Looking at the source code for how `SparkJDBCDataSet` creates the `SparkSession` and loads datasets, it appears that a vanilla `SparkSession` with no configs is used to load and save the data; the configs defined inside `spark.yml` are never applied to it. Below is an excerpt from the source code:
```python
@staticmethod
def _get_spark():
    return SparkSession.builder.getOrCreate()

def _load(self) -> DataFrame:
    return self._get_spark().read.jdbc(self._url, self._table, **self._load_args)
```
When I load data from a JDBC source outside of Kedro, using a `SparkSession` defined with `spark.jars`, the data loads as expected.
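For example, something like the following works for me outside Kedro (the URL, table name, credentials and driver class are placeholders for my actual connection details):

```python
from pyspark.sql import SparkSession

# Building the session with spark.jars makes the JDBC driver visible to Spark.
spark = (
    SparkSession.builder.appName("jdbc-smoke-test")
    .config("spark.jars", "/path/to/postgresql-42.2.5.jar")
    .getOrCreate()
)

# Plain PySpark JDBC read; this succeeds with the session built above.
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="my_table",
    properties={
        "user": "username",
        "password": "password",
        "driver": "org.postgresql.Driver",
    },
)
```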
Is there a way to specify `spark.jars`, as well as other Spark conf options, when building the `SparkSession` that reads the data in?