
When using a `SparkJDBCDataSet` to load a table over a JDBC connection, I keep running into an error saying that Spark cannot find my driver. The driver definitely exists on the machine, and its directory is specified inside the spark.yml file under conf/base.

I've also followed the instructions and added an `init_spark_session` method to src/project_name/run.py. I suspect, though, that the SparkSession defined there is not being picked up by the `SparkJDBCDataSet` class. Looking at the source code for creating the SparkSession and loading datasets inside `SparkJDBCDataSet`, it appears that a vanilla SparkSession with no configs is used to load and save the data; the configs defined inside spark.yml are not used to create this SparkSession. Below is an excerpt from the source code:

    @staticmethod
    def _get_spark():
        return SparkSession.builder.getOrCreate()

    def _load(self) -> DataFrame:
        return self._get_spark().read.jdbc(self._url, self._table, **self._load_args)

When I load data from a JDBC source outside of Kedro, using a SparkSession defined with spark.jars, the data loads as expected.
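
For illustration, the standalone code that works for me looks roughly like the following (the jar path, JDBC URL, table name and driver class here are placeholders rather than my real values):

    # Standalone (non-Kedro) load: the driver jar is passed explicitly
    # via spark.jars when the session is built. All paths, URLs and
    # table names below are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("jdbc-test")
        .config("spark.jars", "/path/to/jdbc-driver.jar")
        .getOrCreate()
    )

    df = spark.read.jdbc(
        url="jdbc:postgresql://host:5432/db",
        table="schema.my_table",
        properties={"driver": "org.postgresql.Driver"},
    )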

Is there a way to specify spark.jars, as well as other Spark configuration, when building the SparkSession that reads the data in?

Weiyi Yin
  • As pointed out by @tamsanh, `SparkSession` is a singleton, therefore it will return the existing session if one has already been initialised, or a new one otherwise. Can you make sure that your `init_spark_session` is actually called before the corresponding dataset is loaded? Ideally, it should be invoked in `ProjectContext.__init__()`, as suggested in [the documentation](https://kedro.readthedocs.io/en/stable/04_user_guide/09_pyspark.html#initialising-a-sparksession) – Dmitry Deryabin Mar 19 '20 at 18:26

1 Answer


`SparkSession.builder.getOrCreate()` will actually do as it says and get the existing Spark session if there is one. However, you're correct that, if there is no existing session, a vanilla session will be created.
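
As a minimal illustration of that singleton behaviour (the jar path is a placeholder and this is not Kedro-specific code):

    from pyspark.sql import SparkSession

    # If a configured session is created first...
    configured = (
        SparkSession.builder
        .config("spark.jars", "/path/to/jdbc-driver.jar")
        .getOrCreate()
    )

    # ...then any later getOrCreate() returns that same session,
    # which is exactly what the dataset's _get_spark() relies on.
    vanilla = SparkSession.builder.getOrCreate()
    assert vanilla is configured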

The best place to call `init_spark_session` is in the `run_package` function in your run.py, right after the project context is loaded. That run.py is executed when the `kedro run` command is called.
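
A sketch of what that can look like (the exact run.py template varies between Kedro versions, so treat the function body as illustrative rather than the literal template):

    # Illustrative sketch only: the run.py template varies between
    # Kedro versions, so the surrounding function is not exact.
    from kedro.context import load_context

    def run_package():
        # Load the project context, which knows about conf/base/spark.yml.
        context = load_context("./")

        # Build the configured SparkSession *before* any SparkJDBCDataSet
        # is loaded, so that _get_spark() picks up this session instead
        # of creating a vanilla one.
        context.init_spark_session()

        context.run()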

If you wish to test your catalog alone, the simple workaround is to make sure that, in your testing code, you call `init_spark_session` manually before executing the JDBC connection code.

This can be done with the following:

    from kedro.context import load_context

    kedro_project_path = "./"
    context = load_context(kedro_project_path)
    context.init_spark_session()

Where kedro_project_path is appropriate.


tamsanh
  • Wanted to follow up on this. We were able to get this working with Kedro 0.15.2 and 0.15.8; we originally ran into this problem in Kedro 0.15.5. So we solved it by updating Kedro and rebuilding the pipeline in a new Kedro project. – Weiyi Yin Apr 06 '20 at 15:36