
I am looking to use Databricks Connect for developing a PySpark pipeline. DBConnect is really awesome because I can run my code on the cluster where the actual data resides, so it's perfect for integration testing. But during development and unit testing (pytest with pytest-spark), I also want to be able to simply use a local Spark environment.

Is there any way to configure DBConnect so that for one use case it simply uses a local Spark environment, while for the other it uses DBConnect?

casparjespersen
  • Is `sc.stop()`, then `conf = SparkConf().setMaster("local")`, `sc = SparkContext(conf=conf)` what you're looking for? – Mirabilis Jul 22 '20 at 08:17
  • 3
    Similarly, `SparkSession.builder.master('local').getOrCreate()` was working with `a new venv`. I used to be in the venv where the databricks-connect package having a pyspark is installed but this was still trying to connect with remote cluster. To resolve it, I'm have two venvs; one for databricks-connect(remote cluster) and one for local cluster – sunsets Jun 09 '21 at 05:09
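
Spelling out the first comment's recipe as a runnable sketch: stop whatever SparkContext is active and recreate it with a local master (the app name below is my own illustrative addition):

```python
# Minimal sketch of the first comment's suggestion: tear down the current
# SparkContext (e.g. one pointed at a remote cluster) and recreate it locally.
from pyspark import SparkConf, SparkContext

SparkContext.getOrCreate().stop()  # stop the currently active context

conf = SparkConf().setMaster("local").setAppName("local-dev")  # app name is illustrative
sc = SparkContext(conf=conf)       # fresh, purely local context
```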

1 Answer


My 2 cents, since I've been doing this type of development for some months now:

  • Work with two Python environments: one with databricks-connect (and thus no pyspark installed), and another with only pyspark installed. When you want to execute the tests, just activate the "local" virtual environment and run pytest as usual. Make sure, as some commenters pointed out, that you initialize the PySpark session using `SparkConf().setMaster("local")`; see the fixture sketch after this list.
  • PyCharm helps immensely when switching between environments during development. I am on the "local" venv by default, and whenever I want to execute something using databricks-connect, I just create a new Run configuration from the menu. Easy peasy.
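
A minimal sketch of what that local test setup can look like, assuming plain pyspark and pytest in the "local" venv (the fixture name, app name, and file layout are my own choices, not from the answer):

```python
# conftest.py -- session-scoped local Spark fixture for unit tests.
import pytest
from pyspark import SparkConf
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # A local master guarantees the tests never reach a Databricks cluster.
    conf = SparkConf().setMaster("local[*]").setAppName("unit-tests")
    session = SparkSession.builder.config(conf=conf).getOrCreate()
    yield session
    session.stop()
```

A test then just takes the fixture as an argument:

```python
def test_row_count(spark):
    df = spark.createDataFrame([(1,), (2,)], ["id"])
    assert df.count() == 2
```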

Also, be aware of some of databricks-connect's limitations:

  • It is no longer officially supported, and Databricks recommends moving towards dbx whenever possible.
  • UDFs just won't work in databricks-connect.
  • MLflow integration is not reliable. In my use case, I can download and use models, but I am unable to log a new experiment or track models using the Databricks tracking URI. This may depend on your Databricks Runtime, MLflow, and local Python versions.
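
To tie this back to the original question: with the two-venv setup, pipeline code can build its session through a single entry point, and which cluster you get is decided purely by which venv is active. The `SPARK_ENV` variable and `get_spark` helper below are hypothetical, my own illustration rather than anything from the answer:

```python
import os
from pyspark.sql import SparkSession

def get_spark() -> SparkSession:
    # In the databricks-connect venv, a plain getOrCreate() picks up the
    # remote cluster configured via `databricks-connect configure`.
    # Everywhere else we pin an explicit local master.
    if os.environ.get("SPARK_ENV", "local") == "local":  # SPARK_ENV is a hypothetical switch
        return SparkSession.builder.master("local[*]").getOrCreate()
    return SparkSession.builder.getOrCreate()
```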
itscarlayall