I am working in a Travis CI, MLflow and Databricks environment. A .travis.yml file sits on the git master branch and watches for changes to a .py file; whenever that file is updated, Travis runs an mlflow command to execute the .py file in the Databricks environment.
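For context, the .travis.yml that triggers the run is roughly along these lines (the install/script steps and the exact mlflow invocation are simplified; DATABRICKS_HOST and DATABRICKS_TOKEN are set as encrypted Travis environment variables):

language: python
python:
  - "3.7"
install:
  - pip install mlflow databricks-cli
script:
  # Run the MLproject against Databricks; cluster-spec.json describes the job cluster
  - mlflow run . --backend databricks --backend-config cluster-spec.json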
My MLproject file looks like this:
name: mercury_cltv_lib
conda_env: conda-env.yml

entry_points:
  main:
    command: "python3 run-multiple-notebooks.py"
The workflow is as follows: Travis CI detects a change on the master branch --> triggers a build that runs the mlflow command, which spins up a job cluster in Databricks to run the .py file from the repo.
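The --backend-config file passed to mlflow run (the cluster-spec.json referenced above) is a standard Databricks new-cluster spec; mine is roughly the following (runtime version and node type are illustrative):

{
  "spark_version": "6.4.x-scala2.11",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 1
}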
This worked fine with a single .py file, but when I tried to run multiple notebooks using dbutils, it started throwing:
File "run-multiple-notebooks.py", line 3, in <module>
from pyspark.dbutils import DBUtils
ModuleNotFoundError: No module named 'pyspark.dbutils'
Please find below the relevant code section from run-multiple-notebooks.py
def get_spark_session():
    from pyspark.sql import SparkSession
    return SparkSession.builder.getOrCreate()

def get_dbutils(spark=None):
    try:
        if spark is None:
            spark = get_spark_session()
        from pyspark.dbutils import DBUtils  # error line
        dbutils = DBUtils(spark)             # error line
    except ImportError:
        # Fall back to the dbutils handle that Databricks injects into the
        # IPython user namespace when running inside a notebook
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]
    return dbutils

def submitNotebook(notebook):
    print("Running notebook %s" % notebook.path)
    spark = get_spark_session()
    dbutils = get_dbutils(spark)
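The rest of submitNotebook then runs each notebook through dbutils.notebook.run; simplified, it is along these lines (the timeout and parameters attributes are placeholders for my actual values):

# Simplified; real timeout/parameter values omitted
result = dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
print("Notebook %s returned: %s" % (notebook.path, result))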
I have tried all the options, including the suggestions in https://stackoverflow.com/questions/61546680/modulenotfounderror-no-module-named-pyspark-dbutils, but it is still not working :(
Can someone please suggest a fix for the above error when running the .py file on a job cluster? The code works fine inside a Databricks notebook, but running it from outside via Travis CI and MLflow does not, and that is a must-have requirement for pipeline automation.