
I have my Databricks Python code in GitHub. I set up a basic workflow to lint the Python code using flake8. This fails because the names that are implicitly available to my script when it runs on Databricks (like spark, sc, dbutils, getArgument, etc.) are not available when flake8 lints it outside Databricks (on the GitHub Ubuntu VM).

How can I lint Databricks notebooks in GitHub using flake8?

E.g. errors I get:

test.py:1:1: F821 undefined name 'dbutils'
test.py:3:11: F821 undefined name 'getArgument'
test.py:5:1: F821 undefined name 'dbutils'
test.py:7:11: F821 undefined name 'spark'

my notebook in github:

dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")

jdbcurl = getArgument("my_jdbcurl")

dbutils.fs.ls(".")

df_node = spark.read.format("jdbc")\
  .option("driver", "org.mariadb.jdbc.Driver")\
  .option("url", jdbcurl)\
  .option("dbtable", "my_table")\
  .option("user", "my_username")\
  .option("password", "my_pswd")\
  .load()

my .github/workflows/lint.yml

on:
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v1
      with:
        python-version: 3.8
    - run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Lint with flake8
      run: |
        pip install flake8
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
Kashyap
  • You should find out how Databricks invokes `flake8`, including what dependencies it provides. That will tell you how you should invoke `flake8` in GitHub Actions. – bk2204 Apr 03 '20 at 20:11
    @bk2204, I didn't quite get that. In this case it's GitHub invoking `flake8`, not Databricks. – Kashyap Apr 03 '20 at 22:19

4 Answers


One thing you can do is this:

from pyspark.sql import SparkSession


spark = SparkSession.builder.getOrCreate()

This will work with or without Databricks, in normal Python or in the pyspark client.

To detect if you are in a file or in a Databricks notebook, you can run:

try:
    __file__
    print("We are in a file, like in our IDE or being tested by flake8.")
except NameError:
    print("We are in a Databricks notebook. Act accordingly.")

You could then conditionally initialize or create dummy variables for display() and other tools.
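A minimal sketch of that conditional setup, reusing the guard above and the names from the question (the stand-in display() that just calls df.show() is my own assumption, not part of the original answer):

try:
    __file__  # defined when running as a plain .py file (IDE, flake8, tests)
    IN_DATABRICKS_NOTEBOOK = False
except NameError:
    IN_DATABRICKS_NOTEBOOK = True

if not IN_DATABRICKS_NOTEBOOK:
    from pyspark.sql import SparkSession

    # Locally, create the objects that Databricks would normally inject.
    spark = SparkSession.builder.getOrCreate()

    def display(df):
        # Crude stand-in for the notebook's display(): just print the DataFrame.
        df.show()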

This is only a partial solution. I am working on a better one and will keep this answer updated.

rjurney

In my opinion, no linter works for every use case, so this is what I do: I use a pre-commit hook and ignore rule F821.

# Flake rules: https://lintlyci.github.io/Flake8Rules/
- repo: https://gitlab.com/pycqa/flake8
  rev: 3.8.4
  hooks:
    - id: flake8
      exclude: (^docs/)
      additional_dependencies: [flake8-typing-imports==1.7.0]
      # F821 undefined name
      args:
        [
          "--max-line-length=127",
          "--config=setup.cfg",
          "--ignore=F821",
        ]

To match your syntax, add the --ignore flag:

flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --ignore=C901,F821 --statistics
tomarv2

You can add --builtins=dbutils,spark,display to ignore variables that are built into the Databricks notebook environment.
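For example, appended to the flake8 command from the question's workflow (the list of names simply mirrors the ones used in the question; this is a sketch, adjust it to your own globals):

flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics --builtins=dbutils,sc,spark,getArgument,display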

jencake
  • But you will hit the next issue, which is functions imported with `%run` being recognized as undefined. Curious if anyone has a solution to that. So far I've tried flake8_nb and nbqa; both seem to require that the notebook being imported has an .ipynb file extension, but Databricks syntax leaves out that extension. – jencake Jan 26 '23 at 18:50
  • Added a solution/answer. Also `--builtins=dbutils,spark,display` would satisfy `flake8` but in the long term, in production code, you'll need to run unit tests and so on, so you'll end up creating local spark sessions, so... – Kashyap May 09 '23 at 14:33

TL;DR

Don't use the built-in variable dbutils in code that needs to run both locally (IDE, unit tests, ...) and in Databricks (production). Create your own instance of the DBUtils class instead.


Here is what we ended up doing:

We created a new dbk_utils.py:

from pyspark.sql import SparkSession

def get_dbutils(spark: SparkSession):
    try:
        # Preferred: build DBUtils from the SparkSession (pyspark.dbutils is
        # available on Databricks clusters and with databricks-connect).
        from pyspark.dbutils import DBUtils
        return DBUtils(spark)

    except ModuleNotFoundError:
        # Fallback: pick up the dbutils global from the notebook's IPython user namespace.
        import IPython
        return IPython.get_ipython().user_ns["dbutils"]

And update the code that uses dbutils to use this utility:

from pyspark.sql import SparkSession

from dbk_utils import get_dbutils

spark = SparkSession.builder.getOrCreate()
my_dbutils = get_dbutils(spark)

my_dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")
my_dbutils.fs.ls(".")

jdbcurl = my_dbutils.widgets.getArgument("my_jdbcurl")

df_node = spark.read.format("jdbc")\
  .option("driver", "org.mariadb.jdbc.Driver")\
  .option("url", jdbcurl)\
  .option("dbtable", "my_table")\
  .option("user", "my_username")\
  .option("password", "my_pswd")\
  .load()

If you're trying to do unit testing as well, then check out:

Kashyap