
I'm developing code that runs on Databricks. Since Databricks can't be run locally, I need to run my unit tests on a Databricks cluster. The problem is that when I install the wheel that contains my files, the test files are never installed. How do I get the test files installed?

Ideally I would like to keep src and tests in separate folders.


Here is my project's folder structure (it uses pyproject.toml only):

project
├── src
│   └── mylib
│       ├── functions.py
│       └── __init__.py
├── pyproject.toml
├── poetry.lock
└── tests
    ├── conftest.py
    └── test_functions.py

My pyproject.toml:

[tool.poetry]
name = "mylib"
version = "0.1.0"
packages = [
    {include = "mylib", from = "src"},
    {include = "tests"}
]

[tool.poetry.dependencies]
python = "^3.8"
pytest = "^7.1.2"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Without {include = "tests"} in pyproject.toml, poetry build doesn't include tests.

After poetry build I can see that the tests are included in the wheel produced (python3 -m wheel unpack <mywheel.whl>). But after I deploy it as a library on a Databricks cluster, I do not see any tests folder (ls -r .../site-packages/mylib* in a Databricks notebook shell cell), though functions.py is installed.
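For what it's worth, the wheel contents can also be listed without unpacking it, since a wheel is just a zip archive. A minimal sketch (the wheel filename below is an example of what poetry build produces):

import zipfile

# Hypothetical path -- adjust to whatever `poetry build` put into dist/.
wheel_path = "dist/mylib-0.1.0-py3-none-any.whl"

with zipfile.ZipFile(wheel_path) as whl:
    for name in whl.namelist():
        print(name)  # e.g. mylib/functions.py, tests/test_functions.py, ...

Both mylib/ and tests/ show up in that listing, which is why the missing tests folder after installation on the cluster is surprising.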

I also tried moving tests under src and updating the toml to {include = "tests", from = "src"}, but then the wheel produced contains both mylib and tests with the appropriate files, yet only mylib gets installed on Databricks.

project
├── src
│   ├── mylib
│   │   ├── functions.py
│   │   └── __init__.py
│   └── tests
│       ├── conftest.py
│       └── test_functions.py
├── pyproject.toml
└── poetry.lock

Since someone is pointing to dbx as the solution, I've tried to use it. It doesn't work for me. It has a bunch of basic restrictions (e.g. must use ML runtime), which renders it useless here, not to mention it expects that you use whatever toolset it recommends. Perhaps in a few years it will do what this post needs.

Kashyap
  • have you seen the DBX tool: https://docs.databricks.com/dev-tools/dbx.html ? – Alex Ott Aug 25 '22 at 18:26
  • @AlexOtt, I did. The list of its limitations is very long. It sucks waaay too much overall to be useful at this point. Perhaps in a year or two it'll mature. – Kashyap Aug 25 '22 at 18:38
  • hi @Kashyap, > The list of its limitations is very long. It sucks waaay too much overall to be useful at this point. Perhaps in a year or two it'll mature. Could you maybe pinpoint what exactly is missing in `dbx` to make it work for you? – renardeinside Aug 26 '22 at 07:30
  • @renardeinside To begin with, could you see how to implement the requirement in this post as well as https://stackoverflow.com/questions/73489698/how-to-reinstall-same-version-of-a-wheel-on-databricks-without-cluster-restart using `dbx`? It's a simple development cycle using `poetry` and trying to run `pytest` on Databricks. If `dbx` can't do it, then that would be one limitation. You can post a new question about limitations and tag me, I'll try to add what I recall. – Kashyap Aug 26 '22 at 14:27
  • hi @Kashyap, I've created an [issue](https://github.com/databrickslabs/dbx/issues/430) to add such a case to the documentation (spoiler: it should be possible, but it will take me some time to write this up for poetry). If you have more questions/issues/missing recipes in the docs, I would assume it's way more effective to create them as GitHub Issues on `dbx` rather than as questions on Stack Overflow. – renardeinside Aug 26 '22 at 16:09

2 Answers


If anyone else is struggling with this, here is what we finally ended up doing.

TL;DR:

  • Create a unit_test_runner.py that can install a wheel file and execute the tests packaged inside it. The key is that the wheel is installed at "notebook scope".
  • Deploy/copy unit_test_runner.py to Databricks DBFS and create a job pointing to it. The job parameter is the wheel file to run pytest against.
  • Build a wheel of your code, copy it to DBFS, and run the unit-test-runner job with the location of the wheel file as the parameter.

Project structure:

root
├── dist
│   └── my_project-0.1.0-py3-none-any.whl
├── poetry.lock
├── poetry.toml
├── pyproject.toml
├── module1.py
├── module2.py
├── housekeeping.py
├── common
│   └── aws.py
├── tests
│   ├── conftest.py
│   ├── test_module1.py
│   ├── test_module2.py
│   └── common
│       └── test_aws.py
└── unit_test_runner.py

unit_test_runner.py

import importlib.util
import logging
import os
import shutil
import sys
from enum import IntEnum

import pip
import pytest


def main(args: list) -> int:
    coverage_opts = []
    if args and args[0] == '--cov':
        coverage_opts = ['--cov']
        wheels_to_test = args[1:]
    else:
        wheels_to_test = args

    logging.info(f'coverage_opts: {coverage_opts}, wheels_to_test: {wheels_to_test}')

    overall_rc = 0
    for wh_file in wheels_to_test:
        # install the wheel so that the tests packaged inside it are importable
        logging.info('pip install %s', wh_file)
        pip.main(['install', wh_file])
        # we assume a wheel name like <pkg name>-<version>-...,
        # e.g. my_module-0.1.0-py3-none-any.whl
        pkg_name = os.path.basename(wh_file).split('-')[0]
        # locate the installed package instead of importing it, to avoid
        # skewing the coverage data
        pkg_root = os.path.dirname(importlib.util.find_spec(pkg_name).origin)
        os.chdir(pkg_root)

        pytest_opts = [f'--rootdir={pkg_root}']
        pytest_opts.extend(coverage_opts)

        logging.info(f'pytest_opts: {pytest_opts}')
        rc = pytest.main(pytest_opts)
        logging.info(f'pytest-status: {rc}, wheel: {wh_file}')
        generate_coverage_data(pkg_name, pkg_root, wh_file)

        # remember the first non-zero exit status instead of returning inside
        # the loop, so every wheel passed in gets tested
        rc = rc.value if isinstance(rc, IntEnum) else rc
        overall_rc = overall_rc or rc

    return overall_rc


def generate_coverage_data(pkg_name, pkg_root, wh_file):
    if os.path.exists(f'{pkg_root}/.coverage'):
        shutil.rmtree(f'{pkg_root}/htmlcov', ignore_errors=True)
        output_tar = f'{os.path.dirname(wh_file)}/{pkg_name}-coverage.tar.gz'
        rc = os.system(f'coverage html --data-file={pkg_root}/.coverage && tar -cvzf {output_tar} htmlcov')
        logging.info('rc: %s, coverage data available at: %s', rc, output_tar)


if __name__ == "__main__":
    # make sure INFO messages from this script are emitted
    logging.basicConfig(level=logging.INFO)
    # silence annoying logging
    logging.getLogger("py4j").setLevel(logging.ERROR)
    logging.info('sys.argv[1:]: %s', sys.argv[1:])
    rc = main(sys.argv[1:])
    if rc != 0:
        raise Exception(f'Unit test execution failed. rc: {rc}, sys.argv[1:]: {sys.argv[1:]}')

WORKSPACE_ROOT='/home/kash/workspaces'
USER_NAME='kash@company.com'
cd $WORKSPACE_ROOT/my_project
echo 'copying runner..' && \
  databricks fs cp --overwrite unit_test_runner.py dbfs:/user/$USER_NAME/
  • Go to the Databricks GUI and create a job pointing to dbfs:/user/$USER_NAME/unit_test_runner.py. This can also be done with the CLI or the REST API (see the sketch after this list).
    • Type of job: Python Script
    • Source: DBFS/S3
    • Path: dbfs:/user/$USER_NAME/unit_test_runner.py
  • Run databricks jobs list to find job id, e.g. 123456789
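As a reference, here is a rough sketch of creating that job via the Jobs API 2.1 instead of the GUI. The DATABRICKS_HOST/DATABRICKS_TOKEN environment variables and the cluster id are placeholders for your workspace:

import os
import requests

host = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

# Job definition equivalent to the GUI settings above:
# a Python Script task, sourced from DBFS, pointing at unit_test_runner.py.
job_spec = {
    "name": "unit-test-runner",
    "tasks": [
        {
            "task_key": "run-unit-tests",
            "existing_cluster_id": "0123-456789-abcdefgh",  # placeholder cluster id
            "spark_python_task": {
                "python_file": "dbfs:/user/kash@company.com/unit_test_runner.py",
                "parameters": [],  # overridden per run via --python-params
            },
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print("job_id:", resp.json()["job_id"])

The job_id it prints is the id you pass to databricks jobs run-now below.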
cd $WORKSPACE_ROOT/my_project
poetry build -f wheel # could be replaced with any builder that creates a wheel file
whl_file=$(ls -1tr dist/my_project*-py3-none-any.whl | tail -1 | xargs basename)
echo 'copying wheel...' && databricks fs cp --overwrite dist/$whl_file dbfs:/user/$USER_NAME/wheels
echo 'launching job..' && \
  databricks jobs run-now --job-id 123456789 --python-params "[\"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"
# OR with coverage
echo 'launching job with coverage..' && \
  databricks jobs run-now --job-id 123456789 --python-params "[\"--cov\", \"/dbfs/user/$USER_NAME/wheels/$whl_file\"]"
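databricks jobs run-now returns a run id but does not wait for the run to finish. If you want to block until the tests are done and read the result, here is a rough sketch that polls the Jobs API 2.1 (run_id and the host/token environment variables are placeholders):

import os
import time
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
run_id = 987654321  # placeholder: the run_id returned by `databricks jobs run-now`

while True:
    resp = requests.get(
        f"{host}/api/2.1/jobs/runs/get",
        headers={"Authorization": f"Bearer {token}"},
        params={"run_id": run_id},
    )
    resp.raise_for_status()
    state = resp.json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        break
    time.sleep(30)

# result_state is SUCCESS only if unit_test_runner.py exited cleanly, i.e.
# pytest returned 0 (otherwise the script raises and the run is marked FAILED).
print(state.get("result_state"), state.get("state_message"))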

If you ran with the --cov option, then to fetch and open the coverage report:

rm -rf htmlcov/ my_project-coverage.tar.gz
databricks fs cp dbfs:/user/$USER_NAME/wheels/my_project-coverage.tar.gz .
tar -xvzf my_project-coverage.tar.gz
firefox htmlcov/index.html
Kashyap

Author of dbx here.

I've updated the public doc; please take a look at this section for details on how to set up integration tests.

UPD. as per comment:

It has a bunch of basic restrictions (e.g. must use ML runtime)

This is not a requirement; you just need to use any Databricks Runtime 10+. We'll update the project doc to point out that this is no longer a limitation.

it expects that you use whatever toolset it recommends

This statement is simply incorrect.

Here is a step-by-step walkthrough for a setup identical to the one above (maybe this is unclear from the doc, but it contains exactly the same steps):

  1. Create a project dir and move into it:
mkdir mylib && cd mylib
  2. Initialise a poetry project in it:
poetry init -n
  3. Provide the following poetry pyproject.toml:
[tool.poetry]
name = "mylib"
version = "0.1.0"
# without description and authors it won't be compiled to a wheel
description = "some description"
authors = []

packages = [
    {include = "mylib", from = "src"},
]

[tool.poetry.dependencies]
python = "^3.8"
pytest = "^7.1.2"
pytest-cov = "^3.0.0"
dbx = "^0.7.3"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
  4. Install dependencies locally to make dbx available:
poetry install
  5. Write some sample code, e.g. src/mylib/functions.py:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def cast_column_to_string(df: DataFrame, col_name: str) -> DataFrame:
    return df.withColumn(col_name, col(col_name).cast("string"))
  6. Write a test for it in tests/integration/sample_test.py:
from mylib.functions import cast_column_to_string
from pyspark.sql import SparkSession

def test_column_to_string():
    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0,10)
    _converted = cast_column_to_string(df, "id")
    assert dict(_converted.dtypes)["id"] == "string"
  7. Create an entrypoint file tests/entrypoint.py:
import sys

import pytest

if __name__ == '__main__':
    pytest.main(sys.argv[1:])
  8. Configure the test workflow in the conf/deployment.yml:

custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "10.4.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 1
      node_type_id: "{{some-node-type}}"

build:
  commands:
    - "poetry build -f wheel" #safe to use inside poetry venv

environments:
  default:
    workflows:
      - name: "mylib-tests"
        tasks:
          - task_key: "main"
            <<: *basic-static-cluster
            spark_python_task:
                python_file: "file://tests/entrypoint.py"
                # this call supports all standard pytest arguments
                parameters: ["file:fuse://tests/integration", "--cov=mylib"]
  9. Configure dbx to use a specific profile:
dbx configure --profile=<your Databricks CLI profile name>

Checkpoint: at this point the final layout looks like this:

.
├── conf
│   └── deployment.yml
├── poetry.lock
├── pyproject.toml
├── src
│   └── mylib
│       └── functions.py
└── tests
    ├── entrypoint.py
    └── integration
        └── sample_test.py

  10. Launch the tests on an all-purpose cluster (also non-ML clusters are supported since Databricks Runtime version 10+):
dbx execute mylib-tests --task=main --cluster-name=<some-all-purpose-cluster>
  11. [Optional] Launch tests as a job on a job cluster:
dbx deploy mylib-tests --assets-only
dbx launch mylib-tests --from-assets
renardeinside
  • I see, maybe the doc still doesn't 100% reflect this specific case, so I've added a very thorough walkthrough. Would that work for your case? – renardeinside Aug 30 '22 at 17:28
  • Hopefully this helps someone who's willing to test `dbx` for you. As mentioned in OP, I can't use it currently because it expects I use an ML runtime (can't run tests on one runtime version and then deploy to prod with a different one). – Kashyap Aug 30 '22 at 20:16
  • > it expects I use an ML runtime (can't run tests on one runtime version and then deploy to prod with a different one) This statement is also incorrect - I've explicitly mentioned this in my answer: `also non-ML clusters are supported since Databricks Runtime version 10+`. Could you please point to the doc where this limitation is mentioned? I cannot find it in the project docs. It was a thing previously, but it's available in non-ML Runtimes since DBR 10+. – renardeinside Aug 30 '22 at 20:38
  • UPD: spotted it, thanks a lot @Kashyap, really helpful. We'll update the docs accordingly. – renardeinside Aug 30 '22 at 20:59
  • Not sure if you got a chance to test your post. A few more things: 1. `poetry init -n` won't create the `src/` folder based structure or `pyproject.toml`. 2. `dbx execute mylib-tests --cluster-name=...` fails with `'.../mylib/.venv/bin/python: No module named poetry`. 3. `pyspark` is missing from dev dependencies. 4. In a shell, `poetry build -f wheel` builds the wheel, which contains no tests. So I would wager if the wheel was transported to the DBK cluster, execution on DBK would've failed with something like `entry_point.py not found`. – Kashyap Aug 30 '22 at 21:10
  • > Not sure if you got a chance to test your post. I've tested it thoroughly before writing. 1. poetry init -n won't create the src/ folder based structure or pyproject.toml Yes it won't. The instruction explicitly says to create src/mylib/functions.py file. 2. I'm not using venv here. However if you're running into issues here, I've added a comment. 3. pyspark is not required in the dependencies, since the code is executed on Databricks and it's already provided there. 4. tests folder and entrypoint file are uploaded during dbx execute, so your guess is simply incorrect. – renardeinside Aug 30 '22 at 21:42
  • This is the last time I tested this for you. 2. You can't use venv and poetry together, poetry manages the venv. Once I applied your "correction", build goes through. 3. Your local IDE workspace would not be compile clean. 4. `dbx execute` params in post are wrong. After I fixed (you can run it for yourself to see the error) it and ran this is what I get `"FileNotFoundError: File tests/entrypoint.py is mentioned in the task or job definition, but is non-existent"`. So perhaps your thorough testing needs to be a little more thorough. – Kashyap Aug 31 '22 at 15:28
  • I've just launched it with the exact layout as provided in the answer. 3. the question is not about local IDE support - the question is about launching this specific code on Databricks. this is provided. Installation of dependent libraries is out of context of this discussion. As for > FileNotFoundError: File tests/entrypoint.py is mentioned in the task or job definition, but is non-existent - it simply means that this file doesn't exist, which means that instruction wasn't followed accordingly (step 7 explicitly says where this file should be located and what should be in it). – renardeinside Aug 31 '22 at 16:10
  • the file obviously exists exactly as described in step 7, why would I call it an error otherwise..! – Kashyap Aug 31 '22 at 16:21
  • This is simply impossible then. Could you please run the following command in the project directory and share the output: tree . ? – renardeinside Aug 31 '22 at 16:24
  • Please note that the `tests` directory is outside the `src` folder in the answer (they're on the same level). Maybe this is the root cause? – renardeinside Aug 31 '22 at 16:29
  • I agree with the comments that dbx is very tied to MLflow. If I just want to run an ETL pipeline, it still creates an ml_experiment. Also, I want to specify the whl location but it only accepts dbfs:/ – Pari Nov 28 '22 at 11:36