
Maybe I misunderstand the purpose of packaging, but it doesn't seem very helpful for creating an artifact for production deployment because it only packages code. It leaves out the conf, data, and other directories that make the Kedro project reproducible.

I understand that I can use the Docker or Airflow plugins for deployment, but what about deploying to Databricks? Do you have any advice here?

I was thinking about building a wheel that could be installed on the cluster, but I would need to package the conf first. Another option is to just sync a git workspace to the cluster and run Kedro via a notebook.

Any thoughts on a best practice?

dres

3 Answers


If you are not using Docker and are just using Kedro to deploy directly on a Databricks cluster, this is how we have been deploying Kedro to Databricks:

  1. The CI/CD pipeline builds the project with kedro package, which creates a wheel file.

  2. Upload dist and conf to DBFS, or use an Azure Blob file copy (if using Azure Databricks).

This uploads everything to Databricks on every git push; one possible upload step is sketched below.
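
Not part of the original answer, but as a rough illustration of step 2: a minimal sketch of an upload script a CI job might run, assuming Azure Blob storage mounted into DBFS at /dbfs/project_name/build/cicd/. The container name, environment variables, and branch variable are placeholders.

# Hypothetical CI upload step: copy the wheel built by `kedro package` and the
# conf/ folder to blob storage so the Databricks notebook below can read them
# from /dbfs/project_name/build/cicd/<branch>/.
import os
from pathlib import Path

from azure.storage.blob import BlobServiceClient

branch = os.environ.get("BRANCH_NAME", "master")  # set by the CI system

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
container = service.get_container_client("project-name")

# Upload dist/*.whl plus every file under conf/
for path in [*Path("dist").glob("*.whl"), *Path("conf").rglob("*")]:
    if path.is_file():
        blob_name = f"build/cicd/{branch}/{path.as_posix()}"
        with open(path, "rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)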

Then you can have a notebook with the following:

  1. Have an init script in Databricks, something like:
from cargoai import run
from cargoai.pipeline import create_pipeline

# Which CI build to load, passed in as a notebook widget/parameter
branch = dbutils.widgets.get("branch")

# Load the project configuration from the branch-specific build that CI copied to DBFS
conf = run.get_config(
    project_path=f"/dbfs/project_name/build/cicd/{branch}"
)
catalog = run.create_catalog(config=conf)
pipeline = create_pipeline()
After running this init script, conf, catalog, and pipeline will be available in the notebook.

  2. Call this init script when you want to run a branch or the master branch in production, like:
    %run "/Projects/InitialSetup/load_pipeline" $branch="master"

  3. For development and testing, you can run specific nodes:
    pipeline = pipeline.only_nodes_with_tags(*tags)

  4. Then run a full or a partial pipeline with just SequentialRunner().run(pipeline, catalog); a complete runner cell is sketched below.
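
A minimal sketch of that runner cell, assuming pipeline and catalog were created by the init notebook loaded via %run above (the tag name is only an example):

# Runner cell: `pipeline` and `catalog` come from the init script above.
from kedro.runner import SequentialRunner

# Optionally narrow the run to tagged nodes while developing/testing;
# "preprocessing" is a placeholder tag.
tags = ["preprocessing"]
pipeline_to_run = pipeline.only_nodes_with_tags(*tags) if tags else pipeline

# Execute the full or partial pipeline against the catalog
SequentialRunner().run(pipeline_to_run, catalog)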

In production, this notebook can be scheduled by Databricks. If you are on Azure Databricks, you can use Azure Data Factory to schedule and run it.

mayurc
  • How do you deal with combining the production conf with the credentials it needs? I was thinking of just setting them as env vars, but I don't know if Kedro handles overriding the ones from conf with the env vars. – dres Jan 21 '20 at 20:43
  • I guess it seems to me that it would be nice to package all the conf, data, etc. that is committed to git into the wheel, so that we don't have to build some custom dbfs copy. Can't I just add the paths to setup.py to be packaged? – dres Jan 21 '20 at 20:49
  • I don't think Kedro looks at the env vars over `conf`. Kedro only looks at the `credentials**` files in `conf`. For production credentials, you can create a `credentials` file in `conf` under your `/dbfs/project_name/` path (env vars only come into play when reading from S3). Or you can write an `__init__` on `ProjectContext`, which extends `KedroContext`: ```def __init__(self, project_path: Union[Path, str]): super().__init__(project_path, extra_params=dict(creds=os.environ.get('pwd')))```; this will replace the ConfigLoader creds param. – mayurc Jan 21 '20 at 21:38
  • I think it is discouraged to package `data` and the credentials part of `conf` into the wheel: https://kedro.readthedocs.io/en/latest/04_user_guide/03_configuration.html#credentials Your data could be on a shared filesystem, e.g. `dbfs`. The copy of `conf` files other than `credentials` can happen as part of `kedro package`. – mayurc Jan 21 '20 at 21:42
  • Yeah, I wouldn't package credentials, only other conf. Data packaging doesn't appear discouraged, and I could see how it would be useful for some small datasets. Thanks again for your help! – dres Jan 21 '20 at 21:54
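
For reference, here is the ProjectContext override from the comment above written out as a runnable sketch. It only reproduces the commenter's suggestion: the import path and required class attributes differ between Kedro versions, and whether extra_params really ends up overriding credentials depends on the version, so verify before relying on it. The 'creds' key, the 'pwd' env var, and the version string are placeholders.

import os
from pathlib import Path
from typing import Union

from kedro.context import KedroContext  # import path varies by Kedro version


class ProjectContext(KedroContext):
    project_name = "cargoai"      # the example project package used above
    project_version = "0.15.9"    # placeholder Kedro version

    def __init__(self, project_path: Union[Path, str]):
        # Read secrets from environment variables set on the cluster instead
        # of committing them to conf/; 'creds' and 'pwd' are placeholders.
        super().__init__(
            project_path,
            extra_params=dict(creds=os.environ.get("pwd")),
        )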

I found the best option was to just use another tool for packaging, deploying, and running the job. Using MLflow with Kedro seems like a good fit. I do almost everything in Kedro but use MLflow for packaging and job execution: https://medium.com/@QuantumBlack/deploying-and-versioning-data-pipelines-at-scale-942b1d81b5f5

The MLproject file looks something like this:

name: My Project

conda_env: conda.yaml

entry_points:
  main:
    command: "kedro install && kedro run"

Then running it with:

mlflow run -b databricks -c cluster.json . -P env="staging" --experiment-name /test/exp
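
The same run can also be kicked off from Python through the MLflow projects API, for example from a scheduler. This just mirrors the CLI line above and assumes the same cluster.json and experiment path:

# Python equivalent of the CLI invocation above (illustration, not from the answer)
import mlflow

mlflow.projects.run(
    uri=".",
    backend="databricks",
    backend_config="cluster.json",
    parameters={"env": "staging"},
    experiment_name="/test/exp",
)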

dres

So there is a section of the documentation that deals with Databricks:

https://kedro.readthedocs.io/en/latest/04_user_guide/12_working_with_databricks.html

The easiest way to get started will probably be to sync with git and run via a Databricks notebook. However, as mentioned, there are other approaches that use the ".whl" file and reference the "conf" folder.

Tom Goldenberg
  • The docs you reference are OK for development but not for syncing and running a production pipeline. In particular, the production credentials need to come from somewhere, and loading all the dependencies on the cluster is not addressed. – dres Jan 21 '20 at 20:41