
Is there an example of a Python Dataflow Flex Template with more than one file, where the script imports other files from the same folder?

My project structure is like this:

├── pipeline
│   ├── __init__.py
│   ├── main.py
│   ├── setup.py
│   ├── custom.py

I'm trying to import custom.py inside main.py for a Dataflow Flex Template.

I receive the following error in the pipeline execution:

ModuleNotFoundError: No module named 'custom'

The pipeline works fine if I include all of the code in a single file and don't make any imports.
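
For illustration, main.py looks roughly like this (custom.transform is a hypothetical helper; the relevant part is the top-level import custom):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

import custom  # fails at pipeline execution: ModuleNotFoundError: No module named 'custom'


def run():
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["a", "b", "c"])
         | "Apply" >> beam.Map(custom.transform))  # custom.transform is a hypothetical helper


if __name__ == "__main__":
    run()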

Example Dockerfile:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ARG WORKDIR=/dataflow/template/pipeline
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY pipeline /dataflow/template/pipeline

COPY spec/python_command_spec.json /dataflow/template/

ENV DATAFLOW_PYTHON_COMMAND_SPEC /dataflow/template/python_command_spec.json

RUN pip install avro-python3 pyarrow==0.11.1 apache-beam[gcp]==2.24.0

ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"

Python spec file:

{
    "pyFile":"/dataflow/template/pipeline/main.py"
}
  

I am deploying the template with the following command:

gcloud builds submit --project=${PROJECT} --tag ${TARGET_GCR_IMAGE} .
Akshay Apte
  • Have you tried appending ${WORKDIR} to the PYTHONPATH environment variable? You can try adding `ENV PYTHONPATH="${WORKDIR}:${PYTHONPATH}"` to your Dockerfile. – Cubez Nov 18 '20 at 20:01
  • Yes, I tried appending to the PYTHONPATH; it didn't seem to work. – Akshay Apte Nov 19 '20 at 14:54
  • @AkshayApte do you have setup.py at the same level as custom.py? For me `find_packages` cannot find custom.py and it seems setup.py has to be one directory above - https://stackoverflow.com/questions/28573040/how-to-make-python-setuptools-find-top-level-modules - curious how you made it work. – Kazuki Nov 23 '20 at 05:57

4 Answers


I actually solved this by passing an additional setup_file parameter to the template execution. You also need to add a setup_file parameter to the template metadata (a sketch of the metadata entry is shown below).

--parameters setup_file="/dataflow/template/pipeline/setup.py"
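
The corresponding entry in the template metadata file might look roughly like this (a sketch; the name and description values are illustrative):

{
    "name": "pipeline-flex-template",
    "description": "Flex Template whose pipeline spans multiple files",
    "parameters": [
        {
            "name": "setup_file",
            "label": "Setup file",
            "helpText": "Path to setup.py inside the launcher container.",
            "isOptional": true
        }
    ]
}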

Apparently the ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py" line in the Dockerfile has no effect and doesn't actually pick up the setup file.

My setup file looked like this:

import setuptools

setuptools.setup(
    packages=setuptools.find_packages(),
    install_requires=[
        'apache-beam[gcp]==2.24.0'
    ],
 )
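
At execution time the parameter is then passed on the run command, roughly like this (the job name, template path and region are placeholders):

gcloud dataflow flex-template run "my-job" \
    --template-file-gcs-location "$TEMPLATE_PATH" \
    --parameters setup_file="/dataflow/template/pipeline/setup.py" \
    --region "$REGION"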
Akshay Apte
  • Wow, thanks for posting this. For other people who end up here, I also want to mention that py_modules in setup.py didn't work either. I'll try `find_packages()` now. – Kazuki Nov 21 '20 at 06:17
  • `find_packages()` somehow messed up my proto, so I'm still trying to figure out how to get py_modules to work. Hmm. – Kazuki Nov 21 '20 at 22:11
  • I tried this and get `Unrecognized parameter` when sending in `setup_file` as a parameter in that way – Travis Webb Jan 04 '21 at 00:36
  • You also need to add setup_file parameter to the template metadata – Akshay Apte Jan 06 '21 at 11:55
  • I've found useful documentation on this at https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies – jamiet Jan 25 '21 at 22:07
  • I have successfully used `FLEX_TEMPLATE_PYTHON_SETUP_FILE` to declare the location of `setup.py` (which looks exactly the same as @akshay-apte's snippet above), no `setup_file` parameter required. I'm writing this more than two months after Akshay's answer so perhaps something has changed in the Dataflow service in the interim period which means `FLEX_TEMPLATE_PYTHON_SETUP_FILE` now works. HTH. – jamiet Jan 26 '21 at 13:19
  • @jamiet, can you share the code you're using? I'm trying to do the same using `FLEX_TEMPLATE_PYTHON_SETUP_FILE` in the Dockerfile. The Dataflow logs do show Executing: python /dataflow/template/streaming_beam.py --setup_file=/dataflow/template/setup.py ... but it immediately throws a ModuleNotFoundError traceback. It is not actually performing the setup actions mentioned in setup.py. – Pavan Kumar Kattamuri Feb 27 '21 at 08:56
  • @PavanKumarKattamuri Sure, have posted as an answer – jamiet Feb 28 '21 at 09:43
  • Hi jamiet, could you share more details? I am having the same issue and have posted it on Stack Overflow here: https://stackoverflow.com/questions/67857611/error-when-running-python-flex-template-module-from-subdirectory-cannot-be-foun – user1068378 Jun 06 '21 at 09:57

After some tests I found out that, for some unknown reason, Python files in the working directory (WORKDIR) cannot be referenced with an import. But it works if you create a subfolder and move the Python dependencies into it. I tested it and it worked; for example, in your use case you can have the following structure:

├── pipeline
│   ├── main.py
│   ├── setup.py
│   ├── mypackage
│   │   ├── __init__.py
│   │   ├── custom.py

Then you will be able to reference it with import mypackage.custom. The Dockerfile should move custom.py into the proper directory:

RUN mkdir -p ${WORKDIR}/mypackage
RUN touch ${WORKDIR}/mypackage/__init__.py
COPY custom.py ${WORKDIR}/mypackage
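
With that in place, main.py can import the module in either form:

import mypackage.custom
# or equivalently:
from mypackage import custom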

And the dependency will be installed into the Python site-packages directory:

$ docker exec -it <container> /bin/bash
# find / -name custom.py
/usr/local/lib/python3.7/site-packages/mypackage/custom.py
rsantiago
  • Did you achieve a successfully running Dataflow job using this technique? I've tried reproducing it and am still getting the error `No module named 'protoc_gen'` (protoc_gen is the package I'm adding my module to). – jamiet Jan 25 '21 at 21:38
  • What is in your `setup.py` file? – jamiet Jan 25 '21 at 21:39

Here is my solution:

Dockerfile:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base:flex_templates_base_image_release_20210120_RC00

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY requirements.txt .


# Read https://stackoverflow.com/questions/65766066/can-i-make-flex-template-jobs-take-less-than-10-minutes-before-they-start-to-pro#comment116304237_65766066
# to understand why apache-beam is not being installed from requirements.txt
RUN pip install --no-cache-dir -U apache-beam==2.26.0
RUN pip install --no-cache-dir -U -r ./requirements.txt

COPY mymodule.py setup.py ./
COPY protoc_gen protoc_gen/

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/mymodule.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

and here is my setup.py:

import setuptools

setuptools.setup(
    packages=setuptools.find_packages(),
    install_requires=[],
    name="my df job modules",
)
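
With this layout, find_packages() picks up the protoc_gen package that the Dockerfile copies in, which is what lets mymodule.py import it at runtime without any extra pipeline parameters.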
jamiet

In my case I didn't need to pass setup_file in the command that triggers the Flex Template; here is my Dockerfile:

FROM gcr.io/dataflow-templates-base/python38-template-launcher-base

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY . .

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

# Install apache-beam and other dependencies to launch the pipeline
RUN pip install apache-beam[gcp]
RUN pip install -U -r ./requirements.txt
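
For reference, this assumes a build context along these lines (the package and module names are illustrative), so that COPY . . brings everything setup.py needs into the image:

├── Dockerfile
├── main.py
├── requirements.txt
├── setup.py
├── mypackage
│   ├── __init__.py
│   ├── custom.py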

This is the command:

gcloud dataflow flex-template run "job_ft" --template-file-gcs-location "$TEMPLATE_PATH" --parameters paramA="valA" --region "europe-west1"
Idhem