
I am using GCP Composer to orchestrate the ETL…

When I created the instance, I set the Python version to Python 3.

One of the tasks uses the DataFlowPythonOperator, which works fine when initiated from our local dev Docker instance (Airflow v1.10.1 + Python 3.6.9).

It uses the Apache Beam Python 3.6 SDK 2.16.0 when I run it from the Docker image running Airflow v1.10.1.

Whenever we deploy to composer-1.7.9-airflow-1.10.1, the task runs with Python 2.7.

It also always runs the Dataflow job using the Google Cloud Dataflow SDK for Python 2.5.0 when initiated from Composer.

Composer defaults to Python 2.7, and that crashes a lot of the transformations…

I can’t find a way to configure Composer to use Python 3.x to create and run the Dataflow job…

Command:

$ gcloud composer environments describe etl --location us-central1

result:

softwareConfig:
    imageVersion: composer-1.7.9-airflow-1.10.1
    pythonVersion: '3'

2 Answers


The Python version of your Composer environment is unrelated to the Python version with which the Dataflow jobs are executed.

Currently, the DataflowPythonOperator hard-codes the Python version to 2. There is a pull request that fixes this, but it has yet to be released. You could wait for an Airflow release that includes the fix, or you could back-port it as described in detail in the second part of this answer.
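
For reference, the interpreter is hard-coded in the hook rather than the operator. In Airflow 1.10.x, DataFlowHook.start_python_dataflow in gcp_dataflow_hook.py builds the launch command roughly like this (a simplified excerpt, not the exact source):

    def start_python_dataflow(self, job_name, variables, dataflow,
                              py_options, append_job_name=True):
        name = self._build_dataflow_job_name(job_name, append_job_name)
        variables['job_name'] = name

        def label_formatter(labels_dict):
            return ['--labels={}={}'.format(key, value)
                    for key, value in labels_dict.items()]

        # The interpreter is fixed to "python2" regardless of what the
        # environment provides; this is what the pull request changes.
        self._start_dataflow(variables, name,
                             ["python2"] + py_options + [dataflow],
                             label_formatter)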

Also note that you have to add the Apache Beam SDK to the Python packages of your Composer environment. Since 2.16.0 is the first version that officially supports Python 3, I would suggest specifying apache-beam==2.16.0 in the package list.
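
For example, assuming the environment name and location from the question, the package can be added with the gcloud CLI:

$ gcloud composer environments update etl \
    --location us-central1 \
    --update-pypi-package apache-beam==2.16.0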

As to why you can launch jobs in Python 3 in your local Airflow setup, I would suspect that the python command there defaults to Python 3.
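
A quick way to confirm this inside the local Airflow container:

$ python --version

If that prints a 3.x version (3.6.9 in your setup) while python resolves to 2.7 on the Composer workers, that would explain the difference.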

  • In this case, I would need to install apache-beam inside the Composer environment, which is not what I am looking for... When using the available DataFlowPythonOperator, it runs the job with Python 2.7, which works for simple tasks, but it breaks when we have more complex transformations... – Soliman Dec 26 '19 at 06:14
  • Have you tried the workaround in the [linked answer](https://stackoverflow.com/questions/58545759/no-module-named-airfow-gcp-how-to-run-dataflow-job-that-uses-python3-beam-2-15/58631655#58631655)? It should not be necessary to explicitly install apache-beam inside Composer for it to work. – Daniel Duato Jan 15 '20 at 08:36

There are a few steps I followed that solved this problem:

  1. Upgrade your Composer instance to a higher version. I upgraded to composer-1.8.3-airflow-1.10.2 (the most recent at the time of writing this answer).
  2. You will need to override the DataFlowPythonOperator and DataFlowHook (you can follow this answer) or use this gist repo; a sketch of the override is shown after this list.
  3. Run your DAG; it should create a Dataflow job using Python 3.
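
A minimal sketch of the override in step 2, based on the approach in the linked answer. The class names DataFlowPython3Hook and DataFlowPython3Operator are illustrative, and the internals assume the Airflow 1.10.2 code paths:

    import re

    from airflow.contrib.hooks.gcp_dataflow_hook import DataFlowHook
    from airflow.contrib.operators.dataflow_operator import (
        DataFlowPythonOperator, GoogleCloudBucketHelper)


    class DataFlowPython3Hook(DataFlowHook):
        """DataFlowHook that launches the job with python3 instead of python2."""

        def start_python_dataflow(self, job_name, variables, dataflow,
                                  py_options, append_job_name=True):
            name = self._build_dataflow_job_name(job_name, append_job_name)
            variables['job_name'] = name

            def label_formatter(labels_dict):
                return ['--labels={}={}'.format(key, value)
                        for key, value in labels_dict.items()]

            # Swap the hard-coded "python2" for "python3".
            self._start_dataflow(variables, name,
                                 ["python3"] + py_options + [dataflow],
                                 label_formatter)


    class DataFlowPython3Operator(DataFlowPythonOperator):
        """DataFlowPythonOperator wired to the Python 3 hook above."""

        def execute(self, context):
            # Stage the job file from GCS to a local path, as the stock
            # operator does.
            bucket_helper = GoogleCloudBucketHelper(
                self.gcp_conn_id, self.delegate_to)
            self.py_file = bucket_helper.google_cloud_to_local(self.py_file)
            hook = DataFlowPython3Hook(gcp_conn_id=self.gcp_conn_id,
                                       delegate_to=self.delegate_to,
                                       poll_sleep=self.poll_sleep)
            dataflow_options = self.dataflow_default_options.copy()
            dataflow_options.update(self.options)
            # Convert option names from lowerCamelCase to snake_case.
            camel_to_snake = lambda n: re.sub(
                r'[A-Z]', lambda x: '_' + x.group(0).lower(), n)
            formatted_options = {camel_to_snake(key): dataflow_options[key]
                                 for key in dataflow_options}
            hook.start_python_dataflow(
                self.job_name, formatted_options,
                self.py_file, self.py_options)

Use DataFlowPython3Operator in your DAG in place of DataFlowPythonOperator; the task arguments stay the same.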

Happy coding...
