I have a Dataflow pipeline that works with DirectRunner, but when I create a template version of it with DataflowRunner I get this error:

  | 'Read from BQ Table' >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
.\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\io\gcp\bigquery.py:1971: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
  temp_location = pcoll.pipeline.options.view_as(
.\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\io\gcp\bigquery_file_loads.py:900: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
  temp_location = p.options.view_as(GoogleCloudOptions).temp_location
INFO:apache_beam.runners.portability.stager:Executing command: ['C:\\Users\\PhuongAnhNguenVenefi\\virtualenvs\\nieuwbouw-data\\Scripts\\python.exe', '-m', 'pip', 'download', '--dest', 'C:\\Users\\PHUONG~1\\AppData\\Local\\Temp\\dataflow-requirements-cache', '-r', 'requirements-template.txt', '--exists-action', 'i', '--no-binary', ':all:']
Traceback (most recent call last):
  File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\utils\processes.py", line 91, in check_output
    out = subprocess.check_output(*args, **kwargs)
  File ".\anaconda3\lib\subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File ".\anaconda3\lib\subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\PhuongAnhNguenVenefi\\virtualenvs\\nieuwbouw-data\\Scripts\\python.exe', '-m', 'pip', 'download', '--dest', 'C:\\Users\\PHUONG~1\\AppData\\Local\\Temp\\dataflow-requirements-cache', '-r', 'requirements-template.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1.   

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File ".\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File ".\Desktop\nieuwbouw-data\am_gcp\automation\template_housetype_am.py", line 288, in <module>
    run()
  File ".\Desktop\nieuwbouw-data\am_gcp\automation\template_housetype_am.py", line 283, in run
    p.run().wait_until_finish()
  File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\pipeline.py", line 514, in run
    return Pipeline.from_runner_api(
  File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\pipeline.py", line 547, in run
    return self.runner.run_pipeline(self, self._options)
  File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 493, in run_pipeline
    artifacts=environments.python_sdk_dependencies(options)))
  File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\transforms\environments.py", line 623, in python_sdk_dependencies
    staged_name in stager.Stager.create_job_resources(
  File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\runners\portability\stager.py", line 177, in create_job_resources
    (
  File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\utils\retry.py", line 236, in wrapper
    return fun(*args, **kwargs)
  File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\runners\portability\stager.py", line 571, in _populate_requirements_cache
    processes.check_output(cmd_args, stderr=processes.STDOUT)
  File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\utils\processes.py", line 96, in check_output
    raise RuntimeError( \
RuntimeError: Full traceback: Traceback (most recent call last):
  File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\utils\processes.py", line 91, in check_output
    out = subprocess.check_output(*args, **kwargs)
  File ".\anaconda3\lib\subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File ".\anaconda3\lib\subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\PhuongAnhNguenVenefi\\virtualenvs\\nieuwbouw-data\\Scripts\\python.exe', '-m', 'pip', 'download', '--dest', 'C:\\Users\\PHUONG~1\\AppData\\Local\\Temp\\dataflow-requirements-cache', '-r', 'requirements-template.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1.   

 Pip install failed for package: -r
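
The traceback above only reports that pip exited with status 1; pip's own output is not shown. A minimal sketch for surfacing it is to rerun the same `pip download` command from the log outside of Beam and print everything pip writes. The helper names `build_pip_download_cmd` and `run_pip_download` are hypothetical, and the destination directory is a fresh temp directory rather than the cache path in the log.

```python
import subprocess
import sys
import tempfile


def build_pip_download_cmd(requirements_file, dest_dir):
    """Build the same `pip download` arguments that appear in the log above."""
    return [
        sys.executable, "-m", "pip", "download",
        "--dest", dest_dir,
        "-r", requirements_file,
        "--exists-action", "i",
        "--no-binary", ":all:",
    ]


def run_pip_download(requirements_file, dest_dir=None):
    """Run the command and return (exit_code, combined stdout/stderr text)."""
    # Assumption: a throwaway temp directory is good enough for diagnosis.
    dest_dir = dest_dir or tempfile.mkdtemp(prefix="requirements-cache-")
    result = subprocess.run(
        build_pip_download_cmd(requirements_file, dest_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so pip's error text is captured
        universal_newlines=True,
    )
    return result.returncode, result.stdout


# Example usage (run from the directory holding requirements-template.txt):
#   code, output = run_pip_download("requirements-template.txt")
#   print(output)
#   print("pip exit status:", code)
```

The last lines of the printed output should name the package that pip could not download or build, which the Beam error message does not include.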

The deployment command:

python template_housetype_am.py --runner DataflowRunner --project vf-scrapers --region=europe-west4 --staging_location gs://am_scraper/test --temp_location gs://am_scraper/test --template_location gs://am_scraper/templates/template_housetype_am --experiment=use_beam_bq_sink --requirements_file requirements-template.txt --save_main_session True 

The requirements-template.txt file:

apache-beam[gcp]==2.25.0
google-cloud==0.34.0
google-cloud-bigquery==1.28.0
google-cloud-storage==1.33.0

The issue does not come from the template code: I get this error even if the template contains only import statements:

import argparse
import logging
import re
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage

I tried:

  • Use setup.py per this suggestion, but the setup_file argument is discarded:

    WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['setup.py', 'True'] WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['setup.py', 'True']

  • Not including the requirements file, which successfully creates the template, but the flow then fails because google-cloud-storage is not installed. In other words, specifying dependencies is a must for me, unless there are other ways to install dependencies on Dataflow.

  • Deploy the template using Cloud Shell, which worked. However, I need to deploy it from my local machine.
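
One possible explanation for the stray 'True' in the warning from the first bullet, sketched with plain argparse: if `--save_main_session` is a boolean switch that takes no value (an assumption about how the option is declared), then the `True` written after it in the deployment command is left over as an unrecognized argument. The parser below is a stand-in built for illustration, not Beam's own parser, and the sketch does not explain why 'setup.py' was discarded as well.

```python
import argparse

# Stand-in parser (hypothetical): one boolean switch and one valued option.
parser = argparse.ArgumentParser()
parser.add_argument("--save_main_session", action="store_true", default=False)
parser.add_argument("--requirements_file", default=None)

# Same shape as the tail of the deployment command above.
argv = [
    "--requirements_file", "requirements-template.txt",
    "--save_main_session", "True",
]

# parse_known_args returns the parsed namespace plus any leftover arguments.
known, unknown = parser.parse_known_args(argv)

print(known.save_main_session)  # the switch itself is set
print(unknown)                  # the trailing "True" is left unparsed
```

Under that assumption, passing `--save_main_session` with no value would set the flag without leaving anything behind.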

pa-nguyen
  • Which version of pip are you using? I think this is a dependency installing issue due to pip version, try to install the latest version of pip and try again – Messier_31 Dec 15 '20 at 21:23
  • @Messier_31 I am also facing same issue. I tried upgrading my pip version to the latest one (pip-20.3.3) but still getting same error. – Kaustubh Ghole Jan 10 '21 at 17:00
