I have a Dataflow pipeline that works with DirectRunner, but when I create a template version of it with DataflowRunner I get this error:
| 'Read from BQ Table' >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
.\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\io\gcp\bigquery.py:1971: BeamDeprecationWarning: options is deprecated since First stable
release. References to <pipeline>.options will not be supported
temp_location = pcoll.pipeline.options.view_as(
.\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\io\gcp\bigquery_file_loads.py:900: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
temp_location = p.options.view_as(GoogleCloudOptions).temp_location
INFO:apache_beam.runners.portability.stager:Executing command: ['C:\\Users\\PhuongAnhNguenVenefi\\virtualenvs\\nieuwbouw-data\\Scripts\\python.exe', '-m', 'pip', 'download', '--dest', 'C:\\Users\\PHUONG~1\\AppData\\Local\\Temp\\dataflow-requirements-cache', '-r', 'requirements-template.txt', '--exists-action', 'i', '--no-binary', ':all:']
Traceback (most recent call last):
File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\utils\processes.py", line 91, in check_output
out = subprocess.check_output(*args, **kwargs)
File ".\anaconda3\lib\subprocess.py", line 411, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File ".\anaconda3\lib\subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\PhuongAnhNguenVenefi\\virtualenvs\\nieuwbouw-data\\Scripts\\python.exe', '-m', 'pip', 'download', '--dest', 'C:\\Users\\PHUONG~1\\AppData\\Local\\Temp\\dataflow-requirements-cache', '-r', 'requirements-template.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File ".\anaconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File ".\Desktop\nieuwbouw-data\am_gcp\automation\template_housetype_am.py", line 288, in <module>
run()
File ".\Desktop\nieuwbouw-data\am_gcp\automation\template_housetype_am.py", line 283, in run
p.run().wait_until_finish()
File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\pipeline.py", line 514, in run
return Pipeline.from_runner_api(
File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\pipeline.py", line 547, in run
return self.runner.run_pipeline(self, self._options)
File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 493, in run_pipeline
artifacts=environments.python_sdk_dependencies(options)))
File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\transforms\environments.py", line 623, in python_sdk_dependencies
staged_name in stager.Stager.create_job_resources(
File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\runners\portability\stager.py", line 177, in create_job_resources
(
File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\utils\retry.py", line 236, in wrapper
return fun(*args, **kwargs)
File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\runners\portability\stager.py", line 571, in _populate_requirements_cache
processes.check_output(cmd_args, stderr=processes.STDOUT)
File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\utils\processes.py", line 96, in check_output
raise RuntimeError( \
RuntimeError: Full traceback: Traceback (most recent call last):
File ".\virtualenvs\nieuwbouw-data\lib\site-packages\apache_beam\utils\processes.py", line 91, in check_output
out = subprocess.check_output(*args, **kwargs)
File ".\anaconda3\lib\subprocess.py", line 411, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File ".\anaconda3\lib\subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\PhuongAnhNguenVenefi\\virtualenvs\\nieuwbouw-data\\Scripts\\python.exe', '-m', 'pip', 'download', '--dest', 'C:\\Users\\PHUONG~1\\AppData\\Local\\Temp\\dataflow-requirements-cache', '-r', 'requirements-template.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1.
Pip install failed for package: -r
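Beam only surfaces pip's exit status here, not its output, so to debug this I rebuilt the stager's exact pip invocation (copied from the traceback above) so it can be run by hand in the same virtualenv and pip's real error read. This is a sketch; the cache directory name is a placeholder:

```python
# Rebuild the pip command that Beam's stager runs (taken verbatim from
# the traceback above), so it can be executed manually for debugging.
import sys

def requirements_cache_cmd(requirements_file, cache_dir):
    # Same arguments the stager passes in _populate_requirements_cache.
    return [
        sys.executable, "-m", "pip", "download",
        "--dest", cache_dir,
        "-r", requirements_file,
        "--exists-action", "i",
        # --no-binary :all: forces source distributions; any pinned
        # package that cannot build from source on this machine fails.
        "--no-binary", ":all:",
    ]

cmd = requirements_cache_cmd("requirements-template.txt",
                             "dataflow-requirements-cache")
print(" ".join(cmd))
```

Running the printed command in the same virtualenv shows pip's full output. My suspicion (an assumption, not confirmed by the log) is that `--no-binary :all:` is the culprit on Windows, since it makes pip refuse wheels and build every pinned package from source.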
The deploy command:
python template_housetype_am.py --runner DataflowRunner --project vf-scrapers --region=europe-west4 --staging_location gs://am_scraper/test --temp_location gs://am_scraper/test --template_location gs://am_scraper/templates/template_housetype_am --experiment=use_beam_bq_sink --requirements_file requirements-template.txt --save_main_session True
The requirements-template.txt file:
apache-beam[gcp]==2.25.0
google-cloud==0.34.0
google-cloud-bigquery==1.28.0
google-cloud-storage==1.33.0
The issue does not come from the template code itself; I get this error even when the template code contains only import statements:
import argparse
import logging
import re
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage
I tried:
Using setup.py per this suggestion, but the setup_file argument is discarded: WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['setup.py', 'True']
Not including the requirements file, which successfully creates the template, but the job then fails because google-cloud-storage is not installed on the workers. In other words, specifying dependencies is a must for me, unless there is another way to install dependencies on Dataflow.
Deploying the template from Cloud Shell, which worked. However, I need to deploy it from my local machine.
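On the setup.py attempt: the discarded args are ['setup.py', 'True'], which suggests two separate problems (my reading of the warning, not confirmed): the flag name did not match `--setup_file` exactly as written (e.g. `--setup-file` with hyphens), and `--save_main_session` is a store-true flag in Beam, so a trailing `True` is parsed as a stray positional and dropped. For reference, this is a minimal setup.py sketch for shipping the worker dependencies; the project name and version are placeholders:

```python
# setup.py -- minimal sketch for shipping worker dependencies via
# --setup_file instead of --requirements_file.
import setuptools

setuptools.setup(
    name="nieuwbouw-data-pipeline",  # hypothetical project name
    version="0.0.1",
    packages=setuptools.find_packages(),
    # Same pins as requirements-template.txt; apache-beam itself is
    # already installed on the Dataflow workers, so it is not listed.
    install_requires=[
        "google-cloud-bigquery==1.28.0",
        "google-cloud-storage==1.33.0",
    ],
)
```

The deploy command would then use `--setup_file ./setup.py` (underscore, explicit path) instead of `--requirements_file requirements-template.txt`, and `--save_main_session` with no trailing `True`.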