Using the latest Apache Beam SDK for Python (2.2.0), I get the following error when running a simple pipeline that reads from and writes to a BigQuery table.

Since a few rows have timestamps with a year earlier than 1900, the read operation fails. How can I patch the dataflow_worker package?

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
(4d31192aa4aec063): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
    def start(self):
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.spec.source.reader() as reader:
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
    for value in reader:
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativefileio.py", line 198, in __iter__
    for record in self.read_next_block():
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativeavroio.py", line 95, in read_next_block
    yield self.decode_record(record)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 110, in decode_record
    record, self.source.table_schema)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 104, in _fix_field_values
    record[field.name], field)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 83, in _fix_field_value
    return dt.strftime('%Y-%m-%d %H:%M:%S.%f UTC')
ValueError: year=200 is before 1900; the datetime strftime() methods require year >= 1900

1 Answer


Unfortunately, you cannot patch it to work with these timestamps, because that code is part of the internal implementation of Google's Apache Beam runner, Dataflow. You will have to wait until Google fixes it (should it be identified as a bug). Please report it as soon as possible, although this is more a limitation of the Python version in use than a bug.

The problem comes from strftime, as you can see in the error. The Python documentation explicitly mentions that it won't work with any year prior to 1900. A workaround on your end is to convert the timestamp to a string in BigQuery (as described in the BigQuery documentation), and then in your Beam pipeline convert it back to a timestamp or whatever type suits you best, as sketched below.
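
A minimal sketch of that workaround, assuming the Beam 2.2.0 Python API. The project, dataset, table, and column names (my_project.my_dataset.my_table, ts) are placeholders for your own. The query formats the problematic TIMESTAMP column as a STRING, so the Dataflow worker never calls strftime() on it; the pipeline then parses the string back with strptime(), which has no year >= 1900 restriction:

import datetime

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_ts(row):
    # %E4Y on the BigQuery side pads the year to four digits,
    # which is what Python's strptime() expects for %Y
    row['ts'] = datetime.datetime.strptime(row['ts_str'],
                                           '%Y-%m-%d %H:%M:%S')
    return row

p = beam.Pipeline(options=PipelineOptions())
(p
 | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(
     query="SELECT * EXCEPT(ts), "
           "FORMAT_TIMESTAMP('%E4Y-%m-%d %H:%M:%S', ts) AS ts_str "
           "FROM `my_project.my_dataset.my_table`",
     use_standard_sql=True))
 | 'ParseTimestamp' >> beam.Map(parse_ts))
p.run().wait_until_finish()

Since the query result no longer contains a TIMESTAMP field, the worker's avro decoding path that triggers the error is never exercised.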

There is also an example of how to convert a datetime object to a string in the same template as your error in this answer. In the same question there is another answer that explains what happened with this bug, how it was addressed in Python, and what you can do. Unfortunately, the solution seems to be avoiding strftime altogether and using some alternative instead.
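
For instance, a small self-contained illustration (the date is made up) showing two ways to produce the worker's output template without strftime():

import datetime

dt = datetime.datetime(200, 3, 1, 12, 30, 45)

# dt.strftime('%Y-%m-%d %H:%M:%S.%f UTC')  # ValueError: year=200 is before 1900

# isoformat() does not go through the platform strftime(), so old years work
print(dt.isoformat(' '))  # 0200-03-01 12:30:45

# Or build the string by hand from the datetime's attributes,
# mirroring the '%Y-%m-%d %H:%M:%S.%f UTC' template from the error
print('%04d-%02d-%02d %02d:%02d:%02d.%06d UTC' % (
    dt.year, dt.month, dt.day,
    dt.hour, dt.minute, dt.second, dt.microsecond))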
