19

I'm working on a script where I'm sending a dataframe to BigQuery:

load_job = bq_client.load_table_from_dataframe(
    df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE])
)

# Wait for the load job to complete
return load_job.result() 

This works fine, but only if a schema has already been defined in BigQuery or if I define the schema of the job in my script. If no schema has been defined, I get the following error:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1661, in load_table_from_dataframe
    dataframe.to_parquet(tmppath, compression=parquet_compression)
  File "/env/local/lib/python3.7/site-packages/pandas/core/frame.py", line 2237, in to_parquet
    **kwargs
  File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 254, in to_parquet
    **kwargs
  File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 117, in write
    **kwargs
  File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 1270, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 426, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1311, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1578661876547574000

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 151, in main
    df = df(param1, param2)
  File "/user_code/main.py", line 141, in get_df
    df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE])
  File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1677, in load_table_from_dataframe
    os.remove(tmppath)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp_ps5xji9_job_634ff274.parquet'

Why is pyarrow generating this error? How can I solve it other than by pre-defining the schema?
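
For reference, by "defining the schema of my job in my script" I mean passing an explicit schema through a job config, roughly like the sketch below (the field names here are just placeholders):

from google.cloud import bigquery

# Placeholder schema: replace the fields with the dataframe's actual columns.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("program_id", "STRING"),
        bigquery.SchemaField("created_at", "TIMESTAMP"),
    ]
)

load_job = bq_client.load_table_from_dataframe(
    df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE]), job_config=job_config
)
load_job.result()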

Simon Breton

6 Answers

14

I was getting the same error.

When I inspected the dataframe, I saw a value like this: 2021-09-30 23:59:59.999999998

You probably have a precision mismatch between your date fields and the BigQuery default. I then used this code:

df['date_column'] = df['date_column'].astype('datetime64[s]')

After that, the problem was solved for me.
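
If you need to keep sub-second precision, a gentler variant is to truncate to microseconds (the precision of a BigQuery TIMESTAMP) instead of whole seconds. A minimal sketch, assuming a datetime64[ns] column:

import pandas as pd

# Example column with nanosecond precision, like the value above
df = pd.DataFrame({'date_column': pd.to_datetime(['2021-09-30 23:59:59.999999998'])})

# Drop the nanosecond part but keep microseconds
df['date_column'] = df['date_column'].dt.floor('us')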

Bushmaster
  • This does work. Interestingly, though, the literal `'datetime64[s]'` appears to be missing from the pandas type stubs. Pyright reports: `Argument of type "Literal['datetime64[s]']" cannot be assigned to parameter "dtype" of type "ExtensionDtype" in function "astype" "Literal['datetime64[s]']" is incompatible with "ExtensionDtype"`. – webelo Mar 03 '23 at 18:20
7

The default behavior when converting from pandas to Arrow or Parquet is to not allow silent data loss. There are options you can set when doing the conversion to permit unsafe casts that cause loss of timestamp precision or other forms of data loss. The BigQuery Python API would need to set these options, so it may be a bug in the BigQuery library. I suggest reporting it on their issue tracker: https://github.com/googleapis/google-cloud-python
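
For reference, the pyarrow-level options look roughly like this. This is a sketch of the library calls involved, not what the BigQuery client actually does internally; the DataFrame and output path are made up:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Made-up nanosecond-precision data
df = pd.DataFrame({'ts': pd.to_datetime(['2020-01-10 13:11:16.547574123'])})

# safe=False permits lossy casts when converting from pandas to Arrow
table = pa.Table.from_pandas(df, safe=False)

# coerce_timestamps + allow_truncated_timestamps permit dropping
# sub-microsecond precision when writing Parquet
pq.write_table(table, '/tmp/example.parquet',
               coerce_timestamps='us', allow_truncated_timestamps=True)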

Wes McKinney
  • I also got a similar error when trying to convert a BigQuery query result with `to_dataframe`, and there is very little documentation on BigQuery. `ArrowInvalid: Casting from timestamp[us, tz=UTC] to timestamp[ns] would result in out of bounds timestamp: -61535548800000000` – Yonas Kassa Jun 20 '20 at 15:16
7

The solution for me was to add the following kwargs to to_parquet:

parquet_args = {
    'coerce_timestamps': 'us',
    'allow_truncated_timestamps': True,
}

You have to set both of them. If you set just allow_truncated_timestamps, it will still raise the error if coerce_timestamps is None. I think the idea is that you only want to suppress the error if you're explicitly asking for coercion. Anyway, the docs are clear about it, but this behavior wasn't obvious to me.
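
In context, that means forwarding the dict as keyword arguments when writing with the pyarrow engine; pandas passes extra **kwargs through to pyarrow's write_table. A minimal sketch (the DataFrame and path are just placeholders):

import pandas as pd

# Placeholder data with nanosecond precision
df = pd.DataFrame({'ts': pd.to_datetime(['2021-09-30 23:59:59.999999998'])})

parquet_args = {
    'coerce_timestamps': 'us',
    'allow_truncated_timestamps': True,
}

# Extra keyword arguments are forwarded to pyarrow.parquet.write_table
df.to_parquet('/tmp/example.parquet', engine='pyarrow', **parquet_args)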

BHP
3

If you are using write_dataset, here is example code using file_options to eliminate this error:

import pyarrow.dataset as ds
parquet_format = ds.ParquetFileFormat()
file_options = parquet_format.make_write_options(coerce_timestamps='us', allow_truncated_timestamps=True)

ds.write_dataset(..., file_options=file_options)

Added because anyone who searches for the PyArrow error in the title will end up here.
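
To make that concrete, here is an end-to-end sketch; the table contents and output directory are made up for illustration:

import pyarrow as pa
import pyarrow.dataset as ds

parquet_format = ds.ParquetFileFormat()
file_options = parquet_format.make_write_options(
    coerce_timestamps='us', allow_truncated_timestamps=True
)

# Made-up nanosecond-precision data, just to exercise the options
table = pa.table({'ts': pa.array([1578661876547574000], type=pa.timestamp('ns'))})
ds.write_dataset(table, '/tmp/example_dataset',
                 format=parquet_format, file_options=file_options)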

Contango
1

I think these errors arise because the pyarrow.parquet module used by the BigQuery library doesn't convert Python's built-in datetime or time types into something that BigQuery recognises by default, whereas the BigQuery library does have its own method for converting pandas types.

I was able to get it to upload timestamps by changing all instances of datetime.datetime or time.time to pandas.Timestamp. For example:

my_df['timestamp'] = datetime.utcnow()

would need to be changed to

my_df['timestamp'] = pd.Timestamp.now()

Matt Simmons
  • This is a serious bug. What about dealing with a situation where you load data from an API call and need to set the dtype there? I am only interested in keeping the date, but it throws an error saying this conversion will lose `timestamp`. – Shivam Sahil Aug 11 '21 at 19:27
  • @ShivamSahil From my testing, this issue was fixed by pandas 1.1.0. If your application requires an older version of pandas, I recommend using pandas-gbq, which serializes to CSV instead of Parquet. – Tim Swast Oct 28 '21 at 22:05
0

In my testing of https://github.com/googleapis/python-bigquery-pandas/pull/413, this issue is fixed by upgrading to pandas 1.1.0+.

Looking at the pandas 1.1.0 changelog, there have been several bug fixes relating to timestamp data. I'm not sure which one in particular would have helped here, but potentially the fix for mixing and matching different timezones. https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#parsing-timezone-aware-format-with-different-timezones-in-to-datetime

Tim Swast