
I'm using Python pandas to write a DataFrame to Parquet in GCS, then using the BigQuery Transfer Service to transfer the GCS Parquet file to a BigQuery table. Sometimes, when the DataFrame is small, an entire column might have NULL values. When this occurs, BigQuery treats that all-null column as an INTEGER type instead of the type the Parquet file claims it to be.
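For context, the write path looks roughly like this (the column and bucket names are illustrative, and it assumes the pyarrow engine with gcsfs installed for `gs://` paths):

import pandas as pd

# Illustrative: a small DataFrame where the "asin" column is entirely null.
df = pd.DataFrame({"report_id": [1, 2], "asin": [None, None]})

# With the pyarrow engine, an all-None object column carries no usable
# type information, so downstream readers have to guess its type.
df.to_parquet("gs://my-bucket/dataset_health_reports.parquet", engine="pyarrow")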

When trying to append it to an existing table that expects that column to be NULLABLE STRING, the BigQuery Transfer Service fails with: `INVALID_ARGUMENT: Provided Schema does not match Table project.dataset.dataset_health_reports. Field asin has changed type from STRING to INTEGER; JobID: xxx`

When I use the BigQuery Data Transfer Service (BQDTS) to write the Parquet file to a new table, it can create the table, but the null column becomes an INTEGER type.

Any idea how to make BQDTS respect the original type, or how to specify the column types manually?


  • Can you clarify how you are doing the transfer? I'm thrown off by your error, which compares `dataset_health_reports` and `new_table`. Are you executing a transfer from GCS to BQ (`new_table`)? Is `new_table` an empty table created prior to the transfer? – Ricco D Aug 19 '22 at 01:26
  • If possible, can you provide the GCS file you are using for the transfer, so community members can reproduce your issue? Make sure there is no sensitive information in it. – Ricco D Aug 19 '22 at 01:36

1 Answer


To remedy this issue, you can pre-define the schema for columns that can be ambiguous. For example, if I want the `street_address_two` column to be STRING, I can define the `schema` argument in `LoadJobConfig` as `[bigquery.SchemaField("street_address_two", "STRING")]`. The code will look like:

from google.cloud import bigquery

# Explicitly type the ambiguous column so an all-null Parquet column
# is not inferred as INTEGER.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("street_address_two", "STRING")
    ],
    source_format=bigquery.SourceFormat.PARQUET,
)
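A usage sketch showing how this config is passed to a load job (the bucket path and table ID below are placeholders; substitute your own):

client = bigquery.Client()

# Hypothetical URI and table ID; replace with your own values.
load_job = client.load_table_from_uri(
    "gs://my-bucket/dataset_health_reports.parquet",
    "project.dataset.dataset_health_reports",
    job_config=job_config,
)
load_job.result()  # Wait for the load to complete.

Note this uses the BigQuery Python client's load job rather than BQDTS, which is what makes the explicit `schema` argument available.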