
I am trying to save dataframes to parquet and then load them into Redshift. To do that, I do the following:

from io import BytesIO

parquet_buffer = BytesIO()
df.to_parquet(parquet_buffer, index=False, compression='gzip')
s3.Bucket(write_bucket).put_object(Key=write_path, Body=parquet_buffer.getvalue())

I then load the saved file directly into Redshift using the "COPY" command:

COPY table_name
FROM 'write_path'
IAM_ROLE 'my_iam_role'
FORMAT AS PARQUET;

It leads to the following error: write path: has an incompatible Parquet schema for column ...

If I apply the same procedure with .csv, it works just fine. What causes the problem when switching to parquet?

FrankyBravo
  • Could you share a sample of your data? – bcosta12 Dec 19 '19 at 13:55
  • are you using pyarrow or fastparquet? I recommend pyarrow – Jon Scott Dec 19 '19 at 16:24
  • I am using Pyarrow – FrankyBravo Dec 19 '19 at 16:26
  • Are you sure Redshift knows your parquet file is gzipped (according to [docs](https://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-columnar.html) it will know if the files have a `.gz` suffix)? Try adding the `GZIP` flag to the `COPY` command. – botchniaque Dec 20 '19 at 05:54
  • I don't think it is possible to have a GZIP flag as well as FORMAT AS PARQUET. I was hoping Redshift would recognize it by itself. However, the problem does not lie here; changing the compression to None did not change anything. – FrankyBravo Dec 28 '19 at 14:57

2 Answers


If it's a problem with an incompatible schema, there are some differences in column names or types.

You'd need to compare your table schema with the parquet schema side by side - that should give you the answer.

So

  1. print the parquet schema (see the pyarrow sketch after the query below)

  2. get the redshift schema:

    select *
    from svv_columns
    where table_name = 'table_name'
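
For step 1, a minimal sketch of printing the schema with pyarrow (which you said you're using); the file name is just a placeholder for a local copy of the file you uploaded:

import pyarrow.parquet as pq

# Placeholder path: a local copy of the Parquet file that was uploaded to S3
schema = pq.read_schema('my_file.parquet')
print(schema)  # prints each column with its Parquet/Arrow type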
    
botchniaque

As suggested above, you need to make sure the datatypes match between parquet and Redshift. When Redshift copies data from a parquet file it strictly checks the types. Parquet uses primitive types such as

binary, int32, int64, int96

In the example below, the date is stored as int32 and the timestamps as int96 in Parquet. Make sure the corresponding columns are DATE and TIMESTAMP types in Redshift.

parquet-tools schema 20191217_175027_00001_krvcf_0552e0f1-5e05-4cb1-ae88-fe70ce5b91b5



message hive_schema {
  optional int32 operating_day (DATE);
  optional binary asset_id (UTF8);
  optional binary account_id (UTF8);
  optional int96 found_time;
  optional int96 trutst_time;

}
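
As a rough sketch (reusing the column names from the schema above and assuming they currently arrive in the dataframe as strings), you can make pandas carry real date/timestamp dtypes so pyarrow writes them as DATE and TIMESTAMP rather than plain binary strings:

import pandas as pd

# Assumption: operating_day and found_time are plain strings in df.
# Converting them before to_parquet lets pyarrow write proper DATE / TIMESTAMP
# physical types, which is what the Redshift columns expect.
df['operating_day'] = pd.to_datetime(df['operating_day']).dt.date  # written as int32 DATE
df['found_time'] = pd.to_datetime(df['found_time'])                # written as a timestamp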
keder jar
  • It might be the problem: my timestamp has the format OPTIONAL BINARY, but Redshift expects a timestamp of the form 2019-12-12 00:20:04. I tried switching to varchar in Redshift just to test, and the problem remained. If I write the file with to_parquet, the column type should be handled correctly by pandas, right? – FrankyBravo Dec 28 '19 at 21:03
  • For my parquet file, the column is a string, but the Redshift column is a timestamp, and it doesn't work. Is there any other workaround, like adding some other option to the COPY command? – Emma Y Mar 31 '20 at 16:19