
I am trying to save dataframes to parquet and then load them into Redshift. To do that, I do the following:

from io import BytesIO

parquet_buffer = BytesIO()
df.to_parquet(parquet_buffer, index=False, compression='gzip')
s3.Bucket(write_bucket).put_object(Key=write_path, Body=parquet_buffer.getvalue())

I then load the saved file directly into Redshift using the "COPY" command:

COPY table_name
FROM 'write_path'
IAM_ROLE 'my_iam_role'
FORMAT AS PARQUET;

It leads to the following error: write path: has an incompatible Parquet schema for column ...

If I apply the same procedure with .csv, it works just fine. What causes the problem when switching to parquet?

FrankyBravo
  • Could you share a sample of your data? – bcosta12 Dec 19 '19 at 13:55
  • are you using pyarrow or fastparquet? I recommend pyarrow – Jon Scott Dec 19 '19 at 16:24
  • I am using Pyarrow – FrankyBravo Dec 19 '19 at 16:26
  • Are you sure Redshift knows your parquet file is gzipped (according to [docs](https://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-columnar.html) it will know if the files have a `.gz` suffix)? Try adding the `GZIP` flag to the `COPY` command. – botchniaque Dec 20 '19 at 05:54
  • I don't think it is possible to have a GZIP flag as well as FORMAT AS PARQUET. I was hoping Redshift would recognize it by itself. However, the problem does not lie here; changing the compression to None did not change anything. – FrankyBravo Dec 28 '19 at 14:57

2 Answers


If it's a problem with an incompatible schema, there are some differences in column names or types.

You'd need to compare your table schema with the parquet schema side by side - that should give you the answer.

So

  1. print the parquet schema (see the pyarrow sketch after the query below)

  2. get the redshift schema:

    select *
    from svv_columns
    where table_name = 'table_name'
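
For step 1, a minimal sketch of printing the schema with pyarrow (which you said you're using); the file name is just a placeholder for a local copy of the file you uploaded:

import pyarrow.parquet as pq

# Placeholder path: a local copy of the Parquet file that was uploaded to S3
schema = pq.read_schema('my_file.parquet')
print(schema)  # prints each column with its Parquet/Arrow type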
    
botchniaque

As suggested above, you need to make sure the datatypes match between parquet and Redshift. When Redshift copies data from a parquet file it strictly checks the types. Parquet uses primitive types such as

binary, int32, int64, int96

In the example below, the date is stored as int32 and the timestamps as int96 in Parquet. Make sure the corresponding columns are DATE and TIMESTAMP types in Redshift.

parquet-tools schema 20191217_175027_00001_krvcf_0552e0f1-5e05-4cb1-ae88-fe70ce5b91b5



message hive_schema {
  optional int32 operating_day (DATE);
  optional binary asset_id (UTF8);
  optional binary account_id (UTF8);
  optional int96 found_time;
  optional int96 trutst_time;

}
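
As a rough sketch (reusing the column names from the schema above and assuming they currently arrive in the dataframe as strings), you can make pandas carry real date/timestamp dtypes so pyarrow writes them as DATE and TIMESTAMP rather than plain binary strings:

import pandas as pd

# Assumption: operating_day and found_time are plain strings in df.
# Converting them before to_parquet lets pyarrow write proper DATE / TIMESTAMP
# physical types, which is what the Redshift columns expect.
df['operating_day'] = pd.to_datetime(df['operating_day']).dt.date  # written as int32 DATE
df['found_time'] = pd.to_datetime(df['found_time'])                # written as a timestamp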
keder jar
  • It might be the problem: my timestamp has the format OPTIONAL BINARY, but Redshift expects a timestamp of the form 2019-12-12 00:20:04. I tried switching to varchar in Redshift just to test, and the problem remained. If I write the file with to_parquet, the column type should be handled correctly by pandas, right? – FrankyBravo Dec 28 '19 at 21:03
  • For my parquet file, the column is a string, but the Redshift column is a timestamp, and it doesn't work. Is there any other workaround, like adding some other option to the COPY command? – Emma Y Mar 31 '20 at 16:19