
I have about 10 huge parquet files (each about 60-100 GB), all with the same format and the same partitions. I want to combine all of them - what is the best way to do that? I keep running into memory issues on AWS, so I would like to avoid reading ALL of the data in at once. Thanks!

John Rotenstein
zhifff

2 Answers


Is the destination an S3 bucket? If so, Firehose is the way to combine the files.

Append data to an S3 object

Arlo Guthrie
    Yes, both the 10 parquet files and the destination are on S3. Is there a better way to do it in Glue? – zhifff Jan 16 '20 at 19:57

Run a Glue crawler over the files to create an external table in the Glue Data Catalog. You can then query the data from all 10 files through that one table.

Assuming you want to create a single parquet file, use the Redshift UNLOAD command to write it. See https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
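
As a rough illustration of the UNLOAD route, here is a minimal Python sketch that only builds the SQL text; it does not connect to Redshift or run anything. The table name, S3 prefix, and IAM role ARN are hypothetical placeholders, and the helper `build_unload_sql` is made up for this example. It assumes the crawled table is reachable from Redshift as an external schema.

```python
def build_unload_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Return a Redshift UNLOAD statement that writes `table` to S3 as Parquet.

    All three arguments are placeholders supplied by the caller; nothing here
    talks to AWS.
    """
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must start with s3://")
    # UNLOAD takes the query as a quoted string literal, so any single
    # quotes inside the query have to be doubled.
    query = f"SELECT * FROM {table}".replace("'", "''")
    return (
        f"UNLOAD ('{query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS PARQUET\n"
        # PARALLEL OFF asks Redshift to write serially instead of one
        # file per slice.
        "PARALLEL OFF;"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(
        build_unload_sql(
            table="spectrum_schema.combined_table",
            s3_prefix="s3://my-bucket/combined/part_",
            iam_role="arn:aws:iam::123456789012:role/MyUnloadRole",
        )
    )
```

Note that, as far as I know, UNLOAD caps the size of each output file (about 6.2 GB), so with several hundred GB of data even PARALLEL OFF will likely produce a series of files under the prefix rather than literally one. Check the documentation linked above before relying on this.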

Sandeep Fatangare