I have about 10 huge parquet files (each about 60-100 GB), all with the same format and the same partitions. I want to combine all of them into one dataset. What is the best way to do that? I keep running into memory issues on AWS, so I would like to avoid reading all of the data in. Thanks!
2 Answers
Is the destination an S3 bucket? If so, Firehose is the way to combine the files.
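For what it's worth, Firehose only helps if the records are re-ingested through a delivery stream that buffers and converts them into parquet on S3; it does not merge existing parquet files in place. A rough sketch of the ingestion side, assuming a delivery stream named `combine-stream` already exists with record format conversion to parquet enabled (the stream name and records below are placeholders):

```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical records; in practice these would be rows read back out of the
# existing parquet files (e.g. in batches with pyarrow or Athena).
records = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

# put_record_batch accepts up to 500 records per call.
firehose.put_record_batch(
    DeliveryStreamName="combine-stream",  # assumed, pre-configured delivery stream
    Records=[{"Data": (json.dumps(r) + "\n").encode("utf-8")} for r in records],
)
```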

Arlo Guthrie
Yes, both the 10 parquet files and the destination are on S3. Is there a better way to do it in Glue? – zhifff Jan 16 '20 at 19:57
Run a Glue crawler over the files and create an external table in the Glue Catalog. You can then query the data from all 10 files as one table.
If you want to produce a single parquet file, use the Redshift UNLOAD
command. Refer to https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
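If the Glue Catalog table is exposed to Redshift through a Spectrum external schema, the UNLOAD can be issued from Python with the Redshift Data API. This is only a sketch; the cluster identifier, database, schema, table, output prefix, and IAM role below are placeholders:

```python
import boto3

# Hypothetical names: adjust cluster, database, schema/table, S3 prefix and IAM role.
client = boto3.client("redshift-data")

sql = """
UNLOAD ('SELECT * FROM spectrum_schema.my_table')
TO 's3://my-bucket/combined/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;
"""

# Submits the UNLOAD asynchronously; poll describe_statement() for completion.
resp = client.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=sql,
)
print(resp["Id"])
```

Note that UNLOAD writes multiple files in parallel by default; `PARALLEL OFF` forces sequential output but is capped at 6.2 GB per file, so for roughly 1 TB of input you will still end up with several output files.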

Sandeep Fatangare
`df.repartition(1).write.format("parquet").mode("append").save("temp.parquet")` Add more DPUs to handle the memory issue. – Sandeep Fatangare Jan 17 '20 at 05:33
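For context on that comment: `repartition(1)` shuffles every row into a single task, which is usually what triggers the memory pressure in the first place. A minimal PySpark sketch (bucket names, prefixes, and file counts are placeholders) that reads everything under one S3 prefix and rewrites it as a small number of larger files, without collecting anything onto the driver:

```python
# Hypothetical Glue/PySpark job: combine many parquet files under one S3 prefix
# into a handful of larger files. Paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("combine-parquet").getOrCreate()

# Spark reads the parquet files lazily and in parallel; nothing is loaded
# into a single process at this point.
df = spark.read.parquet("s3://my-bucket/input-prefix/")

# coalesce() merges existing partitions without a full shuffle, so each output
# task only has to hold its own slice of the data. Writing ~10 files instead of
# exactly 1 keeps per-task memory bounded.
df.coalesce(10).write.mode("overwrite").parquet("s3://my-bucket/combined-prefix/")
```

If a single output file is a hard requirement, `coalesce(1)` behaves like the one-liner in the comment and will hit the same memory limits; adding DPUs (or worker memory), as suggested above, is the usual workaround.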