
Let's say I have a CSV file with several hundred million records that I want to convert to a Parquet file using Python and Pandas, reading the CSV and writing the Parquet file. Because the file is too big to read into memory and write out as a single Parquet file, I decided to read the CSV in chunks of 5M records and create a Parquet file for each chunk (roughly as in the sketch below). Why would I want to merge all of those Parquet files into a single Parquet file?
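For concreteness, a minimal sketch of that chunked conversion (the input file name, the 5M chunk size, and the `part-*.parquet` naming are just placeholders, not details from the actual setup):

```python
import pandas as pd

CHUNK_SIZE = 5_000_000  # 5M records per chunk, as described above

# read_csv with chunksize yields one DataFrame per chunk instead of
# loading the whole CSV into memory at once
for i, chunk in enumerate(pd.read_csv("input.csv", chunksize=CHUNK_SIZE)):
    # each chunk is written out as its own Parquet file
    chunk.to_parquet(f"part-{i:05d}.parquet", index=False)
```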

Thanks in advance.

Danny.Icha
  • Does the answer to this question help: https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file? – fhorrobin Feb 14 '22 at 04:13

1 Answer


In general, this is the small files problem; for companies working with big data, file-count limits can become an issue if the problem is not consistently kept under control.

It's a problem worth solving because there is no read-performance benefit to splitting data into many small files: each Parquet file already consists of multiple row groups, which by itself ensures good parallelism during FileScan operations.

However, jobs tend to gravitate toward the small files problem because there is a write-performance benefit: building one very large Parquet file with many row groups before it is flushed to disk can be extremely memory intensive (costly both in provisioned resources and in duration). One way to merge the small files afterwards is sketched below.
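If the small files do need to be merged, one memory-bounded approach (a sketch only; the `part-*.parquet` and `combined.parquet` names are assumptions) is to stream each small file into a single `pyarrow.parquet.ParquetWriter`:

```python
import glob
import pyarrow.parquet as pq

files = sorted(glob.glob("part-*.parquet"))

# reuse the schema of the first chunk file for the combined output;
# this assumes all chunk files share the same schema
schema = pq.ParquetFile(files[0]).schema_arrow

with pq.ParquetWriter("combined.parquet", schema) as writer:
    for path in files:
        # each small file is read and appended as row group(s) of the
        # single output file, so only one chunk is in memory at a time
        writer.write_table(pq.read_table(path))
```

This produces one large Parquet file while never holding more than one chunk's worth of data in memory, which is the write-side trade-off described above.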

Tony Ng
  • What exactly do you mean by "gravitate toward the small files problem"? Is it better for (read) performance to have multiple smaller Parquet files or a few big ones? – apio Feb 20 '23 at 13:48