
I'm looking to stream CSV from an input file and write the corresponding rows into a Parquet file. A common tool for this is pyarrow, which supports doing it in batch: use open_csv, read everything into a pyarrow.Table, and write the resulting Table out as a Parquet file using pyarrow.parquet.write_table. However, this won't work for a large CSV file that can't fit in memory, even in Arrow's binary format.
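
For reference, the batch version looks roughly like this (a minimal sketch; the file paths are placeholders):

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Batch approach: materialize the whole CSV in memory as a pyarrow.Table,
# then write it out as a single Parquet file. Paths are placeholders.
table = pv.open_csv("input.csv").read_all()
pq.write_table(table, "output.parquet")
```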

open_csv gives us a CSVStreamingReader, which is great, but there is no ParquetStreamingWriter that takes chunks of RecordBatches. There is a RecordBatchFileWriter that can stream RecordBatches into a binary Arrow file, but I'm looking for Parquet.
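
For comparison, this is roughly what that Arrow IPC route looks like; it streams fine, but the output is an Arrow file, not Parquet (a minimal sketch; paths are placeholders):

```python
import pyarrow as pa
import pyarrow.csv as pv

# Stream RecordBatches from the CSV into an Arrow IPC file.
# pa.ipc.new_file returns a RecordBatchFileWriter; paths are placeholders.
reader = pv.open_csv("input.csv")
with pa.ipc.new_file("output.arrow", reader.schema) as writer:
    for batch in reader:
        writer.write_batch(batch)
```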

I'm open to using other libraries if pyarrow doesn't offer this functionality.

OneRaynyDay
    I think you are looking for [pyarrow.parquet.ParquetWriter](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html) which (despite not having "streaming" in the name) supports incremental output. For a complete example take a look at https://stackoverflow.com/questions/68555085/how-can-i-chunk-through-a-csv-using-arrow/68563617#68563617 which shows how to open a large CSV file and chunk it into a parquet file. – Pace Aug 18 '21 at 20:43
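
    Based on that pointer, a minimal sketch of the streaming CSV-to-Parquet loop might look like this (assumes pyarrow ≥ 4.0 for ParquetWriter.write_batch; paths are placeholders):

    ```python
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # Stream the CSV in chunks and append each RecordBatch to the Parquet
    # file, so only one chunk is held in memory at a time. Paths are placeholders.
    reader = pv.open_csv("input.csv")
    writer = pq.ParquetWriter("output.parquet", reader.schema)
    try:
        for batch in reader:
            writer.write_batch(batch)  # requires pyarrow >= 4.0
    finally:
        writer.close()
    ```

    Note that each write call typically produces its own row group, so very small CSV read blocks can result in many small row groups in the output file.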
