I'm looking to stream CSV from an input file and write the corresponding rows into a Parquet file. A common tool for this is pyarrow: you can do it in batch using `open_csv`, reading everything into a `pyarrow.Table`, and writing the resulting table out as a Parquet file with `pyarrow.parquet.write_table`. However, this won't work for a large CSV file that can't be held in memory, even in Arrow's binary format.
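
For reference, this is roughly the in-memory version I mean; a minimal sketch, with the file names just placeholders:

```python
import pyarrow.csv as csv
import pyarrow.parquet as pq

# Read the whole CSV into an in-memory pyarrow.Table, then write it out once.
# Fine for small files, but memory blows up on large ones.
reader = csv.open_csv("input.csv")        # CSVStreamingReader
table = reader.read_all()                 # materializes the full Table in memory
pq.write_table(table, "output.parquet")
```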
`open_csv` gives us a `CSVStreamingReader`, which is great, but I can't find a `ParquetStreamingWriter` counterpart that accepts `RecordBatch` chunks. There is a `RecordBatchFileWriter` that can stream `RecordBatch`es into a binary Arrow file, but I'm looking for Parquet output.
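
For context, here's roughly what the streaming side looks like today with the Arrow IPC writer; a minimal sketch (file names are placeholders), which produces an Arrow file rather than Parquet:

```python
import pyarrow.csv as csv
import pyarrow.ipc as ipc

# Stream the CSV in chunks: the reader yields one RecordBatch at a time.
reader = csv.open_csv("input.csv")

# Streaming writes work for the Arrow IPC format via RecordBatchFileWriter,
# but the result is an .arrow file, not a Parquet file.
with ipc.new_file("output.arrow", reader.schema) as writer:
    for batch in reader:
        writer.write_batch(batch)
```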
I'm open to using other libraries if pyarrow doesn't have functionality like this.