
I have large CSV files that I'd ultimately like to convert to parquet. Pandas won't help because of memory constraints and its difficulty handling NULL values (which are common in my data). I checked the PyArrow docs and there are tools for reading parquet files, but I didn't see anything about reading CSVs. Did I miss something, or is this feature somehow incompatible with PyArrow?

dudemonkey

2 Answers


We're working on this feature; there is a pull request up now: https://github.com/apache/arrow/pull/2576. You can help by testing it out!
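
A minimal sketch of how this reads once the feature is available (assuming the pyarrow.csv module that shipped in later PyArrow releases): the whole CSV is read into an Arrow Table and written back out as Parquet without going through pandas. File names are placeholders.

import pyarrow.csv
import pyarrow.parquet

# Read the CSV into a pyarrow.Table, then write it out as a Parquet file.
table = pyarrow.csv.read_csv("data.csv")
pyarrow.parquet.write_table(table, "data.parquet")

Note that this loads the entire file into memory at once; for files that don't fit, the chunked approach in the other answer still applies.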

Wes McKinney

You can read the CSV in chunks with pd.read_csv(chunksize=...), then write one chunk at a time with PyArrow.

The one caveat is that, as you mentioned, pandas will infer inconsistent dtypes if a column is all nulls in one chunk, so you have to make sure the chunk size is larger than the longest run of nulls in your data (a possible workaround is sketched below).
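
A sketch of one workaround, not from the original answer and with hypothetical column names and types: if you know the schema up front, pass an explicit Arrow schema to Table.from_pandas and the ParquetWriter so every chunk is converted to the same types regardless of what pandas infers per chunk.

import sys

import pandas as pd
import pyarrow
import pyarrow.parquet

# Hypothetical schema for illustration; replace with your actual columns.
SCHEMA = pyarrow.schema([
    ("id", pyarrow.int64()),
    ("name", pyarrow.string()),
    ("score", pyarrow.float64()),  # stays float64 even if a chunk is all nulls
])

writer = pyarrow.parquet.ParquetWriter(sys.stdout.buffer, SCHEMA, compression='gzip')
for chunk in pd.read_csv(sys.stdin.buffer, chunksize=2 ** 16):
    # Casting every chunk to the fixed schema avoids per-chunk dtype drift.
    table = pyarrow.Table.from_pandas(chunk, schema=SCHEMA, preserve_index=False)
    writer.write_table(table)
writer.close()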

The script below reads CSV from stdin and writes Parquet to stdout (Python 3).

#!/usr/bin/env python
import sys

import pandas as pd
import pyarrow
import pyarrow.parquet

# This has to be big enough that you never get a chunk of all nulls:
# https://issues.apache.org/jira/browse/ARROW-2659
SPLIT_ROWS = 2 ** 16

def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):
        table = pyarrow.Table.from_pandas(split, preserve_index=False)
        if writer is None:
            # Timestamps have issues if you don't coerce them to ms:
            # https://github.com/dask/fastparquet/issues/82
            writer = pyarrow.parquet.ParquetWriter(
                sys.stdout.buffer, table.schema,
                coerce_timestamps='ms', compression='gzip')
        writer.write_table(table)
    if writer is not None:  # guard against empty input
        writer.close()

if __name__ == "__main__":
    main()
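
To run it (assuming the script above is saved as csv2parquet.py):

python csv2parquet.py < input.csv > output.parquet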
Doctor J
  • There is a safe approach to converting a CSV to Parquet in chunks, without the risk of schema errors caused by inconsistent dtypes across chunks; it was posted in [this topic](https://stackoverflow.com/a/74871381) – the_RR Dec 23 '22 at 12:49