22

For example, pandas's read_csv has a chunksize argument, which makes read_csv return an iterator over the CSV file so it can be read in chunks.
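
For reference, the read_csv pattern I have in mind looks roughly like this (the file name and chunk size are just placeholders):

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    print(len(chunk))  # placeholder for real per-chunk processing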

The Parquet format stores data in chunks, but there isn't a documented way to read it in chunks the way read_csv does.

Is there a way to read parquet files in chunks?

xiaodai

5 Answers

23

You can use iter_batches from pyarrow. The to_pandas method on each batch should give you a pandas DataFrame.

Example:

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')

for batch in parquet_file.iter_batches():
    print("RecordBatch")
    batch_df = batch.to_pandas()
    print("batch_df:", batch_df)
Michał Słapek
9

If your parquet file was not created with row groups, the read_row_group method doesn't seem to work (there is only one group!).

However, if your Parquet data is partitioned as a directory of Parquet files, you can use the fastparquet engine, which only works on individual files, to read each file and then either concatenate the resulting DataFrames in pandas or get the values and concatenate the ndarrays:

import pandas as pd
from glob import glob

files = sorted(glob('dat.parquet/part*'))

# Read the first partition file, then append each remaining file to it.
data = pd.read_parquet(files[0], engine='fastparquet')
for f in files[1:]:
    data = pd.concat([data, pd.read_parquet(f, engine='fastparquet')])
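
The ndarray variant mentioned above could look roughly like this (same hypothetical 'dat.parquet/part*' layout, and it assumes the columns are numeric so the values share a dtype):

import numpy as np
import pandas as pd
from glob import glob

files = sorted(glob('dat.parquet/part*'))

# Pull out the underlying ndarrays and concatenate them in one call.
arrays = [pd.read_parquet(f, engine='fastparquet').values for f in files]
combined = np.concatenate(arrays)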
Koedlt
lee
    The only problem with this method is that if one modifies parquet with pandas, it is no longer readable with pyspark. I have tried creating new columns or modifying values in the existing columns - both experiments failed with "checksum error" – Sokolokki Feb 01 '22 at 20:22
4

I'm not sure if one can do it directly from pandas, but pyarrow exposes read_row_group. The resulting Table should be convertible to a pandas DataFrame with to_pandas.

As of pyarrow 3.0 there is now an iter_batches method that can be used.
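
A minimal sketch of the read_row_group approach, assuming the file was written with multiple row groups (the file name is a placeholder):

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')

# Read one row group at a time and convert it to a pandas DataFrame.
for i in range(parquet_file.num_row_groups):
    df = parquet_file.read_row_group(i).to_pandas()
    # ... process df ...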

Micah Kornfield
  • read_row_group only guarantees that it reads a single row group from a Parquet file, not a single row – WY Hsu Dec 28 '19 at 10:20
2

This is an old question, but the following worked for me if you want to read all chunks in a one-liner without using concat:

pd.read_parquet("chunks_*", engine="fastparquet")

or if you want to read specific chunks you can try:

pd.read_parquet("chunks_[1-2]*", engine="fastparquet")

(this way you will read only the first two chunks; it is also not necessary to specify an engine)
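
A self-contained version of the one-liners above, under the same assumption that the glob pattern matches chunk files named chunks_1.parquet, chunks_2.parquet, ... in the working directory:

import pandas as pd

# As suggested above, the glob pattern selects every matching chunk file.
df_all = pd.read_parquet("chunks_*", engine="fastparquet")

# Only the chunks whose names start with chunks_1 or chunks_2.
df_first_two = pd.read_parquet("chunks_[1-2]*", engine="fastparquet")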

George Farah
  • does this only work if the files are physically partitioned? – xiaodai Aug 06 '21 at 05:46
  • what do you mean by physically partitioned? If you refer to some partitions that are made by Dask for example, then yes it works. And if this method did not work for you, you could try: pd.read_parquet("your_parquet_path/") or pd.read_parquet("your_parquet_path/*") and it should work, it depends on which pandas version you have. – George Farah Aug 07 '21 at 14:23
0

You can't use a generator/iterator over a parquet file because it is a compressed file. You need to fully decompress it first.

azizbro
  • No, you can partly decompress it because compressed data is stored in streaming order. pyarrow supports this with iter_batches() and the amount of memory allocated is consistent with partial decompression. – hdante Apr 24 '23 at 15:48
  • Correction: I think I'm wrong and pyarrow actually decompresses the whole row group before returning a slice with iter_batches(). – hdante Apr 24 '23 at 16:01