
I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.

Am I wrong to attempt to do this without using Spark?

I have tried pyarrow and fastparquet, but I get memory errors when trying to read the entire file. Any tips or suggestions would be greatly appreciated!

Sjoseph

1 Answer


Spark is certainly a viable choice for this task.

We're planning to add streaming read logic in pyarrow this year (2019, see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory-use issues. You can do this with `pyarrow.parquet.ParquetFile` and its `read_row_group` method.
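
For reference, here is a minimal sketch of that approach; the file path and the every-10th-row down-sampling step are placeholders, not something specified in the answer:

```python
import pandas as pd
import pyarrow.parquet as pq

# Opening the file only reads the metadata, not the 6 GB of data.
pf = pq.ParquetFile('data.parquet')  # placeholder path
print(pf.num_row_groups)

frames = []
for i in range(pf.num_row_groups):
    # Each row group comes back as a pyarrow.Table small enough to hold in RAM
    # (provided the file was written with reasonably sized row groups).
    chunk = pf.read_row_group(i).to_pandas()
    # Down-sample before keeping the chunk; taking every 10th row is just an example.
    frames.append(chunk.iloc[::10])

df = pd.concat(frames, ignore_index=True)
```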

Wes McKinney
  • Thank you for the tips! I have queried the file using `num_row_groups` and my file has only one row group. I assume this means I won't have anything to gain by using `read_row_group`? – Sjoseph Jan 02 '19 at 17:03
  • No, you will only gain something from this when you write your Parquet files with multiple row groups. When using `pyarrow` to write them, set the `chunk_size` argument to the number of rows that fit nicely into RAM. But beware that the smaller you set this argument, the slower reading gets. You're probably best off setting `chunk_size=len(table) / 60` so that you get 100 MiB chunks (a sketch of the writing side follows these comments). – Uwe L. Korn Jan 02 '19 at 17:31
  • Thank you for the suggestion, but I do not have control of the Parquet file format. I assume my only option is to get set up with pyspark/spark? – Sjoseph Jan 02 '19 at 17:42
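
For completeness, a rough sketch of the writing side Uwe describes, assuming you (or whoever produces the file) can change the writer. In `pyarrow.parquet.write_table` the argument is exposed as `row_group_size`; the data below is a stand-in:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'x': range(1_000_000)})  # stand-in for the real data
table = pa.Table.from_pandas(df)

# Aim for roughly 60 row groups so each one can be read back on its own;
# row_group_size is the number of rows written per row group.
pq.write_table(table, 'chunked.parquet', row_group_size=len(table) // 60)
```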