
I want to save many large Pandas DataFrames, which will not fit into memory at once, into a single Parquet file. We would like to have one big Parquet file on disk so that we can quickly grab just the columns we need from that single file.

Specifically, we have ~200 small Parquet files, each with ~100 columns (genes) and 2 million rows (cells). Each Parquet file on disk is pretty small at ~40MB, about ~8GB in total for all ~200 files. The data is very sparse (>90% of values are zeros), and Parquet does a good job of compressing it to a small size on disk.

Since the dataset is sparse, we can use Pandas/SciPy sparse arrays to load all ~25,000 genes (columns) and 2 million rows (cells) into a single sparse data structure. However, we cannot write a SparseDataFrame directly to Parquet (see GitHub issue https://github.com/pandas-dev/pandas/issues/26378), and converting the entire matrix to dense would make us run out of memory (e.g. a dense array of just 2,000 columns/genes and 2 million rows/cells takes up 30GB of RAM). This prevents us from producing the single large Parquet file we want.
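For reference, this is roughly how we build the single sparse structure from the individual files (the file pattern is illustrative, and the SciPy-based `hstack` is just my approximation of the step described above):

```python
import glob

import pandas as pd
from scipy import sparse

blocks = []
columns = []
for path in sorted(glob.glob("genes_*.parquet")):    # hypothetical file naming
    df = pd.read_parquet(path)                       # ~2M rows x ~100 gene columns
    blocks.append(sparse.csr_matrix(df.to_numpy()))  # keep only the non-zero values
    columns.extend(df.columns)

# ~2M rows x ~25,000 columns overall, but only the <10% non-zero entries are stored
matrix = sparse.hstack(blocks, format="csr")
```

Each ~2M x ~100 block is dense only briefly (~1.6GB as float64) before it is converted to sparse and released.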

This presentation by Peter Hoffmann (https://youtu.be/fcPzcooWrIY?t=987, at 16min 20s) mentions that you can stream data into a Parquet file (keeping track of the metadata as you go) without holding all the data in memory. Is it possible to stream columns/rows into a Parquet file? I could not find an example of this using Pandas. Do PyArrow or FastParquet support this?
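Here is a rough sketch of the kind of streaming write I have in mind, using pyarrow.parquet.ParquetWriter to append one row group per chunk (the chunk generator is a placeholder, and I am assuming every chunk shares the same schema). I am not sure whether this also covers appending column-wise:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def iter_chunks():
    """Yield pandas DataFrames one at a time (placeholder for the real loader)."""
    ...

writer = None
for df in iter_chunks():
    table = pa.Table.from_pandas(df, preserve_index=False)
    if writer is None:
        # Take the schema from the first chunk; every later chunk must match it.
        writer = pq.ParquetWriter("combined.parquet", table.schema)
    writer.write_table(table)   # appends one row group, then the chunk can be dropped
if writer is not None:
    writer.close()              # writes the footer/metadata for the whole file
```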

Nick Fernandez
  • This question raises a similar issue: https://stackoverflow.com/questions/54008975/streaming-parquet-file-python-and-only-downsampling, but they appear to want to stream data out of a Parquet file. For our purposes, streaming data out of a large Parquet file can simply be done by reading the selected columns. – Nick Fernandez May 31 '19 at 14:28

0 Answers