We want to save many large Pandas DataFrames, which will not fit into memory at once, into a single Parquet file. Having one big Parquet file on disk would let us quickly grab just the columns we need from that single file.
Specifically, we have ~200 small Parquet files, each with ~100 columns (genes) and 2 million rows (cells). Each file is pretty small on disk at ~40 MB, about 8 GB total across all ~200 files. The data is very sparse (>90% of values are zeros), and Parquet compresses it well on disk.
Since the dataset is sparse, we can use Pandas/SciPy sparse arrays to load all ~25,000 genes (columns) and 2 million rows (cells) into a single sparse data structure. However, we cannot write a sparse DataFrame directly to Parquet (see GitHub issue https://github.com/pandas-dev/pandas/issues/26378), and converting the entire matrix to dense would run us out of memory (e.g., a dense array of just 2,000 columns/genes and 2 million rows/cells already takes up 30 GB of RAM). This prevents us from producing the single large Parquet file we want.
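To make the current approach concrete, here is a minimal sketch of what we do today (the directory name, file layout, and dtype are made up for illustration):

```python
import glob

import pandas as pd

# Load each small Parquet file and convert it to a sparse representation,
# so the full ~25,000-gene matrix fits in memory.
frames = []
for path in sorted(glob.glob("per_gene_chunks/*.parquet")):  # ~200 small files
    df = pd.read_parquet(path)                               # ~100 genes x 2M cells
    frames.append(df.astype(pd.SparseDtype("float32", 0.0)))

sparse_df = pd.concat(frames, axis=1)  # one frame with ~25,000 sparse columns

# Neither of these gets us to the single Parquet file we want:
# sparse_df.to_parquet("all_genes.parquet")                   # fails on sparse dtypes (GH 26378)
# sparse_df.sparse.to_dense().to_parquet("all_genes.parquet") # densifying exhausts RAM
```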
This presentation from Peter Hoffmann (https://youtu.be/fcPzcooWrIY?t=987, around 16 min 20 s) mentions that you can stream data into a Parquet file without keeping all of it in memory, with the writer keeping track of the metadata as it goes. Is it possible to stream columns/rows into a Parquet file like this? I could not find an example of this using Pandas. Do PyArrow or FastParquet support it?
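For row-wise streaming, the closest thing I have found is `pyarrow.parquet.ParquetWriter`, which appears to let you append tables to an open file one at a time. Below is a rough sketch of the kind of loop I imagine (the paths are made up, and it assumes every chunk shares the same schema):

```python
import glob

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical chunks that are split by rows rather than by columns.
paths = sorted(glob.glob("row_chunks/*.parquet"))
schema = pa.Table.from_pandas(pd.read_parquet(paths[0])).schema

with pq.ParquetWriter("all_genes.parquet", schema) as writer:
    for path in paths:
        chunk = pd.read_parquet(path)
        # Each write_table call appends a new row group; only one chunk
        # is ever held in memory at a time.
        writer.write_table(pa.Table.from_pandas(chunk, schema=schema))
```

As far as I can tell, this only appends row groups, whereas my ~200 files hold different columns for the same 2 million cells, so I would first have to re-chunk the data by rows. Is there an equivalent way to stream columns into a single Parquet file, or is re-chunking by rows the only option?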