Dask dataframe concatenate and repartitions large files for time series and correlation

Question

I have 11 years of data with one record (row) every second, over about 100 columns. It is indexed by a datetime series (created with Pandas to_datetime()). We need to run correlation analyses between the columns, which can work with just 2 columns loaded at a time. We may resample to a lower time cadence (e.g. 48 s, 1 hour, months, etc.) over up to 11 years and visualize those correlations over the full 11 years.

The data are currently in 11 separate parquet files (one per year), individually generated with Pandas from 11 .txt files. Pandas did not partition any of those files. In memory, each of these parquet files takes up to about 20 GB. The intended target machine will only have 16 GB of memory. Loading even just 1 column over the 11 years takes about 10 GB, so 2 columns will not fit either.

Is there a more effective solution than working with Pandas, for working on the correlation analysis over 2 columns at a time? For example, using Dask to (i) concatenate them, and (ii) repartition to some number of partitions so Dask can work with 2 columns at a time without blowing up the memory?

I tried the latter solution following this post, and did:

import dask.dataframe as dd

# Read all 11 parquet files in the source directory (one partition per file by default)
df = dd.read_parquet("/blah/parquet/", engine='pyarrow')
# Repartition and export to 20 `.parquet` files
df.repartition(npartitions=20).to_parquet("/mnt/data2/SDO/AIA/parquet/combined")

but at the 2nd step, Dask blew up my memory and I got a kernel shutdown. As Dask is largely about working with larger-than-memory data, I am surprised this memory escalation happened.

----------------- UPDATE 1 ROW GROUPS---------------

I reprocessed the parquet files with Pandas to create about 20 row groups (it had defaulted to just 1 group per file). Now, regardless of whether split_row_groups is set to True or False, I am not able to resample with Dask (e.g. myseries = myseries.resample('48s').mean()). I have to call compute() on the Dask series first to get it as a Pandas Series, which seems to defeat the purpose of working with the row groups within Dask.

When doing that resampling, I get instead:

ValueError: Can only resample dataframes with known divisions See https://docs.dask.org/en/latest/dataframe-design.html#partitions for more information.

I did not have that problem when I used the default Pandas behavior to write the parquet files with just 1 row group.

score 1 · Answer 1 · answered Jun 20 '22 at 18:23

dask.dataframe by default is structured a bit more toward reading smaller "hive" parquet files rather than chunking individual huge parquet files into manageable pieces. From the dask.dataframe docs:

By default, Dask will load each parquet file individually as a partition in the Dask dataframe. This is performant provided all files are of reasonable size.

We recommend aiming for 10-250 MiB in-memory size per file once loaded into pandas. Too large files can lead to excessive memory usage on a single worker, while too small files can lead to poor performance as the overhead of Dask dominates. If you need to read a parquet dataset composed of large files, you can pass split_row_groups=True to have Dask partition your data by row group instead of by file. Note that this approach will not scale as well as split_row_groups=False without a global _metadata file, because the footer will need to be loaded from every file in the dataset.

I'd try a few strategies here:

1. Only read in the columns you need. Since your files are so huge, you don't want dask even trying to load the first chunk to infer structure. You can pass the columns keyword to dd.read_parquet, and it will be passed through to the various stages of the parsing engines. In this case: dd.read_parquet(filepath, columns=list_of_columns).
2. If your parquet files have multiple row groups, you can make use of the dd.read_parquet argument split_row_groups=True. This will create one partition per row group, each smaller than the full file.
3. If (2) works, you may be able to avoid repartitioning. If you do need to repartition, use a multiple of your original number of partitions (22, 33, etc). When reading data from a file, dask doesn't know how large each partition is, and if you ask for a larger number that is not a whole multiple of the current number of partitions, the partitioning behavior isn't very well defined. On some small tests I've run, repartitioning 11 --> 20 will leave the first 10 partitions as-is and split the last one into the remaining 10!
4. If your file is on disk, you may be able to read the file as a memory map to avoid loading the data prior to repartitioning. You can do this by passing memory_map=True to dd.read_parquet.

I'm sure you're not the only one with this problem. Please let us know how this goes and report back what works!

I had tried to use split_row_groups=True, but it made things worse. It blew up the memory again, and that did not make sense to me. I haven't explored all the Dask documentation, and I haven't come to a part explaining how to make row groups before creating the parquet files. — Wall-E, Jun 21 '22 at 01:44
I don't see the `memory_map` option in the documentation. Any idea where it is documented? — Wall-E, Jun 21 '22 at 01:48
Turned out that Pandas had created just 1 row group, it was the default behaviour. — Wall-E, Jun 21 '22 at 04:30
memory_map is an argument on [`pyarrow.parquet.ParquetDataset`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset) - keyword arguments are passed through to the engine you use. — Michael Delgado, Jun 21 '22 at 05:10
