I have 11 years of data with one record (row) every second and about 100 columns. It is indexed by a datetime series (created with Pandas to_datetime()).
We need to run correlation analyses between the columns, which can work with just 2 columns loaded at a time. We may resample to a lower time cadence (e.g. 48 s, 1 hour, months, etc.) over up to 11 years and visualize those correlations over the full 11 years.
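To make the goal concrete, here is a minimal sketch of the analysis on a small synthetic sample that fits in memory. The column names "a" and "b" and the generated data are hypothetical stand-ins for two of the real columns; only the 48 s cadence comes from the description above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two columns: one hour of data at 1-second cadence.
# The names "a" and "b" are hypothetical, not the real column names.
rng = np.random.default_rng(0)
index = pd.date_range("2010-01-01", periods=3600, freq=pd.Timedelta(seconds=1))
base = rng.normal(size=3600)
df = pd.DataFrame(
    {"a": base, "b": base + rng.normal(scale=0.5, size=3600)},
    index=index,
)

# Resample to a 48-second cadence (mean of each bin), then compute the
# Pearson correlation between the two resampled columns.
resampled = df.resample("48s").mean()
corr = resampled["a"].corr(resampled["b"])
print(len(resampled), corr)
```

This works in plain Pandas only because the sample is tiny; the difficulty is doing the same over the full 11 years.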
The data are currently in 11 separate parquet files (one per year), individually generated with Pandas from 11 .txt files. Pandas did not partition any of those files. In memory, each of these parquet files takes up to about 20 GB. The intended target machine will only have 16 GB of RAM. Loading even just 1 column over the 11 years takes about 10 GB, so 2 columns will not fit either.
Is there a more effective solution than Pandas for running the correlation analysis over 2 columns at a time? For example, using Dask to (i) concatenate the files, and (ii) repartition them to some number of partitions so Dask can work with 2 columns at a time without blowing up the memory?
I tried the latter solution following this post, and did:
import dask.dataframe as dd

# Read all 11 parquet files in the input directory (lazy; nothing is loaded yet)
df = dd.read_parquet("/blah/parquet/", engine='pyarrow')
# Repartition and export to 20 `.parquet` files
df.repartition(npartitions=20).to_parquet("/mnt/data2/SDO/AIA/parquet/combined")
but at the 2nd step, Dask blew up my memory and I got a kernel shutdown. As Dask is largely about working with larger-than-memory data, I am surprised this memory escalation happened.
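For reference, this is the kind of chunk-by-chunk computation I was hoping a larger-than-memory tool would do for me. It is only a sketch in plain NumPy, not Dask: the class name RunningCorrelation and the synthetic chunks are hypothetical, and the plain running sums can lose precision for data with a large mean, so it is not meant as a production implementation.

```python
import numpy as np


class RunningCorrelation:
    """Accumulate the sums needed for a Pearson correlation, chunk by chunk."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        # Only the current chunk has to be in memory at this point.
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        self.n += x.size
        self.sx += x.sum()
        self.sy += y.sum()
        self.sxx += (x * x).sum()
        self.syy += (y * y).sum()
        self.sxy += (x * y).sum()

    def result(self):
        # Pearson r from the accumulated sums.
        cov = self.sxy - self.sx * self.sy / self.n
        var_x = self.sxx - self.sx ** 2 / self.n
        var_y = self.syy - self.sy ** 2 / self.n
        return cov / np.sqrt(var_x * var_y)


# Synthetic data split into 10 chunks, standing in for row groups or files.
rng = np.random.default_rng(1)
x_all = rng.normal(size=10_000)
y_all = 0.7 * x_all + rng.normal(size=10_000)

rc = RunningCorrelation()
for x_chunk, y_chunk in zip(np.array_split(x_all, 10), np.array_split(y_all, 10)):
    rc.update(x_chunk, y_chunk)

print(rc.result())
```

In the real case each chunk would be two columns read from one row group or one yearly file, so peak memory would be bounded by the chunk size rather than by the 11 years.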
----------------- UPDATE 1: ROW GROUPS -----------------
I reprocessed the parquet files with Pandas to create about 20 row groups (it had defaulted to just 1 group per file). Now, regardless of whether split_row_groups is set to True or False, I am not able to resample with Dask (e.g. myseries = myseries.resample('48s').mean()). I have to call compute() on the Dask series first to get it as a Pandas Series, which seems to defeat the purpose of working with the row groups within Dask.
When I attempt the resampling directly in Dask, I get this error instead:
ValueError: Can only resample dataframes with known divisions See https://docs.dask.org/en/latest/dataframe-design.html#partitions for more information.
I did not have that problem when I used the default Pandas behavior to write the parquet files with just 1 row group.