I'm working with a colleague on the same columnar data; each of us has a local copy. We're seeing a difference in Dask's behavior regarding how divisions are determined in Dask DataFrames loaded from Parquet files:
I work on the data on a local machine; he works on his copy on an AWS instance. He gets an error about unknown divisions that I don't get when I run the same code on my local machine.
The data are 11 Parquet files, with about 10-20 million rows in each file. Each of the 11 files came from a .txt-to-Parquet conversion with Pandas, using the default to_parquet() settings, which produced a single row group per file.
Each of us loads the data with the same Dask code:
import dask.dataframe as dd
import numpy as np

df = dd.read_parquet("mydir", columns=['NSPIKES', 'WAVELNTH'])
The dataframe has a DatetimeIndex. There are only the 11 Parquet files in mydir, nothing else.
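As a first diagnostic, both of us can compare library versions and check whether the loaded frame actually has known divisions (differing Dask versions between the two machines seem like a plausible cause). A minimal check:

import dask
import dask.dataframe as dd

print(dask.__version__)    # compare versions between the two machines

df = dd.read_parquet("mydir", columns=['NSPIKES', 'WAVELNTH'])
print(df.known_divisions)  # True only if the index min/max of each partition was recovered
print(df.divisions)        # partition boundaries, or a tuple of Nones when unknown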
Then we select some specific rows, about a million in total, which we need to resample:
# keep only the rows with WAVELNTH == 171 and downcast the counts
nspikes = df[df['WAVELNTH'] == 171]['NSPIKES'].astype(np.uint32)
# resample onto a regular 48-second grid
nspikes = nspikes.resample('48s').mean()
This works completely fine for me locally, but my colleague gets this error on his AWS instance:
ValueError: Can only resample dataframes with known divisions. See "Internal Design" in the Dask documentation for more information.
We also tried an explicit concatenation, using a sorted list of paths to the Parquet files, but we get the same error:
from pathlib import Path

# sorted() already returns a list, so no extra list() call is needed
files = sorted(Path('mydir').glob('*.parquet'))
ddfs = [dd.read_parquet(f, columns=['NSPIKES', 'WAVELNTH']) for f in files]
df = dd.concat(ddfs, axis=0)
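Based on the error, Dask apparently needs the min/max index statistics of each file to build divisions. One thing we plan to try is asking read_parquet to compute divisions from the Parquet metadata explicitly. A sketch, assuming a recent Dask release where the parameter is named calculate_divisions (older releases exposed a similar gather_statistics flag instead):

# Derive divisions from the Parquet min/max index statistics;
# this reads only file metadata, not the data itself.
df = dd.read_parquet(
    "mydir",
    columns=['NSPIKES', 'WAVELNTH'],
    calculate_divisions=True,  # gather_statistics=True in older Dask releases
)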
What could cause this discrepancy in Dask's behavior (working locally, failing on the remote AWS instance), and how can we work around it without loading the full dataset into memory? My colleague's instance has very little memory.
[UPDATE 1] The problem is the same when working on just one Parquet file instead of all 11.
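If computing divisions from the metadata doesn't help, another workaround we're considering is re-declaring the already-sorted index. A sketch: with sorted=True, set_index only scans each partition's boundary values rather than shuffling or fully loading the data; the column name 'index' below is what reset_index produces for an unnamed index, and would differ if our index has a name.

# Re-declare the sorted datetime index so Dask can build divisions.
df = df.reset_index()                    # index becomes a column ('index' if unnamed)
df = df.set_index('index', sorted=True)  # no shuffle; only partition boundaries are read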