
I'm working with a colleague on the same columnar data; each of us has a local copy of it. We are seeing a difference in Dask's behavior regarding how divisions are handled in Dask DataFrames loaded from Parquet files:

I work on the data on a local machine; he works on his copy of the data in an AWS instance. He's getting an error related to unknown divisions that I'm not getting when I run the same code on my local machine.

The data are 11 Parquet files, with about 10-20 million rows in each file. Each of the 11 files came from a .txt-to-Parquet conversion with Pandas. The default Pandas to_parquet() was used, and it produced only a single row group per file.
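For reference, the conversion was roughly along these lines (file names, separator and the date column name are illustrative, not the exact script):

import pandas as pd

# illustrative only: the actual read_csv options and column names differ
df = pd.read_csv("year_2015.txt", sep="\t", parse_dates=["DATE"])
df = df.set_index("DATE").sort_index()
df.to_parquet("year_2015.parquet")  # default settings; ends up as a single row group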

Each of us loads the data with the same Dask code:

import dask.dataframe as dd

df = dd.read_parquet("mydir", columns=['NSPIKES', 'WAVELNTH'])

The dataframe has a DatetimeIndex. There are only the 11 Parquet files in mydir, nothing else. We then select some specific rows, about a million in total, which we need to resample:

import numpy as np

nspikes = df[df['WAVELNTH'] == 171]['NSPIKES'].astype(np.uint32)

nspikes = nspikes.resample('48s').mean()

This works completely fine for me locally, but my colleague gets this error on his AWS instance:

ValueError: Can only resample dataframes with known divisions See Internal Design — Dask documentation for more information.
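The error suggests the divisions are unknown on his side. They can be inspected on both machines with, for example:

print(df.known_divisions)  # True when Dask knows the partition boundaries
print(df.npartitions)
print(df.divisions[:5])    # a tuple of Nones when divisions are unknown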

We also tried an explicit concatenation, using a sorted list of paths to the Parquet files, but got the same error:

from pathlib import Path

files = sorted(Path('mydir').glob('*.parquet'))
ddfs = [dd.read_parquet(f, columns=['NSPIKES', 'WAVELNTH']) for f in files]
df = dd.concat(ddfs, axis=0)

What could cause this discrepancy in Dask's behavior (working locally, failing on the remote AWS instance), and how can we work around it without a full load of the data? My colleague has very little memory on his instance.

[UPDATE 1] The problem is the same when working on just 1 Parquet file instead of the 11 files.

Wall-E
  • Is it safe to assume that the environments are identical on both local and AWS instances? Also were the files generated on one machine and uploaded to AWS? (or did you and your colleague generate parquet files separately?) – SultanOrazbayev Jun 23 '22 at 16:50
  • yeah I'd check `dask.__version__`. New features are added to dask all the time. If you saw differences in the results that could be concerning, but dask working on a newer version when it used to break for a tough problem like this is just called "progress" :) – Michael Delgado Jun 23 '22 at 16:52
  • 1
    @SultanOrazbayev I generated the Parquet files, then gave it to him. He uploaded them to his AWS instance. The environments are similar, same OS, but I run it on a local desktop machine, and he runs it in an AWS instance. @MichaelDelgado Dask versions are identical: `2022.01.1` – Wall-E Jun 23 '22 at 20:00

1 Answer


Pandas to_parquet() was used and only a single row group was made with it for each file.

Assuming that the environments are identical on the local and AWS machines, one possible source of the error is a missing metadata file on AWS (perhaps only the Parquet data files were uploaded?). This blog post gives examples of saving Parquet files, so one option is to test whether you get similar behavior on the local machine when writing with the kwarg write_metadata_file=False. If so, then uploading the metadata file to AWS might resolve the issue.
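A minimal sketch of that test (directory names are illustrative; write_metadata_file is a keyword of Dask's to_parquet):

import dask.dataframe as dd

df = dd.read_parquet("mydir", columns=['NSPIKES', 'WAVELNTH'])

# write a copy without the _metadata file, to mimic what may be on AWS
df.to_parquet("mydir_no_metadata", write_metadata_file=False)

# if the missing _metadata file is the cause, reading this copy back
# should reproduce the unknown-divisions behaviour locally
df2 = dd.read_parquet("mydir_no_metadata", columns=['NSPIKES', 'WAVELNTH'])
print(df2.known_divisions)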

SultanOrazbayev
  • There was no separate metadata file generated by Pandas when it created the Parquet files. It created only one single file per year. – Wall-E Jun 24 '22 at 11:08
  • I see, if it's aligned with your workflow, then it might be possible to run `ddf.to_parquet` on a local machine, so use dask to concat individual parquet files, then ship off the files + metadata to AWS... though this is a hack and doesn't resolve the core of the problem (but also hard to debug without reproducible example). – SultanOrazbayev Jun 24 '22 at 11:45
  • Yes, I tried to do that, ran into memory issues... I posted about it too: https://stackoverflow.com/questions/72690155/dask-dataframe-concatenate-and-repartitions-large-files-for-time-series-and-corr/72691448?noredirect=1#comment128407814_72691448 – Wall-E Jun 24 '22 at 19:09