
I am hoping to get a performance gain by using a Dask DataFrame instead of Pandas on a 6-core MacBook Pro. However, Dask is performing just as slowly as the Pandas version, which takes roughly 5 minutes.

What am I doing wrong here?

import dask.dataframe as dd

ddf = dd.from_pandas(df.set_index('customer seq').sort_index(), npartitions=8)
ddf = ddf.set_index(ddf.index, sorted=True)
paired = ddf.groupby(ddf.index, group_keys=False).apply(retention_contract).compute(scheduler='processes')
  • If you want optimization, use vectorization, not .apply(): https://stackoverflow.com/a/54432584/9936329 – Violatic Jul 12 '19 at 08:51
  • @Violatic I can't vectorize the retention_contract function; it has rather complex logic in it which takes the whole group as input, calculates various intermediate variables, and performs multiple condition checks. – siminsimisim Jul 12 '19 at 09:15

1 Answer


Performance depends on a large number of things. It's quite common for Dask DataFrame not to provide a speedup over Pandas, especially for datasets that fit comfortably into memory.

However, if your apply function is quite slow, then you might consider using processes instead of threads (threads are the default for Dask DataFrame), especially if that function is GIL-bound. See https://docs.dask.org/en/latest/scheduling.html for more information.
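For example, the scheduler can be chosen per compute call. Here is a minimal sketch, where slow_python_func is a hypothetical stand-in for something like retention_contract:

import pandas as pd
import dask.dataframe as dd

def slow_python_func(group):
    # Stand-in for a slow, GIL-holding pure-Python function:
    # an explicit loop keeps the interpreter (and the GIL) busy
    total = 0.0
    for v in group['value']:
        total += v
    return total

df = pd.DataFrame({'key': [1, 1, 2, 2], 'value': [1.0, 2.0, 3.0, 4.0]})
ddf = dd.from_pandas(df, npartitions=2)

# meta tells Dask the shape/dtype of the result without running the function
grouped = ddf.groupby('key').apply(slow_python_func, meta=('value', 'f8'))

# Threaded scheduler (the default): parallelism is limited by the GIL
result_threads = grouped.compute()

# Process-based scheduler: sidesteps the GIL, at the cost of
# serializing data between worker processes
result_processes = grouped.compute(scheduler='processes')

Note that with the processes scheduler the applied function and its inputs must be picklable, and the serialization cost can outweigh the gain if each group is small.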

In general, though, using groupby-apply adds a lot of overhead, regardless of whether you're using Pandas or Dask DataFrame.
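To see why, compare a per-group Python call against a built-in reduction. This is a toy sketch (the asker's retention_contract reportedly can't be reduced this way), but it illustrates the overhead gap the comment above points at:

import pandas as pd

df = pd.DataFrame({'key': [1, 1, 2, 2], 'value': [1.0, 2.0, 3.0, 4.0]})

# groupby-apply: invokes a Python function once per group (high overhead)
slow = df.groupby('key').apply(lambda g: g['value'].sum())

# Built-in vectorized aggregation: same result, computed in optimized C
fast = df.groupby('key')['value'].sum()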

MRocklin