I need to use pd.cut on a dask dataframe.

This answer indicates that map_partitions will work by passing pd.cut as the function.

It seems that map_partitions passes only one partition at a time to the function. However, pd.cut will need access to an entire column of my df in order to create the bins. So, my question is: will map_partitions in this case actually operate on the entire dataframe, or am I going to get incorrect results with this approach?

SultanOrazbayev
dgLurn

1 Answer


In your question you correctly identify why the bins should be provided explicitly.

By specifying the exact bin cuts (either based on some calculation or external reasoning), you ensure that the categories dask produces are comparable across partitions.

# this does not guarantee comparable cuts
ddf['a'].map_partitions(pd.cut)

# this ensures the cuts are as per the specified bins
ddf['a'].map_partitions(pd.cut, bins)
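To see concretely why per-partition cuts are not comparable, here is a small pandas-only sketch; the two slices stand in for what each partition would see on its own:

```python
import pandas as pd

s = pd.Series(range(10))
part1, part2 = s[:5], s[5:]

# with an integer `bins`, pd.cut derives the edges from each
# input's own min/max, so the two halves get different intervals
c1 = pd.cut(part1, 3)
c2 = pd.cut(part2, 3)
print(c1.cat.categories)  # intervals spanning roughly 0..4
print(c2.cat.categories)  # intervals spanning roughly 5..9
```

The resulting categories do not line up, which is exactly the incorrect-results scenario the question worries about.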

If you want to generate bins in an automatic way, one way is to get the min/max for the column of interest and generate the bins with np.linspace:

# note that computation is needed to give
# actual (not delayed) values to np.linspace
bmin, bmax = dask.compute(ddf['a'].min(), ddf['a'].max())

# specify the number of desired cuts here
bins = np.linspace(bmin, bmax, num=123)
SultanOrazbayev
  • Thanks! Is it also then correct that if I specify just an integer for bins, comparable cuts would also not be guaranteed? Is there any way with dask to have it run some function (pd.cut or otherwise) on an entire column? Since .apply only allows axis of 1, .apply is not an option. Ideally I would pass an integer for bins and let pd.cut determine the bins vs. separately determining them first. – dgLurn Jun 01 '21 at 15:08
  • See updated answer, though it's not ideal due to the intermediate computation. – SultanOrazbayev Jun 01 '21 at 17:30
  • Thanks much. This accomplished what I was looking to do. – dgLurn Jun 02 '21 at 18:44