As long as these replacements do not depend on the values across different rows, and hence can be applied in any order, it's possible to achieve this with .map_partitions:
def apply_masks(df):
    # implement the mask logic here, for example:
    df['outcol'] = df['incol'].mask(df['incol'] == 1, 3)
    return df

ddf = ddf.map_partitions(apply_masks)
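For reference, a hedged end-to-end sketch of how this is wired up and materialised, assuming a toy single-column frame (the same shape as the example further down) and the apply_masks function defined above:

import pandas as pd
import dask.dataframe as dd

# toy frame just for illustration; replace with your actual data
df = pd.DataFrame(range(10), columns=['incol'])
ddf = dd.from_pandas(df, npartitions=3)

# each partition is handed to apply_masks as a plain pandas DataFrame
ddf = ddf.map_partitions(apply_masks)
print(ddf.compute())   # rows where incol == 1 now carry 3 in outcol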
Note that repeated application of .mask() can be a problem, since later masks will overwrite earlier results. So, depending on your use case, a second mask inside apply_masks might need to either base its condition on the original incol (so that values already changed in outcol are not remapped again) or be applied to the outcol column directly, with the caveat that the masks must then be applied in an order that doesn't lead to a miscalculation (e.g. 1 is remapped to 3 and then that 3 is remapped back to 1).
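To illustrate that caveat with a quick pandas-level sketch (hypothetical values, reusing the incol naming): building the conditions from the original column keeps a 1 -> 3 and 3 -> 1 swap independent, while chaining masks against the already-modified result turns the freshly created 3s back into 1s.

import pandas as pd

s = pd.Series([1, 2, 3, 1, 3], name='incol')

# conditions built from the original column: the two replacements stay independent
swapped = s.mask(s == 1, 3).mask(s == 3, 1)
print(swapped.tolist())    # [3, 2, 1, 3, 1]

# conditions built from the already-masked result: the first replacement is undone
tmp = s.mask(s == 1, 3)
clobbered = tmp.mask(tmp == 3, 1)
print(clobbered.tolist())  # [1, 2, 1, 1, 1]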
If your application is such that you are writing to the same column outcol, then you probably want the .replace or .map option (see this answer for a good explanation of the difference between these options). So, in that case the workflow would be:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(range(10), columns=['incol'])
ddf = dd.from_pandas(df, npartitions=3)

replace_logic = {
    1: 3,
    2: 8,
    3: 2,
    # and so on ...
}

# .map turns unmapped values into NaN, so fill them back from incol and restore the dtype
ddf['outcol'] = ddf['incol'].map(replace_logic).fillna(ddf['incol']).astype('int')
print(ddf.compute())
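If you go the .replace route instead, a hedged equivalent is the one-liner below (reusing ddf and replace_logic from the snippet above); unlike .map, values missing from the mapping are left untouched, so no fillna step is needed:

# unmapped values pass through unchanged, so no fillna/astype round-trip is required
ddf['outcol'] = ddf['incol'].replace(replace_logic)
print(ddf.compute())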