As long as these replacements do not depend on the values across different rows, and hence can be applied in any order, it's possible to achieve this with .map_partitions:
def apply_masks(df):
    # implement the mask logic here, for example:
    df['outcol'] = df['incol'].mask(df['incol'] == 1, 3)
    return df

ddf = ddf.map_partitions(apply_masks)
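For reference, a hedged end-to-end sketch of how this is wired up and materialised, assuming a toy single-column frame (the same shape as the example further down) and the apply_masks function defined above:

import pandas as pd
import dask.dataframe as dd

# toy frame just for illustration; replace with your actual data
df = pd.DataFrame(range(10), columns=['incol'])
ddf = dd.from_pandas(df, npartitions=3)

# each partition is handed to apply_masks as a plain pandas DataFrame
ddf = ddf.map_partitions(apply_masks)
print(ddf.compute())   # rows where incol == 1 now carry 3 in outcol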
Note that repeated application of .mask() can be a problem, since later masks will overwrite earlier results. So, depending on your use case, a second mask inside apply_masks might need to either base its condition on the original incol (so that values already changed in outcol are not remapped again) or be applied to the outcol column directly, with the caveat that the masks must then be applied in an order that doesn't lead to a miscalculation (e.g. 1 is remapped to 3 and then that 3 is remapped back to 1).
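To illustrate that caveat with a quick pandas-level sketch (hypothetical values, reusing the incol naming): building the conditions from the original column keeps a 1 -> 3 and 3 -> 1 swap independent, while chaining masks against the already-modified result turns the freshly created 3s back into 1s.

import pandas as pd

s = pd.Series([1, 2, 3, 1, 3], name='incol')

# conditions built from the original column: the two replacements stay independent
swapped = s.mask(s == 1, 3).mask(s == 3, 1)
print(swapped.tolist())    # [3, 2, 1, 3, 1]

# conditions built from the already-masked result: the first replacement is undone
tmp = s.mask(s == 1, 3)
clobbered = tmp.mask(tmp == 3, 1)
print(clobbered.tolist())  # [1, 2, 1, 1, 1]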
If your application is such that you are writing to the same column outcol, then you probably want the .replace or .map option (see this answer for a good explanation of the difference between these options). So, in that case the workflow would be:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(range(10), columns=['incol'])
ddf = dd.from_pandas(df, npartitions=3)

replace_logic = {
    1: 3,
    2: 8,
    3: 2,
    # and so on ...
}

# .map turns unmapped values into NaN, so fill them back from incol and restore the dtype
ddf['outcol'] = ddf['incol'].map(replace_logic).fillna(ddf['incol']).astype('int')
print(ddf.compute())
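If you go the .replace route instead, a hedged equivalent is the one-liner below (reusing ddf and replace_logic from the snippet above); unlike .map, values missing from the mapping are left untouched, so no fillna step is needed:

# unmapped values pass through unchanged, so no fillna/astype round-trip is required
ddf['outcol'] = ddf['incol'].replace(replace_logic)
print(ddf.compute())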