Probably a continuation of this question, working from the dask docs examples for map_partitions.

import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)

from random import randint

def myadd(df):
    new_value = df.x + randint(1,4)
    return new_value

res = ddf.map_partitions(lambda df: df.assign(z=myadd)).compute()
res

In the above code, randint is only being called once, not once per row as I would expect. How come?

Output:

X Y Z
1 1 4
2 2 5
3 3 6
4 4 7
5 5 8

mdurant
F.D

1 Answer

If you performed the same operation (df.x + randint(1,4)) on the original pandas dataframe, you would get only one random number, which is then added to every value in the column. Dask does exactly the same as pandas here, except that the function is called once for each partition, because that is what map_partitions does.
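To make the pandas behaviour concrete, here is a minimal sketch that counts how often the random function runs. The wrapper counted_randint and the counter calls are hypothetical names introduced only for this illustration; they are not part of the question's code.

```python
import random

import pandas as pd

calls = 0  # hypothetical counter, only for this illustration


def counted_randint(a, b):
    """Wrap random.randint and record how many times it is invoked."""
    global calls
    calls += 1
    return random.randint(a, b)


df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})

# The right-hand side is evaluated once, yielding a single integer,
# which pandas then broadcasts across the whole column.
shifted = df.x + counted_randint(1, 4)

# Every row received the same offset.
offsets = shifted - df.x
print(calls, offsets.nunique())
```

With map_partitions the same thing happens inside each partition, so rows in different partitions may get different offsets, but rows within one partition always share one.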

If you want a new random number for every row, first think about how you would achieve this with pandas. Two ways come to mind immediately:

df.x.map(lambda x: x + random.randint(1, 4))  # needs "import random"

or

df.x + np.random.randint(1, 5, size=len(df.x))  # needs "import numpy as np"; NumPy's upper bound is exclusive, so 5 gives 1 to 4

If you replace your new_value = line with one of these, it will work as expected.

mdurant