In the spirit of "Generating a list of random numbers, summing to 1" from several years ago, is there a way to apply the array returned by np.random.dirichlet to each group of a dataframe groupby?

For example, I can loop through the unique values of the letter column and apply one at a time:

import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1], ['a', 3], ['a', 2], ['a', 6],
                   ['b', 7], ['b', 5], ['b', 4]], columns=['letter', 'value'])
df['grp_sum'] = df.groupby('letter')['value'].transform('sum')
# One Dirichlet draw across all rows: proportions sum to 1 over the whole frame
df['prop_of_total'] = np.random.dirichlet(np.ones(len(df)))

# One Dirichlet draw per letter: proportions sum to 1 within each group
for letter in df['letter'].unique():
    mask = df['letter'] == letter
    df.loc[mask, 'prop_of_grp'] = np.random.dirichlet(np.ones(mask.sum()))
print(df)

results in:

  letter  value  grp_sum  prop_of_total  prop_of_grp
0      a      1       12       0.015493     0.293481
1      a      3       12       0.114027     0.043973
2      a      2       12       0.309150     0.160818
3      a      6       12       0.033999     0.501729
4      b      7       16       0.365276     0.617484
5      b      5       16       0.144502     0.318075
6      b      4       16       0.017552     0.064442

but there has to be a better way than iterating over the unique values and filtering the dataframe for each one. This example is small, but I will potentially have tens of thousands of groups of varying sizes, roughly 50-100 rows each, and each needs its own random distribution.

I have also considered creating a temporary dataframe for each group, appending it to a second dataframe and finally merging the results, though that seems more convoluted than the loop. I have not found a way to assign an array whose length matches the group size back to each group, but I think something along those lines would do.

Thoughts? Suggestions? Solutions?

bhansenme

1 Answer

IIUC, do a transform():

def direchlet(x):
    # x is one group's Series; only its length is used
    return np.random.dirichlet(np.ones(len(x)))

df['prop_of_grp'] = df.groupby('letter')['value'].transform(direchlet)

Output:

  letter  value  grp_sum  prop_of_total  prop_of_grp
0      a      1       12       0.102780     0.127119
1      a      3       12       0.079201     0.219648
2      a      2       12       0.341158     0.020776
3      a      6       12       0.096956     0.632456
4      b      7       16       0.193970     0.269094
5      b      5       16       0.012905     0.516035
6      b      4       16       0.173031     0.214871
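
As a quick sanity check, the proportions within each letter should sum to 1. The following is a minimal sketch that rebuilds the example frame from the question and reuses the `direchlet` helper; the name `group_totals` is only an illustrative choice.

```python
import numpy as np
import pandas as pd

def direchlet(x):
    # One Dirichlet(1, ..., 1) draw with as many components as the group has rows
    return np.random.dirichlet(np.ones(len(x)))

df = pd.DataFrame([['a', 1], ['a', 3], ['a', 2], ['a', 6],
                   ['b', 7], ['b', 5], ['b', 4]], columns=['letter', 'value'])
df['prop_of_grp'] = df.groupby('letter')['value'].transform(direchlet)

# Each group's proportions should add up to 1 (up to floating-point error)
group_totals = df.groupby('letter')['prop_of_grp'].sum()
print(group_totals)
```

The values themselves change on every run because the draws are random; only the per-group totals are stable.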
Quang Hoang
  • This worked exactly as desired. As I learned, 'value' in your reply can be any column in the dataframe; I used the grouping column itself: `df['prop_of_grp'] = df.groupby('letter')['letter'].transform(direchlet)`. It could just as well have been grp_sum or prop_of_total, but some column has to be selected. Thanks for the help! – bhansenme Nov 15 '19 at 14:13
  • @bhansenme that is true; `value` can be replaced by any column because we only need the `len` of the group. – Quang Hoang Nov 15 '19 at 14:15
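
For tens of thousands of groups, calling a Python function once per group may still be slow. One possible alternative, which is not part of the answer above, relies on the fact that a Dirichlet draw with all concentration parameters equal to 1 has the same distribution as independent Exponential(1) draws divided by their sum. The sketch below assumes that equivalence is acceptable for the use case; `rng` and `draws` are illustrative names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()  # pass a seed here for reproducible draws

df = pd.DataFrame([['a', 1], ['a', 3], ['a', 2], ['a', 6],
                   ['b', 7], ['b', 5], ['b', 4]], columns=['letter', 'value'])

# One Exponential(1) draw per row, generated for the whole frame at once
draws = pd.Series(rng.exponential(size=len(df)), index=df.index)

# Dividing by the per-group total normalizes each group to sum to 1
df['prop_of_grp'] = draws / draws.groupby(df['letter']).transform('sum')
print(df)
```

This swaps the per-group `np.random.dirichlet` call for a single random draw plus a built-in `transform('sum')`, so no user-defined function runs per group. It only covers the flat case `np.ones(n)` used in the question.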