
I have data I want to group by city and day (separate columns) and calculate a new value using the remaining columns. More specifically, the other columns are counts of people by race, for 6 races. Therefore, I have 8 columns: the two grouping ones and the 6 I want to summarize. I want to summarize them by calculating the entropy per city-day.

However, city and day are strings, and my entropy function does not like that. It works when the grouping columns are int64. I have tried to convert the city and day columns to dummy variables, but the error remains.

Borrowing from this post, below is an example using my function that works.

import numpy as np
import pandas as pd

# The function
def newEntropy(x):
    A = x

    # Normalize to probabilities, then compute Shannon entropy in bits
    pA = A / A.sum()
    Shannon2 = -np.nansum(pA * np.log2(pA))

    return Shannon2

# Make fake data
df = pd.DataFrame(np.random.rand(20,5), columns=list('abcde'))
df['group'] = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5]
df['group2'] = [6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10]

# Works
df.groupby(['group', 'group2']).apply(newEntropy)

# Having an index column that is a string causes failure
df['group2'] = df['group2'].astype('str')
df.groupby(['group', 'group2']).apply(newEntropy)

I need to figure out how to make newEntropy work. It seems like it should ignore the grouping columns, but that is not the case. I would also prefer not to convert 'group2' to int64 because in my real data it is a 'YYYY-MM-DD' date string. My data's equivalent of 'group' is a country name, which I also prefer to keep as a string.

I should say that I can make a new dataframe that is the grouping that I want and then apply newEntropy to that. It would just be nice to have something more concise; it feels like it should be easier.
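For reference, the two-step workaround mentioned above might look like the sketch below. The column names (`city`, `day`, `race1`…`race6`) are made-up stand-ins for the real data, which isn't shown in the question:

```python
import numpy as np
import pandas as pd

def newEntropy(x):
    # Shannon entropy (in bits) of the values in x
    pA = x / x.sum()
    return -np.nansum(pA * np.log2(pA))

# Hypothetical stand-in for the real data: counts by race per city-day
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 50, size=(12, 6)),
                  columns=['race1', 'race2', 'race3', 'race4', 'race5', 'race6'])
df['city'] = ['Oslo'] * 6 + ['Lima'] * 6
df['day'] = ['2019-05-01', '2019-05-02'] * 6

# Step 1: aggregate to one row of counts per city-day
counts = df.groupby(['city', 'day']).sum()

# Step 2: entropy across the six count columns of each row
entropy = counts.apply(newEntropy, axis=1)
```

This works with string grouping columns because `groupby` moves them into the index, so `newEntropy` only ever sees the numeric count columns.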

ZacharyST

1 Answer


How about specifying the columns you want to apply the function to after the groupby:

df.groupby(['group', 'group2'])[list('abcde')].apply(newEntropy)
Out[191]: 
group  group2
0      6         6.057044
       7        -0.000000
1      7         4.485942
2      7         4.879091
       8         3.727744
       9        -0.000000
3      9         4.751447
4      9        -0.000000
       10        8.993928
5      10        4.191522
dtype: float64
BENY
    Awesome, thanks. One slight modification: df.groupby(['group', 'group2'])[list('abcde')].agg('sum').apply(newEntropy) – ZacharyST May 08 '19 at 16:14
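A note of caution on the modification in the comment: `DataFrame.apply` operates column-wise by default, so applying `newEntropy` after `.agg('sum')` gives one entropy per value column across groups. If the intent is instead one entropy per group computed across the summed columns (as the question's city-day/race setup suggests, though that is an assumption here), passing `axis=1` does that:

```python
import numpy as np
import pandas as pd

def newEntropy(x):
    pA = x / x.sum()
    return -np.nansum(pA * np.log2(pA))

np.random.seed(0)
df = pd.DataFrame(np.random.rand(20, 5), columns=list('abcde'))
df['group'] = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5]
df['group2'] = ['6', '6', '6', '7', '7', '7', '7', '7', '8', '8',
                '9', '9', '9', '9', '10', '10', '10', '10', '10', '10']

# Sum the five value columns per group pair, then take entropy
# across each row's five sums: one value per (group, group2)
row_entropy = (df.groupby(['group', 'group2'])[list('abcde')]
                 .sum()
                 .apply(newEntropy, axis=1))
```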