I have data I want to group by city and day (separate columns) and calculate a new value using the remaining columns. More specifically, the other columns are counts of people by race, for 6 races. Therefore, I have 8 columns, the two grouping ones and the 6 I want to summarize. I want to summarize them by calculating the entropy per city-day.
However, city and day are strings, and my entropy function raises an error when the grouping columns are strings. It works when the grouping columns are int64. I have tried converting the city and day columns to dummy variables, but the error remains.
Borrowing from this post, below is an example in which my function works with integer grouping columns and then fails once one of them is converted to a string.
import numpy as np
import pandas as pd

# The function
def newEntropy(x):
    A = x
    pA = A / A.sum()
    # nansum skips the NaN produced by 0 * log2(0)
    Shannon2 = -np.nansum(pA * np.log2(pA))
    return Shannon2
# Make fake data
df = pd.DataFrame(np.random.rand(20,5), columns=list('abcde'))
df['group'] = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5]
df['group2'] = [6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10]
# Works
df.groupby(['group', 'group2']).apply(newEntropy)
# Having a grouping column that is a string causes failure
df['group2'] = df['group2'].astype('str')
df.groupby(['group', 'group2']).apply(newEntropy)
I need to figure out how to make newEntropy work with string grouping columns. It seems like it should ignore the grouping columns, but that is not the case. I would prefer not to convert 'group2' to int64, because in my real data it is a date in 'YYYY-MM-DD' format. My data's equivalent of 'group' is a country name, which I also prefer to keep as a string.
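One possible way to keep the grouping columns out of the calculation is to select the value columns explicitly after the groupby, so that newEntropy only ever receives the numeric data. This is a minimal sketch on the fake data above, not a confirmed fix for the real data; the name value_cols and the seeded random generator are only for illustration.

```python
import numpy as np
import pandas as pd

def newEntropy(x):
    pA = x / x.sum()
    return -np.nansum(pA * np.log2(pA))

# Fake data with a string grouping column (seeded so it is reproducible)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, 5)), columns=list('abcde'))
df['group'] = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5]
df['group2'] = [6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10]
df['group2'] = df['group2'].astype('str')

# Assumption: these are the only columns that should enter the entropy
value_cols = list('abcde')

# Selecting value_cols means each group passed to newEntropy
# contains no grouping columns, whatever their dtype
result = df.groupby(['group', 'group2'])[value_cols].apply(newEntropy)
print(result)
```

The result is a Series indexed by ('group', 'group2'), with the string keys left untouched.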
I should say that I can make a new dataframe containing the grouping I want and then apply newEntropy to that. It would just be nice to have something more concise; it feels like it should be easier.
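For reference, the two-step workaround could look roughly like the sketch below. It assumes the goal is one entropy value per group, computed across the summed value columns (one column per race in the real data), which is only my reading of the description. The names value_cols, totals, and entropy are illustrative.

```python
import numpy as np
import pandas as pd

def newEntropy(x):
    pA = x / x.sum()
    return -np.nansum(pA * np.log2(pA))

# Fake data with a string grouping column (seeded so it is reproducible)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, 5)), columns=list('abcde'))
df['group'] = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5]
df['group2'] = [6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10]
df['group2'] = df['group2'].astype('str')

value_cols = list('abcde')

# Step 1: a new dataframe with one row of totals per (group, group2)
totals = df.groupby(['group', 'group2'])[value_cols].sum()

# Step 2: apply newEntropy to each row, i.e. across the value columns
entropy = totals.apply(newEntropy, axis=1)
print(entropy)
```

Note that this normalizes across the columns of each summed row, so its numbers differ from calling newEntropy on the whole group at once.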