
I have data on a bunch of names (> 10 million) and their associated counts.

import pandas as pd
import numpy as np


data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
}

df = pd.DataFrame(data)
print(df)
    Name  Count
0   Sara     20
1   John     10
2   Mark      5
3  Peter      2
4   Kate      5

I want to compute the entropy of the Count column WITHOUT expanding the data to be like [Sara, Sara, Sara,...,Kate, Kate, Kate] because there are just too many observations for that.

How would I compute entropy of Count without expanding the data?

thewhitetie
    Define Entropy. – Mohit Motwani Sep 23 '19 at 12:21
  • Isn't entropy easily calculated based upon the counts by converting them to bin probabilities (each count divided by total count), and summing -pi*log(pi) (see https://stackoverflow.com/questions/15450192/fastest-way-to-compute-entropy-in-python) – DarrylG Sep 23 '19 at 12:29

2 Answers


Assuming the dataframe contains the value counts for each name, you can feed the Series of counts directly to scipy.stats.entropy:

from scipy.stats import entropy

entropy(df.set_index('Name').squeeze())
# 1.3466893828909594

As @nils mentions, if what you want is the binary entropy, you can set base=2.
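As a sketch of that option: entropy with base=2 reports the result in bits rather than nats, and the two differ only by a factor of ln 2 (the counts below are the ones from the question):

```python
import numpy as np
from scipy.stats import entropy

counts = np.array([20, 10, 5, 2, 5])  # the Count column from the question

h_nat = entropy(counts)            # natural log -> nats
h_bits = entropy(counts, base=2)   # base-2 log -> bits

# Changing the base only rescales the result by ln 2
print(h_nat, h_bits, h_nat / np.log(2))
```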

yatu

If you want to compute the Shannon entropy H = -Sum[ P(xi) * log2(P(xi)) ] yourself:

import pandas as pd
import numpy as np


data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
}

df = pd.DataFrame(data)
df['prob'] = df['Count'] / df['Count'].sum()  # P(xi)
df['log'] = np.log2(df['prob'])               # log2(P(xi)), vectorized
df['prod'] = df['prob'] * df['log']           # P(xi) * log2(P(xi))

print('Entropy: ', -df['prod'].sum())
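As a sanity check (a sketch, assuming scipy is installed), this manual base-2 computation should agree with scipy.stats.entropy called with base=2 on the raw counts, since scipy normalizes the counts to probabilities internally:

```python
import numpy as np
from scipy.stats import entropy

counts = np.array([20, 10, 5, 2, 5])
probs = counts / counts.sum()

# Manual Shannon entropy in bits, vectorized with numpy
h_manual = -(probs * np.log2(probs)).sum()

# scipy normalizes the counts itself, so both give the same value
print(h_manual, entropy(counts, base=2))
```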
Ricardo Sanchez