I have a dataframe with a text column, and I'd like to tokenize it and compute the entropy of each word so that I can remove the words with the highest entropy.
This is the formula I'd like to use:
But I receive errors when I run the code I tried to write in Python:
import nltk
import numpy as np
import pandas as pd
from collections import Counter

nltk.download('punkt')

tokens = df['text'].astype(str).apply(nltk.word_tokenize)  # tokenize each document, not the string repr of the whole column
counter = Counter(token for doc in tokens for token in doc)

word_df = pd.DataFrame.from_dict(counter, orient='index').reset_index()
word_df = word_df.rename(columns={"index": "words", 0: "frequency"})  # a new df with the words and their frequencies

m = len(df)  # number of documents
p_i = word_df['frequency'] / float(m)  # my attempt at the probability of a word in a document
word_df['entropy'] = (p_i * np.log(p_i)) / np.log(m)  # np.log(m) instead of the undefined log(m)
I get the error "cast string to float is not supported".
I'm also not sure whether I have interpreted the formula correctly.
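In case it's useful, here is a minimal sketch of how I think the per-word entropy could be computed, assuming the formula is the standard entropy term weight H_i = -sum_j (p_ij * log(p_ij)) / log(m), with p_ij = f_ij / sum_j f_ij, where f_ij is the frequency of word i in document j and m is the number of documents (the sample dataframe and the 0.9 cutoff below are made up for illustration):

import nltk
import numpy as np
import pandas as pd

nltk.download('punkt')

df = pd.DataFrame({'text': ["the cat sat on the mat", "the dog ran", "a cat and a dog"]})  # made-up sample data

tokenized = df['text'].astype(str).apply(nltk.word_tokenize)
m = len(df)  # number of documents

# document-term matrix: one row per document, one column per word, values are counts
counts = pd.DataFrame([pd.Series(doc).value_counts() for doc in tokenized]).fillna(0)

f_i = counts.sum(axis=0)  # total frequency of each word across all documents
p_ij = counts / f_i  # p_ij = f_ij / sum_j f_ij, computed per word

with np.errstate(divide='ignore', invalid='ignore'):
    plogp = (p_ij * np.log(p_ij)).fillna(0)  # 0 * log(0) treated as 0

entropy = -plogp.sum(axis=0) / np.log(m)  # one normalized entropy value per word, in [0, 1]

threshold = 0.9  # made-up cutoff
to_remove = entropy[entropy > threshold].index  # words with the highest entropy

With this normalization a word spread evenly over all documents gets an entropy close to 1, while a word confined to a single document gets 0, so the high-entropy words would be the ones to drop.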