
I have a dataframe with a `text` column. I'd like to tokenize it and compute the entropy of each word, so that I can remove the words with the highest entropy.

This is the formula I'd like to use:

Entropy_i = −(1 / log m) · Σ_j p_ij · log(p_ij)

where p_ij is the probability of word i in document j and m is the number of documents.

But I get errors when I run the code I wrote in Python:

```python
import nltk
import numpy as np
import pandas as pd
from collections import Counter

nltk.download('punkt')
Data = nltk.word_tokenize(str(df['text']))  # tokenize text

counter = Counter(Data)
df = pd.DataFrame.from_dict(counter, orient='index').reset_index()
df = df.rename(columns={"index": "words", 0: "entropy"})  # a df with words and, temporarily, their frequencies

shannon_entropy_value = 0
m = len(df)  # number of documents
p_i = df['entropy'] / float(m)  # probability of a word in a doc
entropy_i = (p_i * np.log(p_i)) / np.log(m)
shannon_entropy_value += entropy_i

df['entropy'] = shannon_entropy_value
```

I get the error "cast string to float is not supported".

And I'm also not sure if I interpreted the formula correctly.
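For comparison, here is a minimal sketch of one way to compute a normalized per-word entropy across documents. The sample corpus is made up for illustration, and plain `str.split` stands in for `nltk.word_tokenize` to keep the example self-contained; note that each document is tokenized separately, rather than calling `str()` on the whole Series (which stringifies the index and column metadata too):

```python
import numpy as np
import pandas as pd
from collections import Counter

# Illustrative corpus; in the question this dataframe already exists.
df = pd.DataFrame({'text': [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]})

m = len(df)  # number of documents

# Tokenize each document separately (str.split as a simple stand-in
# for nltk.word_tokenize).
counts = [Counter(tokens) for tokens in df['text'].str.split()]
vocab = sorted(set().union(*counts))

rows = []
for word in vocab:
    freqs = np.array([c[word] for c in counts], dtype=float)
    p = freqs[freqs > 0] / freqs.sum()  # p_ij: share of this word's occurrences in doc j
    # Normalized Shannon entropy: 0 = concentrated in one doc,
    # 1 = spread evenly over all m docs.
    h = -(p * np.log(p)).sum() / np.log(m)
    rows.append((word, h))

entropy_df = pd.DataFrame(rows, columns=['words', 'entropy'])
print(entropy_df.sort_values('entropy', ascending=False))
```

Words appearing in only one document get entropy 0, while words spread evenly across documents (like "the" above) approach 1, so filtering out high-entropy words drops the most uniformly distributed ones.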

    What errors do you receive? Please edit the _full_ error message into your question. – DisappointedByUnaccountableMod Nov 15 '20 at 18:43
  • Not sure if some of these answers help but worth checking out: [Fastest way to compute entropy in Python](https://stackoverflow.com/questions/15450192/fastest-way-to-compute-entropy-in-python) – Bill Nov 15 '20 at 22:10

0 Answers