I have a dataframe with a text column, and I'd like to tokenize it and compute the entropy of each word so that I can remove the words with the highest entropy.
This is the formula I'd like to use:
But I receive errors when I run the code I tried to write in Python:
import nltk
import numpy as np
import pandas as pd
from collections import Counter

nltk.download('punkt')

tokens = df['text'].astype(str).apply(nltk.word_tokenize)  # tokenize each document, not the string repr of the whole column
counter = Counter(token for doc in tokens for token in doc)

word_df = pd.DataFrame.from_dict(counter, orient='index').reset_index()
word_df = word_df.rename(columns={"index": "words", 0: "frequency"})  # a new df with the words and their frequencies

m = len(df)  # number of documents
p_i = word_df['frequency'] / float(m)  # my attempt at the probability of a word in a document
word_df['entropy'] = (p_i * np.log(p_i)) / np.log(m)  # np.log(m) instead of the undefined log(m)
I get the error "cast string to float is not supported".
I'm also not sure whether I have interpreted the formula correctly.
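In case it's useful, here is a minimal sketch of how I think the per-word entropy could be computed, assuming the formula is the standard entropy term weight H_i = -sum_j (p_ij * log(p_ij)) / log(m), with p_ij = f_ij / sum_j f_ij, where f_ij is the frequency of word i in document j and m is the number of documents (the sample dataframe and the 0.9 cutoff below are made up for illustration):

import nltk
import numpy as np
import pandas as pd

nltk.download('punkt')

df = pd.DataFrame({'text': ["the cat sat on the mat", "the dog ran", "a cat and a dog"]})  # made-up sample data

tokenized = df['text'].astype(str).apply(nltk.word_tokenize)
m = len(df)  # number of documents

# document-term matrix: one row per document, one column per word, values are counts
counts = pd.DataFrame([pd.Series(doc).value_counts() for doc in tokenized]).fillna(0)

f_i = counts.sum(axis=0)  # total frequency of each word across all documents
p_ij = counts / f_i  # p_ij = f_ij / sum_j f_ij, computed per word

with np.errstate(divide='ignore', invalid='ignore'):
    plogp = (p_ij * np.log(p_ij)).fillna(0)  # 0 * log(0) treated as 0

entropy = -plogp.sum(axis=0) / np.log(m)  # one normalized entropy value per word, in [0, 1]

threshold = 0.9  # made-up cutoff
to_remove = entropy[entropy > threshold].index  # words with the highest entropy

With this normalization a word spread evenly over all documents gets an entropy close to 1, while a word confined to a single document gets 0, so the high-entropy words would be the ones to drop.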