How to calculate the information gain of a word given a set of texts?

Question

Given a database containing phrases

Example:

I need to calculate the information gain for each distinct word.

I have studied the concepts of entropy and information gain, but I'm not sure how to apply them to phrases. I saw this link: https://mariuszprzydatek.com/2014/10/31/measuring-entropy-data-disorder-and-information-gain/ However, in my case I have no phrase categories. I need to know which words have the greatest information gain given only the phrases.

This is more appropriate for CrossValidated, I think, rather than StackOverflow. — juanpa.arrivillaga, Mar 22 '17 at 17:43
You first need to assign a value to each sentence before you can figure out how much each word can give you. And you would need more than three sentences. That is much too small a training set. — Buzz, Mar 22 '17 at 17:52
Thank you. My set has 30000 phrases; this is just a simple example to explain the problem better. How do you define a value for each sentence? Manually? — Washington Luiz, Mar 22 '17 at 17:55
Well, you would usually find the information gain of a feature of a data set. I'm assuming you are using the words as features for your sentences. To find the information gain of a feature (in your case, a word), you need the total value for the data set. You could define the value of a sentence as binary, i.e. "do this" or "don't do this", or it might be a level of how much to do something, i.e. on a scale from 1 to 10. It depends on what the sentences mean. — Buzz, Mar 22 '17 at 18:01
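
The point in the comments above, that information gain needs a value per sentence, can be illustrated with a minimal sketch. It assumes each phrase has already been given a binary label and that words are split on whitespace. The phrases, the labels, and the function names `entropy` and `information_gain` are hypothetical and do not come from the question.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(phrases, labels, word):
    """Entropy of the labels minus the entropy left after splitting the
    phrases on whether they contain `word` (whitespace tokenization)."""
    with_word, without_word = [], []
    for phrase, label in zip(phrases, labels):
        if word in phrase.lower().split():
            with_word.append(label)
        else:
            without_word.append(label)
    n = len(labels)
    conditional = (len(with_word) / n) * entropy(with_word) \
        + (len(without_word) / n) * entropy(without_word)
    return entropy(labels) - conditional


# Hypothetical labeled phrases (1 = "do this", 0 = "don't do this").
phrases = ["buy cheap pills", "cheap pills online",
           "meeting at noon", "lunch at noon"]
labels = [1, 1, 0, 0]

for w in ("cheap", "buy", "hello"):
    print(w, round(information_gain(phrases, labels, w), 4))
```

In this toy data, "cheap" separates the two labels perfectly and so gets the maximum gain of 1 bit, while a word that never occurs gets a gain of 0. Without labels, this quantity is not defined, which is the point the comments make.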

score 0 · Answer 1 · edited May 23 '17 at 10:29


Search for the term tf-idf.
Read this question; what you call a "set of text" corresponds to a "document" there.
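
As a rough sketch of what tf-idf looks like when no categories are available, the code below treats every phrase as one document. It uses one common variant (term frequency as count divided by phrase length, inverse document frequency as `log(N / df)`); other weightings exist. The example phrases and the function name `tf_idf` are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(phrases):
    """Return one dict per phrase mapping each word to its tf-idf score.

    tf  = occurrences of the word in the phrase / words in the phrase
    idf = log(number of phrases / number of phrases containing the word)
    """
    docs = [p.lower().split() for p in phrases]
    n_docs = len(docs)

    # Document frequency: in how many phrases does each word appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        scores.append({
            w: (c / len(tokens)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return scores


# Hypothetical phrases standing in for the database rows.
example = ["the cat sat", "the dog barked", "the cat purred"]
for phrase, row in zip(example, tf_idf(example)):
    print(phrase, {w: round(s, 3) for w, s in row.items()})
```

A word that appears in every phrase ("the" here) scores 0, while rarer words score higher. This ranks words by how distinctive they are for a phrase; it is not the same quantity as information gain.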

answered Mar 22 '17 at 19:28 by stovfl · edited May 23 '17 at 10:29 by Community
