
I have a database containing phrases.

Example:

  1. check work slow

  2. work wallpapers

  3. work needed reply notification working groups

I need to calculate the information gain for each distinct word.

  1. IG('work')
  2. IG('check')
  3. ....

I have studied the concepts of entropy and information gain, but I'm not sure how to apply them to phrases. I saw this link: https://mariuszprzydatek.com/2014/10/31/measuring-entropy-data-disorder-and-information-gain/ But in my case I have no phrase categories. I need to know which words have the greatest information gain given only the phrases.

  • This is more appropriate for CrossValidated, I think, rather than StackOverflow. – juanpa.arrivillaga Mar 22 '17 at 17:43
  • You first need to assign a value to each sentence before you can figure out how much information each word gives you. And you would need more than three sentences; that is much too small a training set. – Buzz Mar 22 '17 at 17:52
  • Thank you. My set has 30000 phrases, this is a simple example to explain better. How do you define a value for each sentence? Manually? – Washington Luiz Mar 22 '17 at 17:55
  • Well, you would usually find the info gain of a feature of a set of data. I'm assuming you are using the words as features for your sentences. To find the info gain of a feature (or in your case, a word) you need the total value for the set of data. You could define the value of a sentence as a binary, i.e. "do this" or "don't do this", or it might be a level of how much to do something, i.e. on a scale from 1 to 10. It depends on what the sentences mean. – Buzz Mar 22 '17 at 18:01
  • Thanks Buzz. Best – Washington Luiz Mar 22 '17 at 19:10
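
To make the point in the comments concrete, here is a minimal sketch of information gain for a word once each phrase has a value. The labels (`"issue"` / `"other"`) are invented purely for illustration, and the helper names `entropy` and `information_gain` are hypothetical; the feature is simply "the phrase contains the word", split on whitespace.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(phrases, labels, word):
    """Information gain of the binary feature 'phrase contains word'."""
    with_word = [lab for p, lab in zip(phrases, labels) if word in p.split()]
    without_word = [lab for p, lab in zip(phrases, labels) if word not in p.split()]
    total = len(labels)
    # Weighted entropy of the two subsets produced by the split.
    remainder = ((len(with_word) / total) * entropy(with_word)
                 + (len(without_word) / total) * entropy(without_word))
    return entropy(labels) - remainder


phrases = [
    "check work slow",
    "work wallpapers",
    "work needed reply notification working groups",
]
labels = ["issue", "other", "issue"]  # assumed labels, not from the question

for word in ("work", "check", "wallpapers"):
    print(word, round(information_gain(phrases, labels, word), 4))
```

With these assumed labels, 'work' appears in every phrase, so splitting on it separates nothing and its gain is zero, while 'wallpapers' splits the labels perfectly. Without labels of some kind there is no entropy to reduce, which is why the phrases alone are not enough.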

1 Answer


Search for the term tf-idf.
Read this question, treating each of your phrases as a document:

interpreting-the-sum-of-tf-idf-scores-of-words-across-documents
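
As a rough sketch of that idea, the snippet below treats each phrase as a document and sums the tf-idf score of every word across all phrases. The function name `tfidf_totals` is hypothetical, and the weighting is an assumption: tf is the word count divided by the phrase length and idf is the unsmoothed `log(N / df)`. Libraries often use smoothed variants, so absolute numbers will differ.

```python
import math
from collections import Counter


def tfidf_totals(phrases):
    """Sum of tf-idf scores per word, treating each phrase as a document."""
    docs = [p.split() for p in phrases]
    n_docs = len(docs)
    # Document frequency: number of phrases that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    totals = Counter()
    for doc in docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)               # relative frequency in this phrase
            idf = math.log(n_docs / df[word])   # unsmoothed idf (assumed variant)
            totals[word] += tf * idf
    return totals


phrases = [
    "check work slow",
    "work wallpapers",
    "work needed reply notification working groups",
]

scores = tfidf_totals(phrases)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {score:.4f}")
```

Under this weighting a word that occurs in every phrase, such as 'work' here, scores zero, and words concentrated in a few short phrases score highest. This needs no categories, but it is a ranking heuristic rather than information gain in the strict sense.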

stovfl