How to use tf-idf with Naive Bayes?

Question

While searching on this question, I found many links that propose a solution but do not explain exactly how it is to be done. I have explored, for example, the following links:

Link 1

Link 2

Link 3

Link 4

etc.

Therefore, I am presenting my understanding as to how the Naive Bayes formula with tf-idf can be used here and it is as follows:

Naive-Bayes formula :

P(word|class)=(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_all_classes(basically vocabulary of words in the entire training set))

tf-idf weighting can be employed in the above formula as:

word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.

total_words_in_class : sum of (tf-idf weights of all the words belonging to that class) 

total_unique_words_in_all_classes : as is.

This question has been posted multiple times on Stack Overflow, but nothing substantial has been answered so far. I want to know whether the way I am thinking about the problem, i.e. the implementation shown above, is correct. I need to know this because I am implementing Naive Bayes myself, without the help of any Python library that provides built-in functions for Naive Bayes or tf-idf. What I actually want is to improve the accuracy (currently 30%) of the model, which uses a classifier trained with Naive Bayes. So, if there are better ways to achieve good accuracy, suggestions are welcome.

Please advise. I am new to this domain.

score 8 · Accepted Answer · answered May 24 '16 at 08:49

It would be better if you actually gave us the exact features and class you would like to use, or at least give an example. Since none of those have been concretely given, I'll just assume the following is your problem:

You have a number of documents, each of which has a number of words.
You would like to classify documents into categories.
Your feature vector consists of all possible words in all documents, and has values of number of counts in each document.

Your Solution

The tf idf you gave is the following:

word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.

total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)

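Your approach sounds reasonable. The conditional probabilities would still sum to 1 over the vocabulary, independent of the tf-idf function, because the denominator (total weight in the class plus vocabulary size) is exactly the sum of the numerators over all vocabulary words. The features would also reflect the tf-idf values. I would say this looks like a solid way to incorporate tf-idf into NB.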
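As a rough illustration of the scheme above, here is a minimal sketch in plain Python with no external libraries. The function names, the toy documents, and the particular tf-idf variant (length-normalized tf times a smoothed idf) are my own assumptions and are not part of the question; any non-negative tf-idf weighting could be plugged in instead.

```python
import math
from collections import Counter, defaultdict

def tfidf_per_document(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Assumed variant: tf = count / doc length, smoothed idf > 0.
            word: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return weights

def train(docs, labels):
    """Estimate priors and P(word|class) with tf-idf sums replacing counts."""
    vocab = {word for doc in docs for word in doc}
    class_word_weight = defaultdict(lambda: defaultdict(float))
    for doc_weights, label in zip(tfidf_per_document(docs), labels):
        for word, weight in doc_weights.items():
            class_word_weight[label][word] += weight
    cond_prob = {}
    for label, word_weight in class_word_weight.items():
        total = sum(word_weight.values())  # "total_words_in_class"
        cond_prob[label] = {
            # Add-one smoothing over the whole training vocabulary.
            word: (word_weight.get(word, 0.0) + 1) / (total + len(vocab))
            for word in vocab
        }
    priors = {label: n / len(labels) for label, n in Counter(labels).items()}
    return priors, cond_prob, vocab

def predict(tokens, priors, cond_prob, vocab):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, prior in priors.items():
        scores[label] = math.log(prior) + sum(
            math.log(cond_prob[label][word]) for word in tokens if word in vocab
        )
    return max(scores, key=scores.get)

# Toy data, invented for illustration only.
docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "now"],
    ["meeting", "tomorrow", "noon"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]
priors, cond_prob, vocab = train(docs, labels)
prediction = predict(["cheap", "buy"], priors, cond_prob, vocab)
```

Summing `cond_prob[label].values()` for either class gives 1 up to floating-point error, which is the normalization property mentioned above. Scores are accumulated in log space to avoid underflow on long documents.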

Another potential Solution

It took me a while to wrap my head around this problem. The main reason for this was having to worry about maintaining probability normalization. Using a Gaussian Naive Bayes would help ignore this issue entirely.

If you wanted to use this method:

Compute the mean and variance of the tf-idf values of each feature for each class.
Compute the likelihood of each feature value using a Gaussian distribution with the above mean and variance.
Proceed as normal (multiply the likelihoods by the class prior) and predict values.

Hard coding this shouldn't be too hard, since the Gaussian density is a short formula that is easy to write with numpy. I just prefer this kind of generic solution for these types of problems.
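The Gaussian steps above could be sketched roughly as follows. This version uses only the standard library rather than numpy, to match the asker's no-library constraint. The function names, the toy vectors, and the `var_floor` parameter are assumptions of mine: the floor is added because sparse tf-idf features often have zero variance within a class, which would otherwise cause a division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-3):
    """X: list of equal-length tf-idf vectors, y: list of class labels."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor  # floored variance
            for col, m in zip(columns, means)
        ]
        log_prior = math.log(n / len(X))
        model[label] = (log_prior, means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log prior plus Gaussian log likelihood."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)

# Toy tf-idf vectors, invented for illustration only.
X = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.7], [0.0, 0.9]]
y = ["a", "a", "b", "b"]
model = fit_gaussian_nb(X, y)
```

Calling `predict_gaussian_nb(model, [0.85, 0.05])` scores each class and returns the label with the larger log posterior. Because the densities are continuous, there is no need to keep word probabilities normalized over the vocabulary.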

Additional methods to increase accuracy

Apart from the above, you could also use the following techniques to increase accuracy:

Preprocessing:
1. Feature reduction (usually NMF, PCA, or LDA)
2. Additional features

Algorithm:

Naive Bayes is fast, but it often performs worse than other algorithms. It may be better to perform feature reduction and then switch to a discriminative model such as an SVM or logistic regression.

Misc.:

Bootstrapping, boosting, etc. Be careful not to overfit, though.

Hopefully this was helpful. Leave a comment if anything was unclear.

score 2 · Answer 2 · answered Sep 26 '16 at 07:44

P(word|class)=(word_count_in_class+1)/(total_words_in_class+total_unique_words_in_all_classes (basically vocabulary of words in the entire training set))

How would this sum up to 1? If using the above conditional probabilities, I assume the SUM is

P(word1|class)+P(word2|class)+...+P(wordn|class) = (total_words_in_class + total_unique_words_in_class)/(total_words_in_class+total_unique_words_in_all_classes)

To correct this, I think the P(word|class) should be like

(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_classes(vocabulary of words in class))

Please correct me if I am wrong.

score 1 · Answer 3 · answered Jun 22 '18 at 20:33

I think there are two ways to do it:

Round the tf-idf values down to integers, then use the multinomial distribution for the conditional probabilities. See this paper: https://www.cs.waikato.ac.nz/ml/publications/2004/kibriya_et_al_cr.pdf.
Use the Dirichlet distribution, which is a continuous version of the multinomial distribution, for the conditional probabilities.

I am not sure if Gaussian mixture will be better.
