I want to learn a Naive Bayes model for a problem where the class is boolean. Some of the features are boolean, but other features are categorical and can take on a small number of values (~5).
If all my features were boolean then I would want to use sklearn.naive_bayes.BernoulliNB. It seems clear that sklearn.naive_bayes.MultinomialNB is not what I want.
One solution is to split up my categorical features into boolean features. For instance, if a variable "X" takes on values "red", "green", "blue", I can have three variables: "X is red", "X is green", "X is blue". But those indicators are mutually exclusive (exactly one is true), so knowing one of them determines the others. That violates the assumption of conditional independence of the variables given the class, so it seems totally inappropriate.
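The splitting described above can be sketched with sklearn's OneHotEncoder (a hypothetical toy example, not data from the question):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column for the variable "X" with values red/green/blue.
X = np.array([["red"], ["green"], ["blue"], ["red"]])

enc = OneHotEncoder()
X_bool = enc.fit_transform(X).toarray()

# Columns come out in sorted category order: "X is blue", "X is green",
# "X is red". Each row has exactly one 1, so the indicators are mutually
# exclusive -- knowing one pins down the others, which is the dependence
# concern raised above.
```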
Another possibility is to encode the variable as a real-valued feature, where 0.0 means red, 1.0 means green, and 2.0 means blue, and use GaussianNB. That also seems totally inappropriate: the numeric ordering and spacing are arbitrary, and there is no reason the values should be normally distributed within each class.
I don't understand how to fit what I am trying to do into the Naive Bayes models that sklearn gives me.
[Edit to explain why I don't think multinomial NB is what I want]:
My understanding is that in multinomial NB the feature vector consists of counts of how many times a token was observed in k iid samples.
This is a good fit for document classification, where there is an underlying class of document and each word in the document is assumed to be drawn from a categorical distribution specific to that class. A document would have k tokens, the feature vector would be of length equal to the vocabulary size, and the sum of the feature counts would be k.
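The document-classification setup described above can be sketched with a hypothetical toy corpus (my own example, just to show what the count vectors look like):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-class toy corpus: each document becomes a vector of
# token counts over the vocabulary, which is exactly what MultinomialNB models.
docs = ["red red green", "green blue blue", "blue blue green"]
y = [0, 1, 1]

vec = CountVectorizer()
counts = vec.fit_transform(docs)
# counts has shape (n_docs, vocab_size); each row sums to that document's
# token count k, matching the description above.

clf = MultinomialNB().fit(counts, y)
pred = clf.predict(vec.transform(["red red red"]))
```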
In my case, I have a number of Bernoulli variables plus a couple of categorical ones, but there is no concept of "counts" here.
Example: classes are people who like or don't like math. Predictors are college major (categorical) and whether they went to graduate school (boolean).
I don't think this fits multinomial since there are no counts here.
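For concreteness, here is how the example above could be encoded with the one-hot splitting from earlier (hypothetical made-up data; the major values and sample size are my own illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import BernoulliNB

# Hypothetical data: major (categorical), went to grad school (boolean),
# and the class label (likes math).
majors = np.array([["math"], ["history"], ["physics"], ["history"]])
grad = np.array([[1], [0], [1], [0]])
likes_math = np.array([1, 0, 1, 0])

# One-hot encode the major, then append the already-boolean grad column,
# so every feature column is 0/1 and BernoulliNB can consume it.
enc = OneHotEncoder()
major_bool = enc.fit_transform(majors).toarray()
features = np.hstack([major_bool, grad])

clf = BernoulliNB().fit(features, likes_math)
pred = clf.predict(np.hstack([enc.transform([["math"]]).toarray(), [[1]]]))
```

This runs, but it is exactly the splitting approach questioned above: the major indicators are mutually exclusive, so the conditional-independence assumption is violated.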