2

Basically, sklearn has a Naive Bayes implementation with a Gaussian likelihood (GaussianNB), which can classify numeric variables.

However, how do you deal with a data set that contains numeric variables and categorical variables together?

For example, given the dataset below, how can sklearn be trained on the mixed data types together, without discretizing the numeric variables?

+-------+--------+-----+-----------------+
| Index | Gender | Age | Product_Reviews |
+-------+--------+-----+-----------------+
| A     | Female |  20 | Good            |
| B     | Male   |  21 | Bad             |
| C     | Female |  25 | Bad             |
+-------+--------+-----+-----------------+

I mean, for Bayes classification, P(A|B)= P(B|A)*P(A)/P(B).

For categorical variables, P(B|A) is easy to obtain by counting, but numeric variables should follow a Gaussian distribution. So assume we have already obtained P(B|A) from a Gaussian distribution.

Is there any package that can work with both of these together directly?

Please note: this question is not a duplicate of How can I use sklearn.naive_bayes with (multiple) categorical features? or Mixing categorial and continuous data in Naive Bayes classifier using scikit-learn.

That is because this question does not want to do Naive Bayes with dummy variables (the 1st question), and also does not want to do a model ensemble (solution 2 of the 2nd question).

The mathematical algorithm is here: https://tom.host.cs.st-andrews.ac.uk/ID5059/L15-HsuPaper.pdf . It computes the conditional probabilities of numeric variables from a Gaussian distribution instead of by counting, and then makes the classification from all the conditional probabilities together: categorical variables (by counting) and numeric variables (Gaussian distribution).
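To make the idea concrete, here is a small hand-computed sketch of what I mean by the two kinds of conditionals, using the toy table above (the scipy dependency and all variable names are just my own illustration, not from the paper):

    # Sketch only: P(Gender|class) by counting, P(Age|class) as a Gaussian density.
    import numpy as np
    from scipy.stats import norm

    # Class "Bad" covers rows B and C in the table above.
    # Categorical feature: conditional probability by counting.
    p_female_given_bad = 1 / 2                  # one of the two "Bad" rows is Female

    # Numeric feature: conditional density from a Gaussian fitted to the "Bad" ages.
    ages_bad = np.array([21.0, 25.0])
    p_age22_given_bad = norm.pdf(22.0, loc=ages_bad.mean(), scale=ages_bad.std())

    print(p_female_given_bad, p_age22_given_bad)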

hashlash
  • 897
  • 8
  • 19
Jiachen
  • 23
  • 1
  • 7
  • Possible duplicate of [How can I use sklearn.naive\_bayes with (multiple) categorical features?](http://stackoverflow.com/questions/38621053/how-can-i-use-sklearn-naive-bayes-with-multiple-categorical-features) – Ami Tavory Aug 26 '16 at 19:32

1 Answer

2

The answer comes directly from the mathematics of Naive Bayes:

  1. Categorical variables give you log P(a|cat) ~ SUM_i log P(cat_i|a) + log P(a) (I am omitting the division by P(cat), since what the NB implementation returns also ignores it).

  2. Continuous variables give you the same thing, log P(a|con) ~ SUM_i log P(con_i|a) + log P(a) (again omitting the division by P(con), since what the NB implementation returns also ignores it).

Since in Naive Bayes the features are independent, we get, for an x that contains both categorical and continuous features,

log P(a|x) ~ SUM_i log P(x_i|a) + log P(a)
           = SUM_i log P(cat_i|a) + SUM_i log P(con_i|a) + log P(a)
           = [SUM_i log P(cat_i|a) + log P(a)] + [SUM_i log P(con_i|a) + log P(a)] - log P(a)
           = log likelihood from the categorical model + log likelihood from the continuous model - log prior of class a

All of these elements you can read out from your two models, each fitted independently to its part of the data. Notice that this is not an ensemble: you simply fit two models and construct a single one yourself, exploiting the specific assumptions of Naive Bayes. This way you overcome the implementation limitation, yet still efficiently construct a valid NB model over mixed distributions. Note that this works for any set of mixed distributions, so you could do the same with more NB variants (using different distributions).
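A minimal sketch of this construction in code, assuming a scikit-learn version that ships both GaussianNB and CategoricalNB (the toy data, the integer encoding and the column split are mine, for illustration only):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB, CategoricalNB

    # Toy data: one integer-encoded categorical column (Gender) and one numeric column (Age).
    X_cat = np.array([[0], [1], [0]])           # categorical part
    X_num = np.array([[20.0], [21.0], [25.0]])  # continuous part
    y = np.array([1, 0, 0])                     # e.g. Good = 1, Bad = 0

    cat_nb = CategoricalNB().fit(X_cat, y)
    num_nb = GaussianNB().fit(X_num, y)

    # predict_log_proba returns log P(class | part) up to a per-sample constant.
    # Adding the two posteriors and subtracting one copy of the log prior gives
    # log likelihood_cat + log likelihood_con + log prior (up to that constant),
    # i.e. the mixed NB score derived above.
    log_scores = (cat_nb.predict_log_proba(X_cat)
                  + num_nb.predict_log_proba(X_num)
                  - cat_nb.class_log_prior_)

    prediction = cat_nb.classes_[np.argmax(log_scores, axis=1)]
    print(prediction)

The per-sample normalising terms dropped here are identical across classes, so they do not affect the argmax; if you need calibrated posteriors you can renormalise each row of log_scores afterwards.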

lejlot
  • 64,777
  • 8
  • 131
  • 164
  • Thanks. However, may I ask, based on this, how to select features? – Jiachen Aug 27 '16 at 08:45
  • Well, this fully depends on your data and on how you know whether a feature is categorical or not. Sometimes it is easy to decide (string vs. number) and sometimes it is more complex (since numbers are sometimes actually codes for categorical things), and effectively you have to divide them by hand. If your data is in .arff format, the header should give you the feature types. – lejlot Aug 27 '16 at 10:09
  • Well, maybe I should rephrase: how do I make the model better? I mean, if all the data is categorical we can just output the feature importance, but with some categorical and some continuous features, is there any better method or tool for that? – Jiachen Aug 27 '16 at 12:54
  • Feature selection is not the way to "make a model better". This misconception seems to come up surprisingly often. If your problem is not extremely simple, do not use Naive Bayes in the first place; it is not a strong model and it rarely works really well (you need extremely clean, uncorrelated data of a specific type). Instead of trying to merge many extremely simple techniques, it is often better to simply work with a single, strong one. While each "simple" technique is reasonable on its own, there is no guarantee that combining them makes sense. Try joint optimization instead. – lejlot Aug 27 '16 at 14:04