
I am currently working on a Bayesian spam filter. I made a filter using an algorithm, but it will not work for long emails: there are too many values to multiply, and the product falls outside the range of double. I thought about taking only the 10 or 20 most important words (the highest values for both spam and ham) and multiplying only those. I thought about building another Dictionary for this and multiplying the values out of it.

This is how it looks right now:

// Word appears in the spam counts but not in the ham counts.
if (countsWordOccurenceSpam.ContainsKey(word.Key) && !countsWordOccurenceOk.ContainsKey(word.Key))
{
    // Note: ContainsKey followed by TryGetValue looks the key up twice;
    // a single TryGetValue would be enough.
    int spamValue;
    countsWordOccurenceSpam.TryGetValue(word.Key, out spamValue);
    totals = spamValue;

    // Relative frequency of this word among spam messages.
    fprob_spam = ((double)spamValue) / ile_spam;

    // Weighted (smoothed) per-word probabilities.
    sum_spam = ((weight * probability) + (totals * fprob_spam)) / (totals + weight);
    sum_ok = (weight * probability) / (totals + weight);

    // Raise to the occurrence count, then fold into the running
    // products -- this is where the underflow happens for long emails.
    sum_spam = Math.Pow(sum_spam, word.Value);
    sum_ok = Math.Pow(sum_ok, word.Value);

    wp_spam_1 = wp_spam_1 * sum_spam;
    last_o_1 = last_o_1 * sum_ok;
}

This is one part of the algorithm. Now I am thinking about putting all the values from sum_spam into one Dictionary and all the values from sum_ok into another, then using .Take(10) to select the 10 highest values and multiplying only those.
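The top-k idea could be sketched with LINQ like this; `perWordSpamScore` is a hypothetical dictionary holding each word's computed per-word score (the `sum_spam` value before the running multiplication):

```csharp
// Hypothetical: perWordSpamScore maps word -> its per-word spam score.
// Keep only the k highest-scoring entries before multiplying.
var topSpam = perWordSpamScore
    .OrderByDescending(kv => kv.Value)  // highest scores first
    .Take(10)                           // k = 10, as suggested
    .ToDictionary(kv => kv.Key, kv => kv.Value);

double product = 1.0;
foreach (var kv in topSpam)
    product *= kv.Value;
```

Sorting is O(n log n); a partial-selection approach can get the top k in O(n), but for typical email word counts the sort is cheap.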

Does it seem right? I am worried it would be very inefficient. Is there a better way to do it?
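For comparison, the usual way around the underflow (independent of the top-k idea) is to work in log space: since log is monotonic, comparing sums of logarithms gives the same classification as comparing products of probabilities, and the sums stay well inside double's range. A minimal sketch, reusing the question's variable names (`sum_spam`, `sum_ok`, and `word.Value` are assumed to be computed per word as in the code above; `wordCounts` is a hypothetical name for the per-word loop source):

```csharp
// Log-space accumulation: replaces the running products wp_spam_1
// and last_o_1 from the question.
double logSpam = 0.0, logOk = 0.0;

foreach (var word in wordCounts)   // hypothetical per-word loop
{
    // ... compute sum_spam and sum_ok for this word as above ...

    // log(p ^ count) == count * log(p), so Math.Pow plus a running
    // multiplication becomes a simple addition that cannot underflow.
    logSpam += word.Value * Math.Log(sum_spam);
    logOk   += word.Value * Math.Log(sum_ok);
}

// log is monotonic, so comparing the sums is equivalent to comparing
// the original products; the classification decision is unchanged.
bool isSpam = logSpam > logOk;
```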

  • [getting top k of a list can be done pretty efficiently](http://stackoverflow.com/q/10751953/572670). However, I doubt that is really a good solution for spam filtering. Why don't you use standard Machine Learning algorithms for it? – amit May 04 '15 at 11:37
  • I think I am using a standard one? I am implementing one from a book I've bought. It is naive bayesian, checks each word probability and then multiply all of that probabilities. If the total spam probability is higher I am putting the mail into spam. – Ken'ichi Matsuyama May 04 '15 at 11:43
