I am currently working on a Bayesian spam filter. I implemented the algorithm, but it does not work for long emails: there are too many values to multiply, and the product ends up outside the range of a double. I thought about taking only the 10 or 20 most significant values (the highest ones for both spam and ham) and multiplying only those. My idea is to put the values into another Dictionary and then multiply the values taken from it.
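To illustrate the problem, here is a tiny standalone sketch. The class name, the method name, and the numbers are made up for the example and are not part of my filter; it only shows what happens to a running product of many small probabilities.

```csharp
using System;

static class UnderflowDemo
{
    // Multiplies `count` copies of the same per-word probability,
    // similar to how a running product is built up word by word.
    public static double Product(double perWordProbability, int count)
    {
        double product = 1.0;
        for (int i = 0; i < count; i++)
        {
            product *= perWordProbability;
        }
        return product;
    }

    static void Main()
    {
        // Ten factors: tiny, but still a positive double.
        Console.WriteLine(Product(0.01, 10));

        // Two hundred factors: the true value is about 1e-400, which is
        // smaller than any positive double, so the product collapses to 0.
        Console.WriteLine(Product(0.01, 200));
    }
}
```

Once either product has collapsed to 0, the spam and ham scores can no longer be compared in a meaningful way.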
This is how it looks right now:
    // The word is in the spam counts but not in the ham (ok) counts.
    if (countsWordOccurenceSpam.ContainsKey(word.Key) && !countsWordOccurenceOk.ContainsKey(word.Key))
    {
        int spamValue = 0;
        countsWordOccurenceSpam.TryGetValue(word.Key, out spamValue);
        totals = spamValue;
        fprob_spam = ((double)spamValue) / ile_spam;
        sum_spam = ((weight * probability) + (totals * fprob_spam)) / (totals + weight);
        sum_ok = (weight * probability) / (totals + weight);
        sum_spam = Math.Pow(sum_spam, word.Value);
        sum_ok = Math.Pow(sum_ok, word.Value);
        // Running products over all words; these are the values that go out of range on long emails.
        wp_spam_1 = wp_spam_1 * sum_spam;
        last_o_1 = last_o_1 * sum_ok;
    }
This is one part of the algorithm. Now I am thinking about putting all the values of sum_spam into one Dictionary and all the values of sum_ok into another, then using .Take(10) to select the 10 highest values and multiplying them.
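Roughly, the idea could be sketched like this. It is only a sketch under assumptions: TopValues, ProductOfHighest, and the sample words and numbers are hypothetical and stand in for the real sum_spam values. Note that .Take(10) on its own returns the first 10 entries in enumeration order, so the values need to be sorted in descending order first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TopValues
{
    // Hypothetical helper: multiplies the n largest values in the dictionary.
    public static double ProductOfHighest(Dictionary<string, double> valuesByWord, int n)
    {
        return valuesByWord.Values
            .OrderByDescending(v => v)                     // highest values first
            .Take(n)                                       // keep at most n of them
            .Aggregate(1.0, (product, v) => product * v);  // multiply what is left
    }

    static void Main()
    {
        // Made-up per-word values standing in for sum_spam.
        var spamValues = new Dictionary<string, double>
        {
            ["offer"] = 0.9,
            ["free"] = 0.8,
            ["hello"] = 0.2,
            ["meeting"] = 0.1,
        };

        // Keeps 0.9 and 0.8 and multiplies only those two.
        Console.WriteLine(ProductOfHighest(spamValues, 2));
    }
}
```

The sort costs O(n log n) in the number of distinct words per email, and the same call would have to be repeated for the sum_ok values.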
Does this seem right? I suspect it would be very inefficient. Is there a better way to do this?