41

I'm new to machine learning, and for my first project I'd like to write a naive Bayes spam filter. I was wondering if there are any publicly available training sets of labeled spam/not spam emails, preferably in plain text and not a dump of a relational database (unless they pretty-print those?).

I know such a publicly available database exists for other kinds of text classification, specifically news article text. I just haven't been able to find the same sort of thing for emails.

JeremyKun
  • 3
    If you're in 2011 with us, just check out your spam box at Gmail. Should be a pretty consistent source of spam emails. ;) – coreyward Jan 20 '11 at 06:31
  • My Gmail account only has about 50 spam messages in it, and each message is deleted after 30 days. Surprisingly, I don't get a lot of spam to begin with. – JeremyKun Jan 22 '11 at 04:07

6 Answers

34

Here is what I was looking for: http://untroubled.org/spam/

This archive holds around a gigabyte of compressed spam messages accumulated from 1998 to 2011. Now I just need non-spam email, so I'll pull it from my own Gmail account using the getmail program and the tutorial at mattcutts.com.
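Once both spam and non-spam messages are collected, the classifier itself can be quite small. The following is only a minimal sketch of a multinomial naive Bayes filter with add-one smoothing, not a tuned implementation. The tokenizer, the `NaiveBayes` class, the label names, and the training messages are all made up for illustration, and each class is assumed to have at least one training message.

```python
import math
import re
from collections import Counter

# Crude tokenizer (an assumption): runs of letters, digits, and a few symbols.
TOKEN_RE = re.compile(r"[a-z0-9$!']+")


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        # log P(label) + sum over tokens of log P(token | label)
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.word_counts[label]
        denom = sum(counts.values()) + len(self.vocab)
        for token in tokenize(text):
            score += math.log((counts[token] + 1) / denom)
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))


# Hypothetical toy training data, standing in for real messages.
model = NaiveBayes()
model.train("win free money now", "spam")
model.train("free prize click now", "spam")
model.train("meeting at noon tomorrow", "ham")
model.train("lunch notes attached", "ham")
print(model.classify("free money"))
```

Working in log space avoids numeric underflow when a message contains many tokens, and the smoothing keeps an unseen word from zeroing out a class.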

JeremyKun
10

Sure, there's Spambase, which, as far as I'm aware, is the most widely cited spam data set in the machine learning literature.

I have used this data set many times; each time I am impressed by how much effort has been put into its formatting and documentation.

A few characteristics of the Spambase set:

  • 4601 data points--all complete

  • each comprising 58 attributes (57 features plus the class label)

  • each data point is labelled 'spam' or 'no spam'

  • approx. 40% are labeled spam

  • of the features, all are continuous (vs. discrete)

  • a representative feature: average length of uninterrupted sequences of capital letters


Spambase is archived in the UCI Machine Learning Repository; it's also available on the website for the excellent ML/statistical computation treatise Elements of Statistical Learning by Hastie et al.
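As a rough sketch of how the data can be read: this assumes the UCI download is a headerless comma-separated file (named `spambase.data` here) in which each row holds the numeric features followed by a final class column, with 1 meaning spam and 0 meaning not spam. The `parse_spambase` helper and the three-column sample rows are invented for illustration; real rows are much wider.

```python
import csv
import io


def parse_spambase(lines):
    """Split comma-separated rows into (features, labels).

    The last column is taken as the class label (1 = spam, 0 = not spam);
    every other column is read as a float feature.
    """
    features, labels = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        *values, label = row
        features.append([float(v) for v in values])
        labels.append(int(label) == 1)
    return features, labels


# Made-up rows with three features each, only to show the shape of the output.
sample = io.StringIO("0.1,0.2,3,1\n0,0,1,0\n")
features, labels = parse_spambase(sample)
print(features, labels)
```

For the real file, the same function can be given an open file handle, for example `with open("spambase.data", newline="") as f: X, y = parse_spambase(f)`.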

doug
  • 2
    This doesn't appear to actually have the email text in it, but rather a count of a particular set of words. Maybe I'm missing where to find the content? – JeremyKun Jan 22 '11 at 04:15
  • No email text? Look again at the 58 features that comprise the data set--most of them are derived entirely from the email text. Raw email text will require careful parsing into features before you can use it in a Naive Bayes classifier. – doug Jan 22 '11 at 07:53
  • 3
    Right, but I want the raw text so I can decide which features are relevant. This is a learning experience, so I want to do it from scratch. – JeremyKun Jan 22 '11 at 19:59
  • Whoa, that Spambase archive is dated 1999-07-01, which is quite a bit older than the ancient [SpamAssassin public corpus](https://spamassassin.apache.org/publiccorpus/) (2002-2005). Spam has changed quite a bit since then! – Adam Katz Mar 15 '16 at 02:14
8

SpamAssassin has a public corpus of both spam and non-spam messages, although it hasn't been updated in a few years. Read the readme.html file to learn what's there.

ViennaMike
6

You might consider taking a look at the TREC spam/ham corpus (which I think is the collection of emails from Enron that was made public from the court case). TREC generally runs a bunch of competitive text processing tasks, so it might give you some references for comparison.

The downside is that they're stored in raw mbox format, though there are parsers available in many languages (Apache Tika is a good example).
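If the data does arrive as an mbox file, Python's standard `mailbox` module can read it without extra dependencies. Here is a minimal sketch, assuming a single mbox file on disk; the `iter_plain_text` helper is hypothetical, and it only collects `text/plain` parts, ignoring HTML and attachments.

```python
import mailbox


def iter_plain_text(mbox_path):
    """Yield (subject, body) for each message in an mbox file.

    Only text/plain parts are collected; other parts are skipped.
    """
    box = mailbox.mbox(mbox_path)
    try:
        for message in box:
            chunks = []
            # walk() visits the message itself and any nested MIME parts.
            for part in message.walk():
                if part.get_content_type() != "text/plain":
                    continue
                payload = part.get_payload(decode=True)
                if payload is None:
                    continue
                charset = part.get_content_charset() or "utf-8"
                try:
                    chunks.append(payload.decode(charset, errors="replace"))
                except LookupError:
                    # Unknown charset name in the header: fall back to UTF-8.
                    chunks.append(payload.decode("utf-8", errors="replace"))
            yield message["subject"], "\n".join(chunks)
    finally:
        box.close()
```

Spam often carries malformed headers and bogus charsets, which is why the decoding step replaces bad bytes rather than raising.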

The webpage isn't TREC, but this seems to be a good overview of the task with links to the data: http://plg.uwaterloo.ca/~gvcormac/spam/

  • This is good, and since posting my question I've realized that it's hard to get around using mbox format for email dumps. Anyhow, I've found some data, and decided it's easier to just classify something else (web scraping yelp comments to classify positivity, actually). – JeremyKun Jan 29 '11 at 05:15
4

A more modern spam training set can be found on Kaggle. Moreover, you can test the accuracy of your classifier on their website by uploading your results.

warmspringwinds
2

I also have an answer: here you can find a daily refreshed Bayesian database for initial training, along with a daily archive of captured spam. Instructions for using it are on the site.

Frantique