0

I am trying to build a homemade spam filter and want to write a regular expression that matches the following pattern. How can I do that? Thanks.

UBmDNFZGrvtbFtxWMq

but not strings like these, which contain spaces or digits:

$800

Not Sure

I have a form for user feedback, and I am trying to detect spam messages. I tried Google's reCAPTCHA web service, but its difficulty level seems high and I don't like that: I think users who get it wrong the first time may give up on submitting. I also tried some spam filter web services, but it looks like the user's message would be sent to their servers, and I don't feel comfortable with that.

So I came up with the idea of building a pattern-matching function to validate some of the form's input values. This question covers one of the patterns I want to match.

easycoder

1 Answer

6

I wouldn't bother trying to make a spam filter. This problem has already been solved well by many others such as SpamAssassin.

However, a solution might look something like this regular expression, which detects a long string of letters:

/\b[A-Za-z]{18,}\b/
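As a rough illustration, here is the same pattern tried against the three strings from the question. Python is used purely for convenience, and the function name `has_long_letter_run` is made up for this sketch; the pattern itself carries over to other regex flavours.

```python
import re

# The pattern above: a run of 18 or more ASCII letters between word boundaries.
LONG_LETTER_RUN = re.compile(r"\b[A-Za-z]{18,}\b")


def has_long_letter_run(text: str) -> bool:
    """Return True if the text contains a word made of 18+ ASCII letters."""
    return LONG_LETTER_RUN.search(text) is not None


if __name__ == "__main__":
    for sample in ["UBmDNFZGrvtbFtxWMq", "$800", "Not Sure", "characteristically"]:
        print(repr(sample), has_long_letter_run(sample))
```

Note that an ordinary 18-letter word such as "characteristically" also matches, which is the false-positive problem discussed next.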

A refinement to avoid false matches on legitimate 18-letter words is to check for something that rarely occurs in normal words, such as a capital letter occurring after a lowercase letter:

/\b(?=[A-Za-z]*[a-z][A-Z])[A-Za-z]{18,}\b/

This might still give some false matches (the name "SpamAssassin" for example is just a few letters short of matching this regular expression). It will work correctly for the examples you provided and most ordinary text - but not so well for code examples.
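A minimal sketch of the refinement, assuming the "capital after a lowercase letter" check is written as a lookahead so that the whole word still only needs 18 letters. The lookahead form and the name `looks_like_random_letters` are choices made for this sketch, not something the thread settles.

```python
import re

# 18+ ASCII letters that contain at least one lowercase letter
# immediately followed by an uppercase letter somewhere in the word.
MIXED_CASE_RUN = re.compile(r"\b(?=[A-Za-z]*[a-z][A-Z])[A-Za-z]{18,}\b")


def looks_like_random_letters(text: str) -> bool:
    """Return True if the text contains a long word with a lower-to-upper transition."""
    return MIXED_CASE_RUN.search(text) is not None


if __name__ == "__main__":
    samples = ["UBmDNFZGrvtbFtxWMq", "$800", "Not Sure",
               "characteristically", "SpamAssassin"]
    for sample in samples:
        print(repr(sample), looks_like_random_letters(sample))
```

With this form, the random string from the question matches, while "characteristically" (no lower-to-upper transition) and "SpamAssassin" (only 12 letters) do not.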

Spam detection generally uses many more sophisticated techniques that can't be replicated using regular expressions alone. It might be better to look at other metrics, such as the frequency of each letter, and to check whether the word is found in a dictionary. Often no single technique gives good results: a combination of techniques is required, with a score for each. If an email triggers too many of the high-scoring rules then it is marked as spam, but if it only hits a few of the low-scoring ones then it might be acceptable. The scoring system could be made user-configurable.
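To make the scoring idea concrete, here is a toy sketch. Everything in it is an assumption for illustration: the rule weights, the threshold of 3.0, the tiny `KNOWN_WORDS` set standing in for a real dictionary, and the vowel ratio used as a crude substitute for a proper letter-frequency test. It is not how SpamAssassin or any real filter scores messages.

```python
import re

# Hypothetical stand-in for a real dictionary lookup.
KNOWN_WORDS = {"not", "sure", "hello", "thanks", "characteristically"}
VOWELS = set("aeiou")


def vowel_ratio(word: str) -> float:
    """Fraction of letters that are vowels (a crude letter-frequency signal)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def score_word(word: str) -> float:
    """Add up the weights of every rule this word triggers."""
    score = 0.0
    if len(word) >= 18:                                  # very long letter run
        score += 2.0
    if re.search(r"[a-z][A-Z]", word):                   # capital after lowercase
        score += 1.5
    if len(word) >= 6 and vowel_ratio(word) < 0.2:       # unusually few vowels
        score += 1.0
    if len(word) >= 6 and word.lower() not in KNOWN_WORDS:  # not in dictionary
        score += 0.5
    return score


def spam_score(text: str) -> float:
    """Score a message by its most suspicious word."""
    words = re.findall(r"[A-Za-z]+", text)
    return max((score_word(w) for w in words), default=0.0)


def is_spam(text: str, threshold: float = 3.0) -> bool:
    """The threshold is the user-configurable part of the scheme."""
    return spam_score(text) >= threshold
```

Taking the highest-scoring word, rather than summing over all words, is one possible design choice: it keeps a long but legitimate message from accumulating many small penalties. Under these made-up weights, "UBmDNFZGrvtbFtxWMq" trips all four rules, while "SpamAssassin" trips only two low-weight ones and stays under the threshold.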

Edit: Regarding the update to the question, since this is for data entry on a web form, one of the standard approaches to preventing spam is to use a CAPTCHA such as reCAPTCHA.

Mark Byers
  • I've given this +1. However, where you say spam detection is generally more sophisticated than regex, you're only partly right: SpamAssassin, which you linked to, uses several methods of detection, but that includes a whole bunch of regular expressions, which the user can add to using a config file. – Spudley Jan 31 '11 at 21:12
  • @Spudley: Thanks for your comment - I have tried to improve the wording to make the intent clearer - I hope it's better now. PS: I'm actually aware that SpamAssassin uses regular expressions for many of its rules, and in fact SpamAssassin even kindly demonstrated why regular expressions aren't always the best approach: http://stackoverflow.com/questions/2007252/what-is-causing-the-2010-bugs/2007328#2007328 – Mark Byers Jan 31 '11 at 21:24
  • I tried reCAPTCHA, but I think the difficulty level is rather high for general users. – easycoder Jan 31 '11 at 21:38
  • @easycoder: reCAPTCHA is not the only CAPTCHA product - there are some other competing products that are easier for humans to answer. Unfortunately though if they are easier for humans to answer they are often also easier for machines to crack. – Mark Byers Jan 31 '11 at 22:07