2

I'm looking for a regex that matches valid email addresses (it doesn't need to be some whopping RFC-compatible job) as well as the disguised addresses people write to trick the system.

Examples of things I want to catch:

  • blah@blah.com
  • blah@blah.org
  • blah@blah.weirdtld
  • blah [ AT ] blah.com
  • blah[at]blah.com
  • blah@blah[ DOT]com
  • blah@blah[ dot ].com
  • etc.

I'm sure someone out there has published a tried-and-tested expression covering all the known permutations, but if they have, I can't find it, and I would love to see it.

I don't care if it catches domains by accident, as they are being stripped anyway.

A real-world example of what this could be used for is eBay. A seller wants to put "Contact me on: bob@example.com for a cheaper price" in their description so that they don't have to pay listing fees. I want to catch that address, regardless of how it is written.

I appreciate it's impossible to check everything, and this is not a replacement for human intervention (which is also a part of the validation process already, I'm just trying to make their lives easier).

I have already searched StackOverflow and Google, but unfortunately it's one of those problems which can be difficult to search for. If anyone has a link to a solution I would be very grateful.

Edit: Just to clarify even more. This is NOT to be used to check if an email address is valid or not. This is to be used to stop people entering valid email addresses AND email addresses with common substitutions into a textarea ([at] for @, [dot] for ., (d0t) for ., and so on and so forth).

Mike
  • A regex isn't going to catch this. In fact, there is no foolproof way to do so other than sending them an email and forcing them to reply/activate. – John Conde Dec 02 '13 at 16:14
  • This question has had a lot of discussion [here](http://stackoverflow.com/questions/46155/validate-email-address-in-javascript). – ajp15243 Dec 02 '13 at 16:15
  • @JohnConde this is not for a sign-up email. It's to stop people entering contact information into a 'Description' field which is publicly viewable. It will never be foolproof, but something is better than nothing in this instance. – Mike Dec 02 '13 at 16:27
  • @ajp15243 not really - that's for validating email addresses. I'm looking for a way to match valid email addresses (which your link discusses) and email addresses with @'s and .'s substituted, and what common substitutions may be. – Mike Dec 02 '13 at 16:31
  • @mikemike I should have said a *similar* question has discussion, then, as I meant that link more as a reference than indicating that it is a duplicate. – ajp15243 Dec 02 '13 at 16:35
  • I don't really understand the downvotes that the question poster is getting - I think it's quite a reasonable request, fully accepting that there is by no means a perfect solution out there but evidently there are many such solutions in use on a lot of popular websites. – BrynJ Dec 02 '13 at 16:58

2 Answers

0

I guess if even heavy spammers haven't found an easy way to overcome this problem, you won't have much luck here, either.

There are several reasons why it's a suicidal task to try to devise an algorithm for this, but the main one is human creativity vs. machine stupidity.

  1. There are literally infinite ways to camouflage an email address, for example test @ domain.com (remove spaces) or test[d0t]again atsign domain[.com]. It took me 2 seconds to think of them, and you can surely decode them without any issue.

  2. Even if you could list every possible alternative (which is an inhuman task anyway), somebody else will design a different scheme to hide their email contact (example: placing the email address inside an inline image).
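
To illustrate point 1, here is a minimal Python sketch. The `BRACKETED` pattern and the `samples` list are hypothetical, made up for this illustration; the sample strings come from the question and from the examples above. It suggests how quickly a pattern written for one style of substitution stops matching once the writer varies the scheme slightly.

```python
import re

# Hypothetical filter (an assumption for illustration): it only catches
# bracketed "at"/"dot" tokens such as [at], [ AT ] or [DOT].
BRACKETED = re.compile(r"\[\s*(?:at|dot)\s*\]", re.IGNORECASE)

samples = [
    "blah[at]blah.com",                    # uses a bracketed token
    "test[d0t]again atsign domain[.com]",  # d0t, atsign and [.com] are not covered
    "test @ domain.com",                   # no bracketed token at all
]

for text in samples:
    caught = bool(BRACKETED.search(text))
    print(f"{text!r}: {'caught' if caught else 'missed'}")
```

Each variant could be patched in individually, but the next poster only needs one new spelling to slip past again.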

Just for comparison, here is the best regex out there to simply detect valid email addresses; it covers every RFC 822 case.

STT LCU
  • This is NOT for an email address field. It is to stop people entering contact information in a text area. Hence, the RFC is pretty pointless, as a far simpler regex will grab valid email addresses; it's the invalid ones with [at], etc. which I need. It will never be perfect, but a handful of the most-used permutations in a regex format is what I am looking for. – Mike Dec 02 '13 at 16:29
  • @mikemike I know that it isn't to validate an email field. It's still impossible to detect camouflaged email addresses: see my points 1 and 2 to know why I say so. – STT LCU Dec 02 '13 at 16:33
  • I get that and agree, but there must be something to match the most common ones. I by no means expect to grab everything. There are many places I have seen this in use (eBay being the example used above) so it must be a common issue. – Mike Dec 02 '13 at 16:36
0

See: How to Find or Validate an Email Address.

Excerpt:

...there's often a trade-off between what's exact, and what's practical.

The virtue of my regular expression above is that it matches 99% of the email addresses in use today. All the email address it matches can be handled by 99% of all email software out there. If you're looking for a quick solution, you only need to read the next paragraph. If you want to know all the trade-offs and get plenty of alternatives to choose from, read on.

To catch expressions that are likely aliases for an email address, just do a second test for [AT], [ at ], [DOT], etc. For example, here is a regex that does just that (the i modifier tells Perl to ignore case):

/\[\s*(AT|DOT)\s*\]/i
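
As a rough sketch of the two-test idea, here is how it might look in Python. The names `PLAIN_EMAIL`, `ALIAS` and `looks_like_contact_info` are hypothetical, and both patterns are assumptions written for this example rather than taken from the linked article: the first is a deliberately simple, non-RFC address pattern, and the second widens the alias test to a few bracket styles and the `d0t` spelling mentioned in the question.

```python
import re

# Assumption: a deliberately simple, non-RFC pattern for ordinary addresses.
PLAIN_EMAIL = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE
)

# Assumption: bracketed aliases such as [AT], [ at ], (dot) or {d0t}.
ALIAS = re.compile(r"[\[({]\s*(?:at|dot|d0t)\s*[\])}]", re.IGNORECASE)


def looks_like_contact_info(text: str) -> bool:
    """Return True if the text contains a plain address or a bracketed alias."""
    return bool(PLAIN_EMAIL.search(text) or ALIAS.search(text))


for sample in (
    "Contact me on: bob@example.com for a cheaper price",
    "blat [ AT ] blah.com",
    "blah@blah[ DOT]com",
    "Nice vintage lamp, barely used",
):
    print(sample, "->", looks_like_contact_info(sample))
```

The alias test will also flag innocent text such as "look (at) this", so it is probably best used to queue descriptions for human review rather than to reject them outright.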
DavidRR