
I am trying to create three PHP regular expressions that do the following:

  1. Get emails, e.g. mr.jones@apple-land.com
  2. Get dates, e.g. 31/05/90 or 31-Jun-90
  3. Get nameservers, e.g. ns1.apple.co.uk

I have a big chunk of text and want to extract these things from it.

What I have so far is:

    $regexp = '/[A-Za-z0-9\.]+[@]{1}[A-Za-z0-9\.]+[A-Za-z]{2,4}/i';
    preg_match_all($regexp, $output, $email);

    $regexp = '/[A-Za-z0-9\.]+[^@]{1}/i';
    preg_match_all($regexp, $output, $nameservers);

    $regexp = '/[0-9]{2,4}[-\/]{1}([A-Za-z]{3}|[0-9]{2})[-\/]{1}[0-9]{2,4}/i';
    preg_match_all($regexp, $output, $dates);

Dates and emails work, but I don't know if that is an efficient way to do it.

Nameservers just don't work at all. Essentially, I want to find any combination of letters and numbers separated by dots, but without @ symbols.

Any help would be greatly appreciated.

Thanks

Felix Kling
Thomas Clowes

4 Answers

2

Regexes for emails are fairly complex. This is one place where frameworks shine. Most of the popular ones have validation components you can use to solve these problems. I'm most familiar with Zend Framework validation, and Symfony2 and CakePHP also provide good solutions. These solutions are often written against the relevant RFC and support things that programmers tend to overlook, such as the fact that + is valid in an email address. They also guard against common mistakes. For example, your email regex currently accepts an address like .@.qt, which is not valid.

Some may argue that using a framework to validate an email or hostname (which can have a - in it as well) is overkill. I feel it is worth it.
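
As a lighter-weight illustration of the same idea (validate with a maintained component rather than a hand-written regex), here is a minimal sketch that uses PHP's built-in `filter_var` with `FILTER_VALIDATE_EMAIL` instead of the framework validators mentioned above. The function name `extract_valid_emails` and the loose candidate pattern are assumptions made for this example, not part of any framework.

```php
<?php
// Sketch: pull loose candidates out of free text, then let PHP's
// built-in validator decide which ones are acceptable addresses.
// extract_valid_emails() is a hypothetical helper name.
function extract_valid_emails(string $text): array
{
    // Deliberately permissive: runs of characters around a single "@",
    // stopping at whitespace and a few common separators.
    preg_match_all('/[^\s@<>,;]+@[^\s@<>,;]+/', $text, $matches);

    $valid = [];
    foreach ($matches[0] as $candidate) {
        // Drop a trailing full stop picked up at the end of a sentence.
        $candidate = rtrim($candidate, '.');
        if (filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false) {
            $valid[] = $candidate;
        }
    }
    return $valid;
}

$output = 'Mail mr.jones@apple-land.com or bob+lists@example.org, but not .@.qt please.';
print_r(extract_valid_emails($output));
```

With this input the sketch keeps mr.jones@apple-land.com and bob+lists@example.org and rejects .@.qt.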

H Hatfield
  • Well, it kinda is overkill, when you can grab the relevant function(s) from the source code, and not use the entire framework. – Zirak Aug 12 '11 at 15:37
  • I agree that it may not be the best solution for every case. Most frameworks are designed so individual components can be used without needing to use the other parts of it. – H Hatfield Aug 12 '11 at 15:45
0

For the nameservers I would suggest using: /[^.](\.[a-z_\d]+){3,}/i
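
A quick way to see what this pattern captures is to run it against the hostname from the question. This is only a sketch for checking behaviour; the variable names are chosen for the example. Because `[^.]` consumes exactly one character, only the last character of the first label ends up in the match.

```php
<?php
// The suggested pattern, applied to the question's example hostname.
$pattern = '/[^.](\.[a-z_\d]+){3,}/i';
preg_match_all($pattern, 'ns1.apple.co.uk', $matches);

// $matches[0] holds the full-pattern matches.
print_r($matches[0]); // contains "1.apple.co.uk", not "ns1.apple.co.uk"
```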

lugte098
  • You probably mean `/[a-z0-9]+\.[a-z0-9]+/i` – Madara's Ghost Aug 12 '11 at 14:43
  • No, I mean what I said in my answer. If you have "ns1.apple.co.uk", for example, then your regex won't match it: yours can only match "ns1.apple". But now that I take a good look at my code, I have to change it too :P – lugte098 Aug 12 '11 at 14:55
  • The point of my comment is that you forgot to escape your `.` to `\.`. In fact, I would use `([\w\d]+\.){2,3}[\w\d]+` for nameservers. – Madara's Ghost Aug 12 '11 at 14:57
  • Yeah, sorry, someone edited my post, so I didn't see that it was not escaped :P I agree with most of what you said, just not the \w, since you cannot rely completely on what it matches. – lugte098 Aug 12 '11 at 15:07
  • That returns any and all words as far as my tests seem to suggest *confused* – Thomas Clowes Aug 12 '11 at 15:23
0

essentially I want to find any combinations of letters and numbers which have dots in between but not @ symbols..

Regexp for finding all letters and numbers which have dots in between:

    $regexp = '/[A-Za-z0-9]{1,}(\.[A-Za-z0-9]{1,}){1,}/i';

Please note that you don't have to exclude '@' explicitly: the character classes in the pattern don't contain it, so a match can never include an @.
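
One caveat is worth checking in practice: although a match never contains an @, the pattern can still pick up the dotted pieces on either side of one. The sketch below runs the pattern on a short sample and then tries a guarded variant. The guarded pattern, with its lookbehind and lookahead and the added hyphen in the character class, is an assumption of this example and not part of the answer.

```php
<?php
$output = 'ns1.apple.co.uk and mr.jones@apple-land.com';

// Pattern from the answer: dotted runs of letters and digits.
$plain = '/[A-Za-z0-9]{1,}(\.[A-Za-z0-9]{1,}){1,}/i';
preg_match_all($plain, $output, $m);
$plainMatches = $m[0];
// Also matches "mr.jones" and "land.com", the pieces around the "@".

// Hypothetical guarded variant: a match must not be preceded by a
// letter, digit, "@", "." or "-", nor followed by a letter, digit,
// "@" or "-". Hyphens are allowed inside labels.
$guarded = '/(?<![A-Za-z0-9@.-])[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+(?![A-Za-z0-9@-])/';
preg_match_all($guarded, $output, $m);
$guardedMatches = $m[0];

print_r($plainMatches);
print_r($guardedMatches); // only "ns1.apple.co.uk" remains
```

The guarded variant still accepts any dotted token (for example a version number), so it narrows the results rather than proving that something is a nameserver.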

0

I would recommend using different patterns for your examples:

  • [\w\.-]+@[\w\.-]+\.[a-zA-Z]{2,4} for emails (the domain part allows hyphens and dots, so apple-land.com matches).
  • \d{1,2}[/-][\da-zA-Z]{1,3}[/-]\d{2,4} for dates.
  • ([a-zA-Z\d]+\.){2,3}[a-zA-Z\d]+ for nameservers.
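
A minimal sketch of how patterns like these could be applied with `preg_match_all`, assuming the text is in `$output` as in the question. The helper `find_all` is a hypothetical name. The email pattern in the sketch uses `[\w.-]+` for the domain part so that the hyphenated apple-land.com matches, `~` serves as the delimiter so the `/` in the date pattern needs no escaping, and the nameserver group is written as non-capturing.

```php
<?php
// Hypothetical helper: run one pattern over the text and return
// only the full matches.
function find_all(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$output = 'Contact mr.jones@apple-land.com, registered 31/05/90, '
        . 'renewed 31-Jun-90. Nameserver: ns1.apple.co.uk';

$emails      = find_all('~[\w.-]+@[\w.-]+\.[a-zA-Z]{2,4}~', $output);
$dates       = find_all('~\d{1,2}[/-][\da-zA-Z]{1,3}[/-]\d{2,4}~', $output);
$nameservers = find_all('~(?:[a-zA-Z\d]+\.){2,3}[a-zA-Z\d]+~', $output);

print_r($emails);      // mr.jones@apple-land.com
print_r($dates);       // 31/05/90 and 31-Jun-90
print_r($nameservers); // ns1.apple.co.uk
```

These patterns only check the shape of the text: the date pattern, for instance, accepts 31-Jun-90 even though June has 30 days, so any real validation has to happen after extraction.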

Good luck ;)

Madara's Ghost