I have a regex email pattern and would like to strip every character from the string that the pattern does not match. In short, I want to sanitize the string.

I'm not a regex guru, so what am I missing in the regex?

<?php

$pattern = "/^([\w\!\#$\%\&\'\*\+\-\/\=\?\^\`{\|\}\~]+\.)*[\w\!\#$\%\&\'\*\+\-\/\=\?\^\`{\|\}\~]+@((((([a-z0-9]{1}[a-z0-9\-]{0,62}[a-z0-9]{1})|[a-z])\.)+[a-z]{2,6})|(\d{1,3}\.){3}\d{1,3}(\:\d{1,5})?)$/i";

$email = 'contact<>@domain.com'; // wrong email

$sanitized_email = preg_replace($pattern, NULL, $email);

echo $sanitized_email; // Should be contact@domain.com

?>

Pattern taken from: http://fightingforalostcause.net/misc/2006/compare-email-regex.php (the very first one...)

Otar
  • No need to escape `!`, `#`, `$`, `%`, `&`, `'`, `*`, `+`, `=`, `?`, `\``, `{`, `|`, `}`, or `~` inside character classes; only `]`, `\`, and depending on the position `-` and `^` as well need to be escaped. – Gumbo Feb 06 '11 at 15:09
  • Don't guess. If the email matches the pattern, it's valid. If it doesn't, tell the user. – aaz Feb 12 '11 at 11:45
  • I think it should be clearly said in the title: string sanitization. I don't need validation with preg_match, but to sanitize wrong email to the correct one. – Otar Feb 12 '11 at 11:51
  • Your linked example regex can be used for validation, not for sanitization. While of course you could sanitize an email address, you cannot "correct" them. If you filter invalid characters, then `contact<>@domain.com` can be fixed. But input like `contact@!name-@name+` cannot with **only** a char sanitization regex. So, character filtering is possible, but structure correction at the same time is not (with typical regex constructs). – mario Feb 14 '11 at 15:36

2 Answers

You cannot filter and match at the same time. You'll need to break it up into a character class for stripping invalid characters and a matching regular expression which verifies a valid address.

$email = preg_replace($filter, "", $email);
if (preg_match($verify, $email)) {
     // ok, sanitized
     return $email;
}

For the first case, you want to use a negated character class /[^allowedchars]/.
For the second part you use the structure /^...@...$/.
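
The two steps could be combined roughly as follows. This is only a minimal sketch: the function name `sanitize_email` is hypothetical, the allowed set (word characters plus the specials from the question's pattern, `@`, and `.`) is an assumption, and the structure check is deliberately simplified rather than RFC-complete.

```php
<?php
// Hypothetical helper: strip disallowed characters, then verify the structure.
function sanitize_email($email)
{
    // Step 1: a negated character class removes everything outside the allowed set.
    // The allowed set here is an assumption; adjust it to your own rules.
    $filter = '/[^\w!#$%&\'*+\/=?^`{|}~@.-]/';
    $clean = preg_replace($filter, '', $email);

    // Step 2: a simplified structural check of the form local@domain.tld.
    // This is not a full RFC 5322 validator.
    $verify = '/^[^@\s]+@[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,}$/i';

    // Return the cleaned address only if its structure is acceptable.
    return preg_match($verify, $clean) === 1 ? $clean : null;
}

var_dump(sanitize_email('contact<>@domain.com'));
var_dump(sanitize_email('contact@!name-@name+'));
```

With this sketch, `contact<>@domain.com` comes back as `contact@domain.com`, while `contact@!name-@name+` yields `null`: all of its characters are allowed, so stripping cannot repair it, and the structure check rejects it.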

Have a look at PHP's filter extension. It uses ``const unsigned char allowed_list[] = LOWALPHA HIALPHA DIGIT "!#$%&'*+-=?^_`{|}~@.[]";`` for cleansing.

And there is the monster for validation: line 525 in http://gcov.php.net/PHP_5_3/lcov_html/filter/logical_filters.c.gcov.php - but check out http://www.regular-expressions.info/email.html for a more common and shorter variant.

mario
  • Actually, Mario, [this](http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes/4843579#4843579) is **the** validation monster, one which gets zero false negatives and zero false positives on any valid RFC 5322 address. It’s also a lot easier to read, write, debug, and maintain than the one cited. – tchrist Feb 06 '11 at 17:00
  • @tchrist. Cool. But your example is a Perl regex - that's like cheating! ha. And it appears way more correct than the PHP-internal regex (which does not quite cover the rfc). – mario Feb 06 '11 at 17:09
  • @mario: If you use the preg stuff, it should be fine. I don’t know which version of PCRE PHP is linked against; I’ve heard it typically doesn’t include Unicode support, which is a crying shame. But the pattern itself is a PCRE pattern, not strictly a Perl one. – tchrist Feb 06 '11 at 17:39
  • @tchrist: The Unicode support should be there by default, but it's quite outdated nevertheless. `PCRE_VERSION` gives me `7.8` released in 2008. And this one is still bundled with PHP 5.3.3 ಠ_ಠ - I'm looking into pcrepattern.3 at the moment, of which I never had heard before. – mario Feb 06 '11 at 17:46
  • @mario: Yes, *pcrepattern* is marvelous. – tchrist Feb 06 '11 at 17:47
  • Thank you @mario for the reply. All I need to do is remove all not-allowed characters (sanitize) but no validation. I know about PHP filters but the regex that's used to validate email isn't "perfect". It validates emails like name@domain, the regex I've posted is much more reliable for me. – Otar Feb 07 '11 at 06:53
I guess PHP's filter_var function can also do this, and in a cleaner way. Have a look at: http://www.php.net/manual/en/function.filter-var.php

example:

 $email = "chris@exam\\ple.com";
 $cleanEmail = filter_var($email, FILTER_SANITIZE_EMAIL);  // chris@example.com
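
If you also want to confirm that the cleaned string is a plausible address, the sanitize filter can be followed by `FILTER_VALIDATE_EMAIL`. The sketch below assumes that is acceptable for your use case; the wrapper name `clean_and_check_email` is hypothetical, while `filter_var` and both filter constants come from PHP's filter extension.

```php
<?php
// Hypothetical wrapper around PHP's built-in filters.
function clean_and_check_email($email)
{
    // Remove every character the sanitize filter does not allow.
    $clean = filter_var($email, FILTER_SANITIZE_EMAIL);
    if ($clean === false) {
        return null;
    }

    // Validate the structure of the cleaned address.
    return filter_var($clean, FILTER_VALIDATE_EMAIL) !== false ? $clean : null;
}

var_dump(clean_and_check_email("chris@exam\\ple.com"));
var_dump(clean_and_check_email('contact<>@domain.com'));
```

Note that sanitizing only removes characters. An input whose structure is broken still fails the validation step and returns `null` here.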
DhruvPathak