Find first position where pattern matching failed.

Question

I am trying to find the common errors users make while entering email IDs. I can always validate an email address using the PHP email filter:

$email = "someone@exa mple.com";

if(!filter_var($email, FILTER_VALIDATE_EMAIL))
{
  echo "E-mail is not valid";
}
else
{
 echo "E-mail is valid";
}

or pattern matching

$email = test_input($_POST["email"]);
if (!preg_match("/([\w\-]+\@[\w\-]+\.[\w\-]+)/",$email))
{
  $emailErr = "Invalid email format"; 
}

I agree that these are not foolproof ways to validate emails. However, they should capture 80% of cases.

What I want to know is: at which position did the email become invalid? If it is a space, at what position did the user enter the space? Or did it fail because of a "." at the end?

Any pointers?

-Ajay

PS: I have seen other threads regarding email validation. I could add complexity and make it 100%; my concern here is to capture the most common mistakes people make when entering an email ID.

The PHP regex engine doesn't allow introspection, so you can't tell where in the string a regex match failed. Also note that that position doesn't have to be the position where the error is (just the position where the regex engine could definitely figure out that an overall match would be impossible). So I guess this approach won't work. — Tim Pietzcker, Feb 19 '14 at 07:15
I stumbled upon this site yesterday through another stack overflow post: http://regex101.com/ It has a debug mode which supposedly walks through your regular expression and shows at each position what was matched (or not matched) in your string. — Quixrick, Feb 19 '14 at 14:03

score 1 · Answer 1 · answered Feb 19 '14 at 07:23

This is difficult because it's not always a single character that makes an email address invalid. The example you give could easily be solved by:

$position = strpos('someone@exa mple.com', ' ');

However, it seems you are not interested in an all-encompassing solution but rather something that will catch the majority of character-based errors. I would take the approach of using the regular expression but capture each section of the email address in a sub pattern for further validation. For example:

$matches = null;
$result = preg_match("/(([\w\-]+)\@([\w\-]+)\.([\w\-]+))/", $email, $matches);
var_dump($matches);

By capturing sections of the regex validation in sub patterns you could then dive further into each section and run similar or different tests to determine where the user went wrong. For example you could try and match up the TLD of the email address against a whitelist. Of course there are also much more robust email validators in frameworks like Zend or Symfony that will tell you more specifically WHY an email address is not valid, but in terms of knowing which specific character position is at fault (assuming it's a character that is at fault) I think a combination of tactics would work best.
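To make the sub pattern idea a bit more concrete, here is a minimal sketch that also asks preg_match for the offset of each captured section via the PREG_OFFSET_CAPTURE flag. The helper name email_sections, the anchored pattern, and the three-entry TLD list are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: returns each captured section together with its byte
// offset, or null when the overall pattern does not match.
function email_sections(string $email): ?array
{
    // Anchored variant of the question's pattern, one group per section.
    $pattern = '/^([\w\-]+)@([\w\-]+)\.([\w\-]+)$/';
    if (!preg_match($pattern, $email, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    // With PREG_OFFSET_CAPTURE every entry is [matched text, byte offset].
    return [
        'local'  => $m[1],
        'domain' => $m[2],
        'tld'    => $m[3],
    ];
}

$sections = email_sections('someone@example.com');

// Illustrative whitelist only; a real list of TLDs would be much longer.
$knownTlds = ['com', 'org', 'net'];
$tldOk = $sections !== null
    && in_array(strtolower($sections['tld'][0]), $knownTlds, true);
```

Each section can then be checked separately, and because its offset is known, a failed follow-up check can be reported relative to the whole address.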

score 1 · Answer 2 · edited May 23 '17 at 11:45

There is no way I know of in Java to report back the point at which a regex failed. What you could do is start building a set of common errors (as described by Manu) that you can check for (this might or might not use regular expressions). Then categorize into these known errors and 'other', counting the frequency of each. When an 'other' error occurs, develop a regex that would catch it.

If you want some assistance with tracking down why the regex failed you could use a utility such as RegexBuddy, shown in this answer.
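The categorize-and-count idea could look roughly like this in PHP, the language of the question. The function name classify_email_error and the category labels are made up for this sketch; the list of checks is only a starting point and would grow as new 'other' cases turn up.

```php
<?php
// Hypothetical classifier: maps an address to one known error category,
// or to 'other' when none of the known checks applies.
function classify_email_error(string $email): string
{
    if (filter_var($email, FILTER_VALIDATE_EMAIL) !== false) {
        return 'valid';
    }
    if (preg_match('/\s/', $email)) {
        return 'whitespace';
    }
    $atCount = substr_count($email, '@');
    if ($atCount === 0) {
        return 'missing_at';
    }
    if ($atCount > 1) {
        return 'multiple_at';
    }
    if (substr($email, -1) === '.') {
        return 'trailing_dot';
    }
    if (strpos($email, '..') !== false) {
        return 'double_dot';
    }
    return 'other';
}

// Sample input, invented for the example.
$samples = [
    'someone@exa mple.com',
    'someone@example.com.',
    'someone.example.com',
    'a@b@c.com',
    'someone@example.com',
];

// Frequency of each category, e.g. for logging which mistakes are common.
$counts = array_count_values(array_map('classify_email_error', $samples));
```

The order of the checks decides which category wins when an address has several problems, so the most informative checks should probably come first.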

score 0 · Answer 3 · answered Feb 19 '14 at 07:22

Just implement some checks on your own:

Point at the end:

if(substr($email, -1) == '.')
    echo "Please remove the point at the end of your email";

Spaces found:

$spacePos = strpos($email, ' ');
if($spacePos !== false) // strpos returns false when no space is found; position 0 is a valid hit
   echo  "Please remove the space at pos: ".$spacePos;

And so on...
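The individual checks can also be generalized a little: instead of looking for one specific character, look for the first character that the pattern from the question would never accept. This is only a sketch; the helper name first_disallowed_offset is invented here, and the allowed set (word characters, '-', '@', '.') is taken from the question's regex rather than from any email standard.

```php
<?php
// Hypothetical helper: byte offset of the first character outside the set
// accepted by the question's pattern, or null when all characters are allowed.
function first_disallowed_offset(string $email): ?int
{
    if (preg_match('/[^\w\-@.]/', $email, $m, PREG_OFFSET_CAPTURE)) {
        return $m[0][1]; // offset of the offending character
    }
    return null;
}

$pos = first_disallowed_offset('someone@exa mple.com'); // 11, the space
```

This only catches character-level mistakes; structural ones such as a missing '@' still need their own checks.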

score 0 · Answer 4 · answered Feb 19 '14 at 10:39

First of all, the reason your example fails is not the space itself. It fails because the part before the space lacks a '.' and the part after it lacks an '@'. If you input

'someone@example.co m' or 's omeone@example.com', it will succeed.

So you may need 'begin with' and 'end with' anchors to check strictly.
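A quick way to see the difference, using the pattern from the question with and without anchors (the variable names are just for this sketch):

```php
<?php
$unanchored = '/([\w\-]+\@[\w\-]+\.[\w\-]+)/';  // pattern from the question
$anchored   = '/^[\w\-]+\@[\w\-]+\.[\w\-]+$/';  // same pattern with ^ and $

$email = 's omeone@example.com';

// Unanchored: 'omeone@example.com' is found inside the string, so it matches.
$looseResult = preg_match($unanchored, $email);

// Anchored: the whole string has to fit, so the space makes it fail.
$strictResult = preg_match($anchored, $email);
```

Note that `$` also matches just before a trailing newline; the `D` modifier or `\z` closes that gap if it matters.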

As far as I know, there is no existing method to check where a regular expression match fails, since the check only gives you the matches. But if you really want to find out, we can do something by breaking down the regular expression.

Let's take a look at your example check.

preg_match ("/^[\w\-]+\@[\w\-]+\.[\w\-]+$/",'someone@example.com.');

If it fails, you can check where its sub expressions succeed and find out where the problem is:

$email = "someone@example.com.";
$result = null; // stays null when the full pattern matches
if(!preg_match ("/^[\w\-]+\@[\w\-]+\.[\w\-]+$/",$email)){ // fails because of the final '.'
    if(preg_match("/^[\w\-]+\@[\w\-]+\./",$email,$matches)){ // succeeds

        $un_match = "[\w\-]+"; // The part taken off the tail of the regular expression.
        foreach ($matches as $match){
            $email_tail = str_replace($match,'',$email); // The email without the matched part; in this case 'com.'
            if(preg_match('/^'.$un_match.'/',$email_tail,$match_tails)){ // Check whether the tail starts with the sub expression and delete that part. Here 'com' matches /[\w\-]+/ but '.' doesn't.
                $result = str_replace($match_tails[0],'',$email_tail);
            }else{
                $result = $email_tail;
            }
        }
    }
}
var_dump($result); // you will get the last '.'

If you understand the example above, we can make the solution more general, for instance something like this:

$email = 'som eone@example.com.';
    $pattern_chips = array(
        '/^[\w\-]+\@[\w\-]+\./' => '[\w\-]+',
        '/^[\w\-]+\@[\w\-]+/' => '\.',
        '/^[\w\-]+\@/' => '[\w\-]+',
        '/^[\w\-]+/' => '\@',
    );
    if(!preg_match ("/^[\w\-]+\@[\w\-]+\.[\w\-]+$/",$email)){
      $result = $email;
      foreach ($pattern_chips as $pattern => $un_match){
        if(preg_match($pattern,$email,$matches)){
          $email_tail = str_replace($matches[0],'',$email);
          if(preg_match('/^'.$un_match.'/',$email_tail,$match_tails)){
            $result = str_replace($match_tails[0],'',$email_tail);
          }else{
            $result = $email_tail;
          }
          break;
        }
      }
      if(empty($result)){
        echo "Something more has to follow {$email}";
      }else{
        var_dump($result);
      }
    }else{
      echo "success";
    }

and you will get output:

string ' eone@example.com.' (length=18)
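Since the question asks for a position rather than the leftover text, the same break-down idea can be turned into an offset. The sketch below is a variation, not the code above: it tries progressively shorter anchored prefixes of the pattern and returns the length of the longest one that matches. The function name failure_offset is invented for the example, and, as Tim Pietzcker's comment points out, the offset is where matching stopped, which is not necessarily where the user's mistake is.

```php
<?php
// Hypothetical helper: byte offset at which the longest matching prefix of
// the pattern ends, or null when the full pattern matches.
function failure_offset(string $email): ?int
{
    $full = '/^[\w\-]+@[\w\-]+\.[\w\-]+$/D';
    if (preg_match($full, $email)) {
        return null; // nothing failed
    }
    // Prefixes of the full pattern, longest first.
    $prefixes = [
        '/^[\w\-]+@[\w\-]+\.[\w\-]+/',
        '/^[\w\-]+@[\w\-]+\./',
        '/^[\w\-]+@[\w\-]+/',
        '/^[\w\-]+@/',
        '/^[\w\-]+/',
    ];
    foreach ($prefixes as $prefix) {
        if (preg_match($prefix, $email, $m)) {
            // The prefix is anchored at 0, so its length is the offset
            // of the first character that did not fit.
            return strlen($m[0]);
        }
    }
    return 0; // not even the first character fits
}

$spaceCase = failure_offset('som eone@example.com.'); // 3, the space
$dotCase   = failure_offset('someone@example.com.');  // 19, the final '.'
```

Using the match length instead of str_replace also avoids surprises when the matched text happens to occur more than once in the address.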
