
I was writing a small wrapper for nullmailer when I noticed what is, IMHO, unwanted behavior in grep. In particular, I noticed something strange with @ characters.

It seems to break strings containing @ and produce wrong output.

TL;DR

E-mail addresses have rules to follow (e.g. RFC 2822), so I will use a deliberately wrong regular expression for them, just to keep things a bit shorter. Note that this does not change the problem I'm asking about.

I am using e-mail addresses in this post, but the problem obviously applies to every string with at least one @ in it.

I wrote a small script to help me explain what I "found":

#!/bin/bash

# Each function greps two sample strings with a slightly different pattern.
# "${arr[@]}" is quoted so every element is passed through unchanged.
funct1() {
  arr=(local1@domain.tld local2@domain.tld)
  regex="[[:alnum:]]*@[[:alpha:]]*\.[[:alpha:]]\{2,\}"
  for dest in "${arr[@]}"; do
    printf "%s\n" "$dest" | grep -o -e "$regex"
  done
}

funct2() {
  arr=(local1@domain.tld local2@domain.tld)
  regex="[[:alpha:]]*@[[:alpha:]]*\.[[:alpha:]]\{2,\}"
  for dest in "${arr[@]}"; do
    printf "%s\n" "$dest" | grep -o -e "$regex"
  done
}

funct3() {
  arr=(local1@dom1@ain.tld local2@dom2@ain.tld)
  regex="[[:alpha:]]*@[[:alpha:]]*@[[:alpha:]]*\.[[:alpha:]]\{2,\}"
  for dest in "${arr[@]}"; do
    printf "%s\n" "$dest" | grep -o -e "$regex"
  done
}

funct4() {
  arr=(local1@dom1@ain.tld local2@dom2@ain.tld)
  regex="[[:alpha:]]*@[[:alnum:]]*@[[:alpha:]]*\.[[:alpha:]]\{2,\}"
  for dest in "${arr[@]}"; do
    printf "%s\n" "$dest" | grep -o -e "$regex"
  done
}

printf "One @, all parts of regex right:\n"
funct1
printf "One @, first part of regex wrong:\n"
funct2
printf "Two @, first and second part of regex wrong:\n"
funct3
printf "Two @, first part of regex wrong:\n"
funct4
exit 0

To better understand the problem, I used two types of strings, local1@domain.tld and local1@dom1@ain.tld, and it seems to me that grep does not behave correctly with strings containing at least one @.

The output is:

One @, all parts of regex right:
local1@domain.tld
local2@domain.tld

One @, first part of regex wrong:
@domain.tld
@domain.tld

Two @, first and second part of regex wrong:

Two @, first part of regex wrong:
@dom1@ain.tld
@dom2@ain.tld

funct1 has a regular expression that matches the entire strings, so no problem: all of them are printed.

funct2 has a regular expression that matches the strings only from the @ to the end, so I would expect no output because the expression is wrong; instead, I get the second part of each string...

That is why I decided to add a second @ to the strings and do some tests.

funct3 matches the strings only from the second @ to the end, so I would expect no output at all because of the mistake in the regex. OK, no output.

funct4, instead, has a regular expression that matches the strings only from the first @ to the end, so I would expect it to show me nothing; instead, I get the output from the first @ onward, just as with funct2.

Except for funct1, I shouldn't get any output at all, am I right?

Why does grep break the result at the first @?

I consider it unwanted behavior because this way the result will consist of strings that don't match my expression entirely.

Am I missing something?

EDIT: deleted the undefined-behavior tag

ingroxd
  • Your function calls do not match the function names used. – cdarke May 25 '18 at 13:12
  • What you're missing is how `grep` works: it's perfectly happy extracting matches from a string when you ask it to, which is why `funct2` gives you a partial match. Use anchors (`^` and `$`), representing respectively the start and end of a string (of a line for grep), to force your pattern to match only complete lines (therefore validating their format). – Aaron May 25 '18 at 13:13
  • You realize that the regex quantifier `*` matches **zero** or more characters, right? So `[[:alpha:]]*@` matches `1234@` because there are zero alphabetic chars before the @. – glenn jackman May 25 '18 at 13:14
  • @cdarke: corrected, thanks. @glennjackman: You are right. I feel so dumb now lol. I substituted `*` with `\+`, now it works as intended. – ingroxd May 25 '18 at 13:53
  • Take Aaron's advice too: use anchors. – glenn jackman May 25 '18 at 14:36
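
To make the two points from the comments concrete, here is a minimal sketch. The helper name `extract` and the test strings are made up for illustration; the patterns are the question's deliberately loose one plus two variants. `\{1,\}` is used as the POSIX spelling of GNU's `\+`.

```shell
#!/bin/bash
# extract is a hypothetical helper: print every substring of $1 that the
# basic regular expression $2 matches, one per line (nothing if no match).
extract() {
  printf '%s\n' "$1" | grep -o -e "$2" || true
}

# The question's loose address pattern, in three variants.
star_re='[[:alpha:]]*@[[:alpha:]]*\.[[:alpha:]]\{2,\}'
plus_re='[[:alpha:]]\{1,\}@[[:alpha:]]\{1,\}\.[[:alpha:]]\{2,\}'
anchored_re="^${plus_re}\$"

extract 'local1@domain.tld' "$star_re"      # zero letters before @ is allowed
extract 'local1@domain.tld' "$plus_re"      # "1" sits right before @: no match
extract 'ab1c@domain.tld'   "$plus_re"      # still matches a substring
extract 'ab1c@domain.tld'   "$anchored_re"  # whole line must match: no output
extract 'abc@domain.tld'    "$anchored_re"  # whole line matches
```

Requiring at least one letter fixes the sample input from the question, but only the anchored variant rejects every string that does not match from start to end.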

1 Answer

Your regex has issues; grep is working as designed. You could also just count the number of @ characters as a test. Personally, I would create a boolean function like this:

#!/bin/bash

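# -- is email address valid ? --
# Returns 0 (true) when the whole string matches; the trailing $ anchor
# rejects input that merely starts with a valid-looking address.
function isEmailValid() {
      echo "$1" | grep -E -q "^([A-Za-z]+[A-Za-z0-9]*((\.|\-|\_)?[A-Za-z]+[A-Za-z0-9]*){1,})@(([A-Za-z]+[A-Za-z0-9]*)+((\.|\-|\_)?([A-Za-z]+[A-Za-z0-9]*)+){1,})+\.([A-Za-z]{2,})+$"
}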


if isEmailValid "_#@us@.com" ;then
        echo "VALID "
else
        echo "INVALID"
fi


if isEmailValid "us@ibm.com" ;then
        echo "VALID "
else
        echo "INVALID"
fi

Or more simply:

function isEmailValid() {
      # Trailing $ anchors the match to the end of the string.
      local regex="^([A-Za-z]+[A-Za-z0-9]*((\.|\-|\_)?[A-Za-z]+[A-Za-z0-9]*){1,})@(([A-Za-z]+[A-Za-z0-9]*)+((\.|\-|\_)?([A-Za-z]+[A-Za-z0-9]*)+){1,})+\.([A-Za-z]{2,})+$"
      [[ "${1}" =~ $regex ]]
}
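
The other idea mentioned at the top of this answer, counting the @ characters, could look roughly like the sketch below. The function names `count_at` and `has_single_at` are hypothetical, and this only checks the number of @ characters, not the rest of the address.

```shell
#!/bin/bash
# count_at (hypothetical name): print how many @ characters $1 contains.
count_at() {
  local n
  # tr -cd '@' deletes everything except @; wc -c counts what is left.
  n=$(printf '%s' "$1" | tr -cd '@' | wc -c)
  # Arithmetic expansion strips any padding some wc implementations add.
  printf '%s\n' "$((n))"
}

# has_single_at (hypothetical name): succeed only for exactly one @.
has_single_at() {
  [ "$(count_at "$1")" -eq 1 ]
}

for addr in 'us@ibm.com' 'local1@dom1@ain.tld' 'no-at-sign'; do
  if has_single_at "$addr"; then
    printf '%s: one @\n' "$addr"
  else
    printf '%s: %s @ characters\n' "$addr" "$(count_at "$addr")"
  fi
done
```

It works as a cheap pre-check before the regex, since a string with zero or several @ characters can be rejected without any pattern matching.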
Mike Q
  • updated it, I have seen issues in the past but I will leave it as you have suggested .. tx – Mike Q May 26 '18 at 04:50
  • In bash you don't need echo, a pipe, and grep to test if a string matches a regexp. See https://stackoverflow.com/a/21112809/1745001 – Ed Morton May 26 '18 at 19:06
  • I find the above totally fine. It works and the point is to show a better way to check email addrs. Anyway, I'll provide another example, you can tell me if you like it. – Mike Q May 26 '18 at 19:28
  • It'd be inefficient and it'd fail with different input values and/or different versions of echo. There's a lot of different ways you can do it but simply writing `[[ $1 =~ ^(...+ ]]` seems like the obvious choice. – Ed Morton May 26 '18 at 19:32
  • I see your point, I guess I never gave it that much thought. I also don't think optimizing code means picking between built in functions etc. because it's more about readability and BigO. I see that a lot in my current job where people will use stripos over preg_match in PHP because they think it's "helping". I put in something more like yours above but included quotes . – Mike Q May 26 '18 at 22:31
  • I like @MikeQ's solution over bash regex for tests... because of portability. I know that in the question I only used bash, but at the time that is what I was using [: – ingroxd Sep 16 '18 at 11:28
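
For the portability angle raised in the last comments, a grep-based check can be written so that it does not depend on bash or on how `echo` treats its arguments. This is only a sketch: the name `is_email_like` is hypothetical and the pattern is a loose illustration, not an RFC-compliant validator and not the regex from the answer.

```shell
#!/bin/sh
# is_email_like (hypothetical name): succeed when $1, taken as one line,
# matches the anchored pattern from start to end.
is_email_like() {
  # printf '%s\n' prints $1 literally, unlike some echo implementations.
  printf '%s\n' "$1" |
    grep -Eq '^[[:alnum:]._-]+@[[:alnum:]-]+(\.[[:alnum:]-]+)*\.[[:alpha:]]{2,}$'
}

for addr in 'us@ibm.com' '_#@us@.com' 'local1@dom1@ain.tld'; do
  if is_email_like "$addr"; then
    printf '%s: VALID\n' "$addr"
  else
    printf '%s: INVALID\n' "$addr"
  fi
done
```

One caveat: grep works line by line, so a value containing embedded newlines is accepted if any single line matches.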