0

I have a list of email addresses in a text file, and a pattern with character classes that specifies which characters are allowed in them. From that input file, I want to find only the email addresses that contain characters other than the allowed ones. I am trying to write a gawk command for this, but cannot get it to work properly. Here is what I am trying:

gawk -F "," ' $2!~/[[:alnum:]@\.]]/ { print "has invalid chars" }' emails.csv

The problem is that the above gawk command only matches records that have NONE of the alphanumeric, @ and . (dot) characters in them. What I am looking for is records that have the allowed characters but also contain disallowed ones.

For example, the above command would find

"_-()&(()%"

because it contains only characters that are not in the regex pattern, but will not find

"abc-123@xyz,com"

because it also contains characters that are in the character classes specified in the regex pattern.

iamharish15

4 Answers

1

How about combining several tests: the field contains an alnum, an @, a dot, and an invalid character:

$2 ~ /[[:alnum:]]/ && $2 ~ /@/ && $2 ~ /\./ && $2 ~ /[^[:alnum:]@.]/
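
To see the combined condition in action, here is a minimal sketch. The function name `flag_mixed` and the sample records are made up for illustration, and it assumes an awk that supports POSIX character classes (gawk does). The sample address uses a dot rather than the comma from the question, because a comma inside the address would be split by `-F,`.

```shell
# Hypothetical helper: flag field 2 only when it contains an alnum, an @,
# a dot, AND at least one character outside the allowed set.
flag_mixed() {
  awk -F, '
    $2 ~ /[[:alnum:]]/ && $2 ~ /@/ && $2 ~ /\./ && $2 ~ /[^[:alnum:]@.]/ {
      print $2 ": has invalid chars"
    }'
}

# Made-up sample data in the form id,email
printf '%s\n' '1,abc-123@xyz.com' '2,abc@xyz.com' '3,_-()&(()%' | flag_mixed
```

Only the first record is reported: the second has no disallowed character, and the third has no alnum, @ or dot at all.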
glenn jackman
  • `$2 ~ /[[:alnum:]@\.]/ && $2 ~ /[^[:alnum:]@\.]/` – iamharish15 Oct 14 '16 at 15:16
  • Problem with that is it will match a line like `@+` because you're looking for *one of* alnum or @ or dot. My suggestion enforces *all of* those three – glenn jackman Oct 14 '16 at 15:55
  • matching all 3 of those wouldn't get you much closer to validating an email address though, `.7@`, for example, would pass the test, so YMMV as to the benefits of testing all 3 vs `/[[:alnum:]@.]/`. `$2 ~ /[[:alnum:]].*@.*\./` would be better but still leaves a lot of holes. – Ed Morton Oct 14 '16 at 20:57
  • @iamharish15 as mentioned elsethread you do not need to escape `.` within a bracket expression and testing a regexp and then the negation of that regexp is just another form of code duplication which is always a bad approach. – Ed Morton Oct 14 '16 at 21:02
0

Your regex is wrong here:

/[[:alnum:]@\.]]/

It should be:

/[[:alnum:]@.]/

Note the removal of the extra ] from the end.

Test Case:

# regex with extra ]
awk -F "," '{print ($2 !~ /[[:alnum:]@.]]/)}' <<< 'abc,ab@email.com'
1

# correct regex
awk -F "," '{print ($2 !~ /[[:alnum:]@.]/)}' <<< 'abc,ab@email.com'
0
anubhava
  • No, actually, that extra bracket was by mistake. But for your test case, even with the correct email address, it matches the pattern. – iamharish15 Oct 14 '16 at 15:09
  • No that's not correct. `awk -F "," ' $2 !~ /[[:alnum:]@.]/ { print "has invalid chars" }' <<< 'abc,ab@email.com'` will not output that message. But first one with extra `]` won't work. – anubhava Oct 14 '16 at 15:18
  • Also see my updated answer with different results with an extra `]` – anubhava Oct 14 '16 at 15:34
0

Do you really care whether the string has a valid character? If not (and it seems like you don't), the simple solution is

$2 ~ /[^[:alnum:]@.]/{ print "has invalid chars" }

That won't trigger on an empty string, so you might want to add a test for that case.
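
One way to add that empty-string test is sketched below. The function name `find_invalid`, the sample records, and the message wording are assumptions for illustration, not part of the original command.

```shell
# Hypothetical helper: report empty addresses separately, then flag any
# address containing a character outside [[:alnum:]@.]
find_invalid() {
  awk -F, '
    $2 == ""              { print NR ": empty address"; next }
    $2 ~ /[^[:alnum:]@.]/ { print NR ": has invalid chars" }
  '
}

# Made-up sample data in the form id,email
printf '%s\n' 'a,ok@example.com' 'b,' 'c,bad-name@example.com' | find_invalid
```

With this input, record 2 is reported as empty and record 3 as having invalid characters. The `next` keeps an empty field from being tested by the second rule.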

rici
0

Your question would REALLY benefit from some concise, testable sample input and expected output, as right now we're all guessing at what you want, but maybe this does it:

awk -F, '{r=$2} gsub(/[[:alnum:]@.]/,"",r) && (r!="") { print "has invalid chars" }' emails.csv

e.g. using the 2 input examples you provided:

$ cat file
_-()&(()%
abc-123@xyz,com

$ awk '{r=$0} gsub(/[[:alnum:]@.]/,"",r) && (r!="") { print $0, "has invalid chars" }' file
abc-123@xyz,com has invalid chars

There are more accurate email regexps btw, e.g.:

\<[[:alnum:]._%+-]+@[[:alnum:]_.-]+\.[[:alpha:]]{2,}\>

which is a gawk-specific (for word delimiters \< and \>) modification of the one described at http://www.regular-expressions.info/email.html after updating to use POSIX character classes.

If you are trying to validate email addresses do not use the regexp you started with as it will declare @ and 7 to each be valid email addresses.
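
As a rough illustration of the difference, the sketch below applies a variant of the stricter regexp to whole fields. It is not the exact pattern above. It anchors with `^` and `$` instead of the gawk-specific `\<` and `\>`, and spells `{2,}` out as `[[:alpha:]][[:alpha:]]+`, on the assumption that the awk in use may lack interval expressions. The function name `check_emails` and the sample records are made up.

```shell
# Hypothetical helper: classify field 2 against an anchored, portable
# variant of the stricter email regexp.
check_emails() {
  awk -F, '
    {
      if ($2 ~ /^[[:alnum:]._%+-]+@[[:alnum:]_.-]+\.[[:alpha:]][[:alpha:]]+$/) {
        print $2 ": looks valid"
      } else {
        print $2 ": rejected"
      }
    }'
}

# Made-up sample data in the form id,email
printf '%s\n' '1,abc-123@xyz.com' '2,@' '3,7' | check_emails
```

Here `@` and `7` are both rejected, while `abc-123@xyz.com` is accepted. This is still only a plausibility check, not full validation.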

See also "How to validate an email address using a regular expression?" for more email regexp details.

Ed Morton