
I need to check a column in a CSV file for valid email addresses, keeping the valid ones and blanking out the invalid ones. I already have an AWK command with a simple regex, but some invalid emails slip through it. This is the command:

awk 'BEGIN{FS=OFS=","}{$1=match($1,/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}/)?substr($1,RSTART,RLENGTH):"";print}'

I want to replace this regex with an RFC 5322 compliant one. I found the following regex, but it doesn't work when I put it into the AWK command above. How can I insert this pattern into that command?

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

The CSV sample is below:

-pedja-@mail.ru,abd
0.5maratonac@gmail.com,534
00dovla.@gmail.com,5rfrf
015.josa@gmail.com,54rf
02142..6584@nadlanu.com,54r4
0616080668.boki@gmail.com,5443
0@0..com,344545
.100.three.7@gmail.com,64
10867249ld@emailgg.xyz,54444

I tried the command below:

awk 'BEGIN{FS=OFS=","}{$1=match($1,/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}/)?substr($1,RSTART,RLENGTH):"";print}'

Expected output

-pedja-@mail.ru,abd
0.5maratonac@gmail.com,534
,5rfrf
015.josa@gmail.com,54rf
,54r4
0616080668.boki@gmail.com,5443
,344545
,64
10867249ld@emailgg.xyz,54444
,4355

(00dovla.@gmail.com, 02142..6584@nadlanu.com, 0@0..com, .100.three.7@gmail.com and john@ are not valid emails, so they are removed.)
Gammix
  • That regexp is a PCRE. Awk doesn't support PCREs, only EREs. So you can't put that regexp in an awk command. – Ed Morton Apr 27 '23 at 12:34
  • See http://www.regular-expressions.info/email.html for a regexp that matches emails. You may need to tweak it based on the tool you're using for that matching. I personally use `(addr ~ /^[0-9a-zA-Z._%+-]+@[0-9a-zA-Z.-]+\.[a-zA-Z]{2,}$/) && (addr ~ /^.*[a-zA-Z]{2}.*@.*[a-zA-Z]{2}.*\.[a-zA-Z]{2,}$/)` in awk with the second check just to get rid of strings like `x@y.co` that are technically valid email addresses but are more likely just noise in the input. It could probably be done in one regexp but I was lazy. Good luck! – Ed Morton Apr 27 '23 at 12:36
  • Oh, and see [how-can-i-validate-an-email-address-using-a-regular-expression](https://stackoverflow.com/questions/201323/how-can-i-validate-an-email-address-using-a-regular-expression) for more info on validating email addresses per that RFC. – Ed Morton Apr 27 '23 at 12:38

1 Answer


How can I insert this regex pattern to above AWK command

To make this regex pattern work with your AWK command, you have to

  • embed the ' quotes it contains in the single-quoted program text by replacing each of the two ' with '\''
  • remove every ?: in the pattern, since awk's EREs have no non-capturing groups
  • anchor the pattern to the beginning and end of $1 by using /^…$/
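
As a rough illustration of those three steps, here is a minimal sketch on a reduced version of the pattern. It keeps only the unquoted local part and the host-name domain and leaves out the quoted-string and IP-literal branches, so it is not the full RFC 5322 regex. The function name `filter_emails`, the variables `atext`, `label` and `re`, the use of dynamic regex strings instead of a /…/ literal, and the `tolower()` call (the pattern only lists lowercase letters) are all assumptions of this sketch, not part of the original command.

```shell
#!/bin/sh
# Sketch under stated assumptions: reduced pattern, hypothetical names.
# Step 1: the apostrophe inside the character class is written as '\''
#         because the whole awk program sits in single quotes.
# Step 2: every group is a plain ( ... ), with the PCRE-only ?: removed.
# Step 3: ^ and $ anchor the pattern to the whole of $1.
filter_emails() {
  awk '
    BEGIN {
      FS = OFS = ","
      # Characters allowed in an unquoted local part.
      atext = "[a-z0-9!#$%&'\''*+/=?^_`{|}~-]+"
      # One host-name label: alphanumeric at both ends, hyphens inside.
      label = "[a-z0-9]([a-z0-9-]*[a-z0-9])?"
      re = "^" atext "(\\." atext ")*@(" label "\\.)+" label "$"
    }
    {
      # Blank the first column when it does not match; compare in lowercase.
      if (tolower($1) !~ re) $1 = ""
      print
    }
  ' "$@"
}
```

It could be called as `filter_emails input.csv`, where `input.csv` is a placeholder file name; with no argument it reads standard input. Because the pattern is anchored, a non-matching field is emptied rather than trimmed to its matching part, which is what the expected output in the question asks for.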
Armali