0

I am new to bash and awk, and I have spent days trying to learn them. I think I am very close to the solution, but not completely there, so I am asking for your help. Do note that I do not wish to use grep, since I find it to be much slower.

I have a huge number of text files, each several hundred MB in size. Unfortunately, they are not all standardized in any one format. There is also a lot of legacy data in here, and a lot of junk and garbled text. I wish to check all of these files for rows with a valid email ID and, if one exists, print the row to a file. Do note I am using Cygwin on Windows 10 (not sure if that matters).

Text file:

!bar@foo.com,address
#john@foo.com;address
john@foo.com;address µÖ
email1@foo.com;username;address
email2@foo.com;username
  email3@foo.com,username;address   [spaces at the start of the row]
 email4@foo.com|username|address   [tabs at the start of the row]

Code:

awk -F'[,|;: \t]+' '{
    gsub(/^[ \t]+|[ \t]+$/, "")
    if (NF>1 && tolower($1) ~ /[0-9a-z_\-\.\+]+@[0-9a-z_\-\.]+\.[a-z0-9]+/)
    {
        r=gensub("[,|;: \t]+",":",1,$0)
        print r > "file_good"
    }
    else
        print $0 > "file_ignore"
}' *.txt

Expected output into: file_good

email1@foo.com:username;address
email2@foo.com:username
email3@foo.com:username;address
email4@foo.com:username|address

Issue with the code:

  1. I can't find a way to filter out non-ascii characters (non printable characters).
  2. For some reason the code allowed rows without a valid email address, for example `!bar@foo.com` and `#john@foo.com`.

Any help would be much appreciated!

rogerwhite
  • This comment might help: https://stackoverflow.com/questions/2898463/using-grep-to-find-all-emails#comment2947889_2898463 – Cyrus May 20 '20 at 15:36
  • [edit] your question to show the expected output given your posted sample input (so we can see, for example, whether you want the row printed or just the email address within the row). Also add to your sample input/output at least 1 case where you have multiple email addresses on a row, and cases where you have text that might be misinterpreted as an email address but is actually invalid. – Ed Morton May 20 '20 at 16:34
  • `#john@foo.com` isn't a valid email address but `john@foo.com` is, so explain in your question how a tool should know whether `#john@foo.com` is a valid email address `john@foo.com` that just happens to have a `#` before it, vs. an invalid email address. – Ed Morton May 20 '20 at 16:43
  • Hi Ed. the regex that I have used should have been able to filter out '#' or any other special characters... That's what I thought. No idea where the bug is: `tolower($1) ~ /[0-9a-z_\-\.\+]+@[0-9a-z_\-\.]+\.[a-z0-9]+/` – rogerwhite May 20 '20 at 23:32
  • It's anchoring as @user... says but beyond that there's a few issues with that regexp. 1) escaping `-` isn't portable, just put it at the start or end of the bracket expression instead. 2) `.` and `+` aren't metachars inside a bracket expression so they shouldn't be escaped. 3) Use of character range for lower case letters isn't portable, you should use a character class instead. So instead of writing `/^[0-9a-z_\-\.\+]+@[0-9a-z_\-\.]+\.[a-z0-9]+$/` you should write it as `/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/` and change `tolower($1)` to `$1` since the RE now handles all cases. – Ed Morton May 21 '20 at 15:46
  • See also https://stackoverflow.com/q/201323/1745001 and https://stackoverflow.com/a/40051448/1745001 for more accurate regexps for matching an email address. I'd use `\<[[:alnum:]._%+-]+@[[:alnum:]_.-]+\.[[:alpha:]]{2,}\>` if I were you since you're already using gawk. – Ed Morton May 21 '20 at 15:55

2 Answers

1

Whilst there are other complexities relating to the stated goal, the main reason why your original awk program did not work as expected is that the regex lacked anchoring:

tolower($1) ~ /^[0-9a-z_\-\.\+]+@[0-9a-z_\-\.]+\.[a-z0-9]+$/

That is, `$1 ~ /.../` is changed to `$1 ~ /^...$/`. Also, the `r=gensub` part of the original program doesn't appear to be doing anything useful (I didn't see `r` used anywhere else). `gensub` is specific to GNU awk; it could be that in this case all that's needed is `sub`.
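With the anchor in place, a corrected sketch of the question's program could look like the following. The portable bracket expressions come from Ed Morton's comments; the `$0 !~ /[^ -~\t]/` test is my own addition to drop rows containing non-ASCII bytes (issue 1 in the question), and `sub` is used instead of `gensub` so the program isn't GNU-awk-specific:

```shell
# Anchored version: the whole first field must be an email address,
# not merely contain one somewhere inside it.
awk -F'[,|;: \t]+' '{
    gsub(/^[ \t]+|[ \t]+$/, "")   # trim whitespace; this also re-splits the fields
    if (NF > 1 &&
        $0 !~ /[^ -~\t]/ &&       # drop rows containing non-ASCII bytes
        $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/) {
        sub(/[,|;: \t]+/, ":")    # replace only the first delimiter
        print > "file_good"
    } else {
        print > "file_ignore"
    }
}' *.txt
```

Run it under `LC_ALL=C` if you want the `[^ -~\t]` range interpreted byte-wise regardless of locale.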

  • Thanks user13586221... I have corrected my code above, to showcase the use of variable "r". Should I replace gensub with gsub? If so, what's the correct code? – rogerwhite May 21 '20 at 04:42
  • @rogerwhite Yes that's fine to use with GNU awk. If you don't need both `r` and the original `$0`, `r=gensub("[,|;: \t]+",":",1)` could be replaced with `sub("[,|;: \t]+",":")`. The main difference is that `sub` & `gsub` modify the original string and return the number of substitutions made, whereas `gensub` returns the modified string. `gensub` has other features like backreferences & the third arg to replace *g*lobally or the *n*th match. – May 21 '20 at 05:17
0

This isn't a complete solution, but I can think of a few preliminary steps which will probably make the rest of the process much simpler.

tr ';,|' '\n' < textfile > textfile2
mv textfile2 textfile
sed -n '/@/p' textfile > emails
sed -i '/@/d' textfile

What that will do is turn all of those delimiters into newlines, which has the effect of putting the delimited fields on separate lines. After that, a brute-force search for all lines containing a '@' symbol will hopefully give you at least a few email addresses, which you can then dump out to a separate file and delete from the original. From there, you can probably build a similar heuristic for pulling out the usernames and snail-mail addresses, if you can find a common anchor.
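The same split-then-search idea can also be done in a single awk pass, which avoids rewriting the file twice (a sketch; `emails` and `remainder` are hypothetical output names):

```shell
# Split each row on the delimiters and route each field by whether it
# contains an @ sign -- the tr + sed steps above, in one pass.
awk '{
    n = split($0, f, /[;,|]/)
    for (i = 1; i <= n; i++) {
        if (f[i] ~ /@/) print f[i] > "emails"
        else            print f[i] > "remainder"
    }
}' textfile
```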

In my experience, regular expressions can induce literal migraines. Wherever possible, I try to use the simplest solution I can. As mentioned, this most likely isn't perfect; but it's a start.

petrus4