I am new to bash and awk, and I have spent days trying to learn them. I think I am very close to the solution, but not completely there, so I am asking for your help. Note that I do not wish to use grep, since I find it to be much slower.
I have a huge number of text files, each several hundred MB in size. Unfortunately, they are not all standardized in any one format. There is also a lot of legacy data in them, along with a lot of junk and garbled text. I wish to check all of these files to find rows that start with a valid email address and, where one exists, print the row to a file. Note that I am using Cygwin on Windows 10 (not sure if that matters).
Text file:
!bar@foo.com,address
#john@foo.com;address
john@foo.com;address µÖ
email1@foo.com;username;address
email2@foo.com;username
email3@foo.com,username;address [spaces at the start of the row]
email4@foo.com|username|address [tabs at the start of the row]
Code:
awk -F'[,|;: \t]+' '{
    gsub(/^[ \t]+|[ \t]+$/, "")
    if (NF>1 && tolower($1) ~ /[0-9a-z_\-\.\+]+@[0-9a-z_\-\.]+\.[a-z0-9]+/)
    {
        r=gensub("[,|;: \t]+",":",1,$0)
        print r > "file_good"
    }
    else
        print $0 > "file_ignore"
}' *.txt
Expected output in file_good:
email1@foo.com:username;address
email2@foo.com:username
email3@foo.com:username;address
email4@foo.com:username|address
Issues with the code:
- I can't find a way to filter out non-ASCII (non-printable) characters.
- For some reason the code lets through rows without a valid email address, for example: !bar@foo.com, #john@foo.com, etc.
Any help would be much appreciated!
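
For reference, here is a minimal sketch of one way both issues might be addressed. It is not a definitive fix. The email pattern in the code above is not anchored, so the ~ operator matches it anywhere inside $1, which is probably why !bar@foo.com passes (bar@foo.com is found inside it). The sketch anchors the pattern with ^ and $, and rejects any row containing a byte outside tab and printable ASCII. Assumptions: sample.txt is a hypothetical file holding the rows shown above, rows with non-ASCII text should be sent to file_ignore rather than cleaned, and LC_ALL=C is set so that awk compares bytes instead of multibyte characters. It also uses the portable sub() instead of the gawk-only gensub().

```shell
# Hypothetical sample input mirroring the rows from the question.
printf '%s\n' \
  '!bar@foo.com,address' \
  '#john@foo.com;address' \
  'john@foo.com;address µÖ' \
  'email1@foo.com;username;address' \
  'email2@foo.com;username' \
  '   email3@foo.com,username;address' \
  > sample.txt
printf '\t\temail4@foo.com|username|address\n' >> sample.txt

# LC_ALL=C makes awk treat the input as bytes, so [^\t -~] matches
# anything that is not a tab or a printable ASCII character.
LC_ALL=C awk -F'[,|;: \t]+' '
{
    # Trim leading/trailing blanks; changing $0 also re-splits the fields.
    gsub(/^[ \t]+|[ \t]+$/, "")

    if (NF > 1 &&
        $0 !~ /[^\t -~]/ &&
        tolower($1) ~ /^[0-9a-z_.+-]+@[0-9a-z_.-]+\.[a-z0-9]+$/) {
        # Replace only the first run of separators with a colon.
        sub(/[,|;: \t]+/, ":")
        print > "file_good"
    } else
        print > "file_ignore"
}' sample.txt
```

Putting the hyphen last inside the brackets avoids the need for backslash escapes. For the real data, sample.txt would be replaced by *.txt as in the original command.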