I have the code below, which works, to parse and clean very large log files and split them into smaller output files. The output filename is the first 2 characters of each line; if either of those characters is a special character, it is replaced with a '_' so that the filename contains no illegal characters.
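For illustration, here is the filename rule on its own (this needs gawk, since gensub is GNU-specific; the sample address is made up):

echo "a.b@foo.com:datahere" | gawk '{
    k = tolower(substr($0, 1, 2))               # first two characters of the line
    print gensub("[^[:alnum:]]", "_", "g", k)   # special characters -> "_"
}'
# prints: a_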
It takes about 12-14 minutes to process 1 GB of logs on my laptop. Can this be made faster?
Is it possible to run this in parallel? I am aware I could background each per-file awk invocation by ending it with }' "$FILE" & and waiting for all of them (sketched after the code below). However, I tested that and it does not help much. Is it possible to ask awk itself to output in parallel, i.e. what would be the equivalent of print $0 >> Fpath & ?
Any help will be appreciated.
Sample log file
"email1@foo.com:datahere2
email2@foo.com:datahere2
email3@foo.com datahere2
email5@foo.com;dtat'ah'ere2
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ
email3@foo.com:datahere2
Expected Output
# cat em
email1@foo.com:datahere2
email2@foo.com:datahere2
email3@foo.com:datahere2
email5@foo.com:dtat'ah'ere2
email3@foo.com:datahere2
# cat errorfile
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ
Code:
#!/bin/bash
pushd "_test2" > /dev/null
for FILE in *
do
    awk '
    BEGIN {
        FS=":"
    }
    {
        # strip leading/trailing whitespace, double and single quotes
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        # normalize the first run of separators to ":"
        $0=gensub("[,|;: \t]+",":",1,$0)
        # keep only lines with a valid-looking address and pure-ASCII content
        if (NF>1 && $1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+$/ && $0 ~ /^[\x00-\x7F]*$/)
        {
            # filename: first two characters of the address, lowercased,
            # with non-alphanumerics replaced by "_"
            Fpath=tolower(substr($1,1,2))
            Fpath=gensub("[^[:alnum:]]","_","g",Fpath)
            print $0 >> Fpath
        }
        else
            print $0 >> "errorfile"
    }' "$FILE"
done
popd > /dev/null
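For reference, the backgrounded variant I mentioned above looks like this (the awk program is the same one shown under Code, saved to parse.awk; that filename is just for illustration):

pushd "_test2" > /dev/null
for FILE in *
do
    awk -f parse.awk "$FILE" &   # one awk per input file, in the background
done
wait                             # wait for all background jobs to finish
popd > /dev/null

Note that with this approach several awk processes may append to the same output file at the same time, and in my tests the overall runtime barely improved.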