I have the code below, which works (kudos to @EdMorton). It parses and cleans very large log files and splits the output into smaller files. Each output filename is the first 2 characters of the line, lower-cased. Any special character among those 2 characters is replaced with a '_', so that the filename contains no illegal characters.
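As a rough illustration of the naming rule only (a sketch, not the script itself): `bucket_name` is a made-up helper, and it uses `[^a-z0-9]` after `tolower()`, which for ASCII input should match the same characters as the `[^[:alnum:]]` used in the code further down.

```shell
# bucket_name is a hypothetical helper, not part of the script.
bucket_name() {
    printf '%s\n' "$1" | awk '{
        name = tolower(substr($0, 1, 2))  # first two characters, lower-cased
        gsub(/[^a-z0-9]/, "_", name)      # any other character becomes "_"
        print name
    }'
}

bucket_name 'email1@foo.com:datahere2'   # prints: em
bucket_name 'a.b@foo.com:data'           # prints: a_
```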
Next, it checks whether any of the output files is larger than a certain size (1,000,000 bytes in the code below); if so, that file is sub-split using the 3rd character as well.
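The size check on its own can be sketched like this. `oversized_files` is a made-up helper name, and `-maxdepth` is assumed to be available (it is in GNU, BSD and BusyBox find, but it is not in POSIX).

```shell
# oversized_files is a hypothetical helper, not part of the script.
# $1 = directory to scan, $2 = size limit in bytes.
oversized_files() {
    # -size +Nc matches files larger than N bytes; _leftover is never split.
    find "$1" -maxdepth 1 -type f -size +"$2"c ! -name '_leftover'
}

# Example: list files in the current directory above 1,000,000 bytes.
oversized_files . 1000000
```

Each file it lists would then be split again on its first 3 characters.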
It takes about 10 minutes to process 1 GB of logs on my laptop. Can this be made faster? Any help would be appreciated.
Sample log file
"email1@foo.com:datahere2
email2@foo.com:datahere2
email3@foo.com datahere2
email5@foo.com;dtat'ah'ere2
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ
email3@foo.com:datahere2
Expected Output
# cat em
email1@foo.com:datahere2
email2@foo.com:datahere2
email3@foo.com:datahere2
email5@foo.com:dtat'ah'ere2
email3@foo.com:datahere2
# cat _leftover
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ
Code:
#!/usr/bin/env bash
Func_Clean(){
    pushd "$1" > /dev/null || return
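    awk '
    {
        # Trim leading and trailing blanks and quotes.
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        # Normalise the first run of separators to a single colon.
        sub(/[,|;: \t]+/, ":")
        # Keep lines that start with an email-like key and are pure ASCII.
        if (/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:/ && /^[\x00-\x7F]*$/) {
            print
        }
        else {
            print >> "_leftover"
        }
    }
    ' * |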
    sort -t':' -k1,1 |
    awk '
    { curr = tolower(substr($0,1,2)) }
    curr != prev {
        # New 2-character prefix: close the previous file and switch to the new one.
        close(Fpath)
        # gensub() is a gawk extension.
        Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
        prev = curr
    }
    {
        print >> Fpath
        # print | "gzip -9 -f >> " Fpath # Throws an error
    } ' && rm *.txt
    # Sub-split any output file over 1,000,000 bytes by its first 3 characters.
    find * -type f -prune -size +1000000c \( ! -iname "_leftover" \) | while IFS= read -r FILE; do
        awk '
        { curr = tolower(substr($0,1,3)) }
        curr != prev {
            close(Fpath)
            Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
            prev = curr
        }
        {
            print >> Fpath
            # print | "gzip -9 -f >> " Fpath # Throws an error
        } ' "$FILE" && rm "$FILE"
    done
    #gzip -9 -f -r . # This would work, but is it efficient?
    popd > /dev/null
}
### MAIN - Starting Point ###
BASE_FOLDER="_test2"
for dir in $(find "$BASE_FOLDER" -type d);
do
    if [ "$dir" != "$BASE_FOLDER" ]; then
        echo "$dir"
        time Func_Clean "$dir"
    fi
done
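
The error from the commented-out `print | "gzip -9 -f >> " Fpath` line is not shown, so its cause is unknown here. One thing worth ruling out is that awk's `close()` must be given the exact string that opened the pipe, so `close(Fpath)` does not close a pipe opened as `"gzip -9 -f >> " Fpath`. A minimal sketch under that assumption follows; `split_gzip` and the `.gz` suffix are made up for illustration, and the input is assumed to be already cleaned and sorted.

```shell
# split_gzip is a hypothetical helper: it reads sorted lines on stdin and
# appends each 2-character bucket to <bucket>.gz in the current directory.
split_gzip() {
    awk '
    { curr = tolower(substr($0, 1, 2)) }
    NR == 1 || curr != prev {
        if (cmd != "") close(cmd)          # close the pipe by its command string
        name = curr
        gsub(/[^a-z0-9]/, "_", name)       # same naming rule as the script
        cmd = "gzip -9 >> \"" name ".gz\"" # one gzip process per bucket
        prev = curr
    }
    { print | cmd }
    '
}
```

In the pipeline this would take the place of the second awk stage, and a bucket can be read back with `gzip -dc em.gz`. Because it appends with `>>`, re-running it adds another gzip member to the same file, which gzip still decompresses as one stream.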