
I have the code below, which works successfully (kudos to @EdMorton), and is used to parse and clean very large log files, splitting the output into smaller files. The output filename is the first 2 characters of each line. However, if there is a special character among those 2 characters, it needs to be replaced with a '_'. This ensures there is no illegal character in the filename.

Next, it checks whether any of the output files are larger than a certain size; if so, that file is sub-split by the 3rd character.

This takes about 10 minutes to process 1 GB worth of logs (on my laptop). Can this be made faster? Any help will be appreciated.

Sample log file

"email1@foo.com:datahere2     
email2@foo.com:datahere2
email3@foo.com datahere2
email5@foo.com;dtat'ah'ere2 
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ
email3@foo.com:datahere2

Expected Output

# cat em 
email1@foo.com:datahere2     
email2@foo.com:datahere2
email3@foo.com:datahere2
email5@foo.com:dtat'ah'ere2 
email3@foo.com:datahere2

# cat _leftover
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ

Code:

#!/usr/bin/env bash
Func_Clean(){
pushd $1 > /dev/null
    awk '
        {
            gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
            sub(/[,|;: \t]+/, ":")
            if (/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:/ && /^[\x00-\x7F]*$/) {
                print
            }
            else {
                print >> "_leftover"
            }
        } 
    ' * |
    sort -t':' -k1,1 |
    awk '
        { curr = tolower(substr($0,1,2)) }
        curr != prev {
            close(Fpath)
            Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
            prev = curr
        }
        { 
            print >> Fpath
            # print | "gzip -9 -f >> " Fpath  # Throws an error
        } ' && rm *.txt

    find * -type f -prune -size +1000000c \( ! -iname "_leftover" \) |while read FILE; do
    awk '
        { curr = tolower(substr($0,1,3)) }
        curr != prev {
            close(Fpath)
            Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
            prev = curr
        }
        { 
            print >> Fpath
            # print | "gzip -9 -f >> " Fpath   # Throws an error
        } ' "$FILE" && rm "$FILE"
    done

    #gzip -9 -f -r .    # This would work, but is it effecient?
popd > /dev/null
}

### MAIN - Starting Point ###
BASE_FOLDER="_test2"
for dir in $(find $BASE_FOLDER -type d); 
do
    if [ $dir != $BASE_FOLDER ]; then
        echo $dir
        time Func_Clean "$dir"
    fi
done
rogerwhite
  • The problem is you have given up using `awk` and started cobbling pieces of code together to try and make things work. If you go back and look at Ed's answer you will see he calls `awk` once. How many times do you call `awk`, spawning an entire new process each time that must re-read various inputs? `awk` is a highly efficient text processor which can handle all that is needed in a single invocation. Here you pipe `awk` to `sort` to `awk`, then run `find` piping to `while` piping to `awk` again. Try reducing the number of processes you invoke. – David C. Rankin Jun 20 '20 at 02:29
  • I am sure this can be optimized, am not an expert... hence my search for better code – rogerwhite Jun 20 '20 at 02:45
  • Do you need a unique set of output files for the input files in each individual directory or not? – Ed Morton Jun 20 '20 at 03:18
  • Ed: Each sub-folder has logs, from different applications. Either I leave the split output files in the same folder, and hence the folder name will help me identify the source. Another option: I could merge all output files, into one big fat folder, but then I would need to append to each `print` the name of the folder, to identify the source. Either could work. I would prefer the former – rogerwhite Jun 20 '20 at 03:22

2 Answers


Wrt the subject "Make awk efficient (again)" - awk is extremely efficient; you're looking for ways to make your particular awk scripts more efficient and to make your shell script that calls awk more efficient.

The only obvious performance improvements I see are:

  1. Change:
find * -type f -prune -size +1000000c \( ! -iname "_leftover" \) |
while read FILE; do
    awk 'script' "$FILE" && rm "$FILE"
done

to something like (untested):

readarray -d '' files < <(find . -type f -prune -size +1000000c \( ! -iname "_leftover" \) -print0) &&
awk 'script' "${files[@]}" &&
rm -f "${files[@]}"

so you call awk once total instead of once per file.

  2. Call Func_Clean() once total for all files in all directories instead of once per directory.

  3. Use GNU parallel or similar to run Func_Clean() on all directories in parallel (a sketch follows below).
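
For example (untested, and assuming GNU parallel is installed), you could export the function so the shells that parallel spawns can see it, then feed it the sub-directories from your script's $BASE_FOLDER:

# untested sketch: one Func_Clean job per sub-directory, run in parallel
export -f Func_Clean
find "$BASE_FOLDER" -mindepth 1 -type d -print0 |
    parallel -0 Func_Clean {}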

I see you're considering piping the output to gzip to save space; that's fine, but just be aware it will cost you something (idk how much) in execution time. Also, if you do that, you need to close the whole output pipeline, as that is what you're writing to from awk, not just the file at the end of it, so your code would be something like (untested):

    { curr = tolower(substr($0,1,3)) }
    curr != prev {
        close(Fpath)
        Fpath = "gzip -9 -f >> " gensub(/[^[:alnum:]]/,"_","g",curr)
        prev = curr
    }
    { print | Fpath }

This isn't intended to speed things up other than the find suggestion above; it's just a cleanup of the code in your question to reduce redundancy and fix common bugs (UUOC, missing quotes, wrong way to read output of find, incorrect use of >> vs >, etc.). Start with something like this (untested and assuming you do need to separate the output files for each directory):

#!/usr/bin/env bash

clean_in() {
    awk '
        {
            gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
            sub(/[,|;: \t]+/, ":")
            if (/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:/ && /^[\x00-\x7F]*$/) {
                print
            }
            else {
                print > "_leftover"
            }
        } 
    ' "${@:--}"
}

split_out() {
    local n="$1"
    shift
    awk -v n="$n" '
        { curr = tolower(substr($0,1,n)) }
        curr != prev {
            close(Fpath)
            Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
            prev = curr
        }
        { print > Fpath }
    ' "${@:--}"
}

Func_Clean() {
    local dir="$1"
    printf '%s\n' "$dir" >&2
    pushd "$dir" > /dev/null
    clean_in *.txt |
        sort -t':' -k1,1 |
            split_out 2 &&
    rm -f *.txt &&
    readarray -d '' big_files < <(find . -type f -prune -size +1000000c \( ! -iname "_leftover" \) -print0) &&
    split_out 3 "${big_files[@]}" &&
    rm -f "${big_files[@]}"
    popd > /dev/null
}

### MAIN - Starting Point ###
base_folder="_test2"
while IFS= read -r dir; do
    Func_Clean "$dir"
done < <(find "$base_folder" -mindepth 1 -type d)

If I were you I'd start with that (after any necessary testing/debugging) and THEN look for ways to improve the performance.

Ed Morton
  • Thanks Ed. could you check if this is correct `find "$base_folder" -mindepth 1 -type d -exec Func_Clean "$dir" {} \;`? I get `find: ‘Func_Clean’: No such file or directory`.. am sure this is a newbie ques – rogerwhite Jun 21 '20 at 02:28
  • Ed: Secondly, even if I work thru this issue, the script seems to hang. I have two very small sub-folders within `base_folder` (called d and e); and I see `$ ./stack2.sh \n _test2/d \n _test2/d` – rogerwhite Jun 21 '20 at 02:47
  • Ah, right, you need a shell to call the function so `export -f Func_Clean; find . -type d -exec bash -c 'Func_Clean {}' \;` or `find . -type d | xargs -I {} bash -c 'Func_Clean {}'` with all of the other `find` args as in the script. idk why it'd hang with that fixed and idk what you mean by `I see ...`. Oh hang on - get rid of `"$dir"` from the find line. I'll tweak the find line now. I just put it back to a loop, it's simpler and should be fine. – Ed Morton Jun 21 '20 at 12:08
  • I also updated the other 2 find commands to print the NUL that readarray requires after each file name. – Ed Morton Jun 21 '20 at 12:16
  • Thanks Ed. the code works...!! (and faster) I must admit I have not been able to decipher it... like what does this mean `"${@:--}"` – rogerwhite Jun 22 '20 at 00:37
  • It means use the arguments (file name(s)) that were passed in or stdin (-) otherwise. It's how you write shell functions or scripts to read from a file if called with a file argument or stdin (including from a pipe) otherwise (see the short sketch after these comments). – Ed Morton Jun 22 '20 at 02:38
  • Thanks Ed. let me pls read more about this. Could you also pls recheck if `prev = curr` is the most efficient way? `prev = Fpath` and `Fpath != prev` could be better. No? – rogerwhite Jun 22 '20 at 03:27
  • It doesn't make any difference, it's a string comparison no matter which 2 strings are being compared. The only thing that WOULD matter is if you tried to get rid of curr because then you'd have to do the gensub to set Fpath for every line instead of just lines where the first 2 or 3 significant chars differ and that would slow the script down. – Ed Morton Jun 22 '20 at 04:11
  • Ed: most welcome.. the least I could do. Question: is there any way of speeding this up further? the code as of now, would create files with 2 chars, and then decide if it needs to be cut to the 3rd char. Is it possible to count num of rows, and then in one go decide if the files should be cut in 2 or 3 char.. just a thought. the code is great! – rogerwhite Jun 22 '20 at 06:10
  • @rogerwhite maybe using GNU parallel would speed it up as I've mentioned a couple of times now, but I've never used it. You can't count the number of rows to be output until after you do the substr($0,1,2) to determine which rows will be output for a particular 2-char key so then you either have to do that twice (once to figure out if 2-char will be too long and then again to actually do the 2-or-3 char cut) or buffer all of the associated lines to print if 2-char is OK or discard if 3-char is required. – Ed Morton Jun 22 '20 at 11:58
  • You're **already doing** an implementation of the 2nd thing, but saving the 2-char lines to a file instead of concatenating them internally, then using `find` to check size instead of counting rows, and as I mentioned previously, printing isn't noticeably slower than concatenating, so what you're proposing wouldn't speed things up. – Ed Morton Jun 22 '20 at 12:01
  • Got it thanks. So I need to explore how to use parallel. I will try :| Thanks again for the help – rogerwhite Jun 22 '20 at 12:08
  • sorry to bother you again on this. the code is really slow. It gets incrementally slow with large files. I guess because the script reads the file in memory, and then sorts it. I have not been able to figure out the parallel execution. would request you to help..! – rogerwhite Jul 02 '20 at 11:18
  • I can't help with that, sorry, I've never used parallel. The only part of the shell script I posted that handles the whole file at once is `sort` and it uses demand paging, etc. to handle huge files rather than trying to store the whole thing in memory. Having the input sorted by the key values is the big performance improvement over having to open and close output files every time the key value changes. – Ed Morton Jul 02 '20 at 11:22
  • sorry to cycle back to this old topic. there are 2 issues. [1] there is a bug in the code: the code works perfectly for the first sub-folder in the `base_folder`, but for the second the script will hang. [2] the sort command takes exponentially more time if the log files are huge. Hence, I am thinking: would it be more efficient to run the sort after all the files are split? Meaning, instead of sorting one big file, sort at the end when there are thousands of small files – rogerwhite Jul 26 '20 at 04:00
  • @ Ed Morton: Do pls help (for some reason the above comment doesn't show my tag to your name) – rogerwhite Jul 26 '20 at 04:41
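
To illustrate the `"${@:--}"` idiom from the comments above: it expands to the function's arguments if any were passed, otherwise to `-`, which awk (like most filters) treats as stdin. A hypothetical sketch (the function name and file names are made up for illustration, not part of the scripts above):

# hypothetical demo of "${@:--}": read the named files if given, else stdin
to_upper() {
    awk '{ print toupper($0) }' "${@:--}"
}

to_upper file1.txt file2.txt    # arguments given: awk reads the two files
printf 'hello\n' | to_upper     # no arguments: "-" makes awk read stdin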

You are making things harder on yourself than they need to be. To separate your log into the file em with the sanitized addresses and put the rest in _leftover, you simply need to identify the lines matching /email[0-9]+@/ and then apply whatever sanitizations you need (e.g. remove anything before "email[0-9]+@", convert any included ';' to ':', add more as needed). You then simply redirect the sanitized lines to em and skip to the next record.

    /email[0-9]+@/ {
        $0 = substr($0,match($0,/email[0-9]+@/))
        gsub(/;/,":")
        # add any additional sanitizations here
        print > "em"
        next
    } 

The next rule simply collects the remainder of the lines in an array.

    {a[++n] = $0}

The final rule (the END rule), just loops over the array redirecting the contents to _leftover.

    END {
        for (i=1; i<=n; i++)
            print a[i] > "_leftover"
    }

Simply combine your rules into the final script. For example:

awk '
    /email[0-9]+@/ {
        $0 = substr($0,match($0,/email[0-9]+@/))
        gsub(/;/,":")
        # add any additional sanitizations here
        print > "em"
        next
    } 
    {a[++n] = $0}
    END {
        for (i=1; i<=n; i++)
            print a[i] > "_leftover"
    }
' file

When working with awk, keep in mind that it will read each line (record) and then apply each rule you have written, in order, to that record. So you simply write and order the rules you need to manipulate the text in each line.

You can use next to skip to the next record, which helps control the logic between rules (along with all the other conditionals, e.g. if, else, ...). The GNU awk manual is a good reference to keep handy as you learn awk.
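
As a minimal, hypothetical illustration of rule ordering and next (made-up rules, not part of the log-cleaning script): each line is tested against the rules top to bottom, and next stops processing the current line so later rules never see it.

awk '
    /^#/    { next }                  # comment lines: skip entirely
    /ERROR/ { print "!! " $0; next }  # flag error lines, then skip the default rule
            { print "   " $0 }        # default rule for everything else
' logfile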

Example Use/Output

With your input in file you would receive the following in em and _leftover:

$ cat em
email1@foo.com:datahere2
email2@foo.com:datahere2
email3@foo.com datahere2
email5@foo.com:dtat'ah'ere2
email3@foo.com:datahere2

$ cat _leftover
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ

As noted, this script simply trims anything before email...@ and replaces all ';' with ':' -- you will need to add any additional clean-ups you need where indicated.
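
If you swap in the more general e-mail check from your question in place of the /email[0-9]+@/ pattern, an untested sketch keeping the same rule layout could look like this (it reuses the trim/normalize and ASCII tests from your first awk script; adjust as needed):

awk '
    {
        # trim surrounding quotes/whitespace and normalize the first
        # separator to ":" (taken from the first awk script in the question)
        gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
        sub(/[,|;: \t]+/, ":")
    }
    # keep lines that look like email:data and contain only ASCII
    /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:/ && /^[\x00-\x7F]*$/ {
        print > "em"
        next
    }
    # everything else is buffered and written to _leftover at the end
    { a[++n] = $0 }
    END {
        for (i=1; i<=n; i++)
            print a[i] > "_leftover"
    }
' file

The ordering matters here: the clean-up rule has no pattern so it runs on every record first, and both the e-mail test and the catch-all then see the normalized line.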

David C. Rankin
  • Thanks David. But it would be an incorrect assumption that all email addresses in the input file, start with 'email' or ends with 'foo.com'. The email could be anything – rogerwhite Jun 20 '20 at 06:08
  • Yes, that I figured. What you will need to do is search for (or create) a general e-mail matching regex to pick the lines that contain `'@'` and `.com` or `.org`, etc. You will then eliminate the characters that are not part of the current email RFC 5322. That will take a bit of thought if you can have additional invalid characters in your log. Have a look at [How to validate an email address using a regular expression?](https://stackoverflow.com/q/201323/3422102) for that. This was intended to show you how to approach it. It will take anyone time to match your log data. – David C. Rankin Jun 20 '20 at 06:14
  • You may need several additional rules to replace the `email[0-9]+@` rule shown in the example. – David C. Rankin Jun 20 '20 at 06:18
  • Thx David. I do not wish to go into some complicated testing. hence I am using: `if (/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:/ && /^[\x00-\x7F]*$/)`. this does a basic check on the email address, and ensures there are no non-ascii characters in the line – rogerwhite Jun 20 '20 at 06:18
  • Yes, that is fine -- you can tailor it to meet your needs. The link I provided shows examples of several different levels of detail for email matching, from as simple as `/^\S+@\S+\.\S+$/` to a full-blown RFC match. Recall that if you can have garbage chars before the email, you can't use the circumflex (`'^'`) to anchor your match to the beginning -- initially. – David C. Rankin Jun 20 '20 at 06:20
  • Thanks David. But I am still unsure how to use your code to meet my mission – rogerwhite Jun 20 '20 at 06:28
  • The rule arrangement will be the same. The first rule identifies those records that contain potential emails. Use your choice of regex to grab those records. Now since you may have invalid characters at the start, you will want to `match()` the beginning of your email address and then use `substr` to trim any bad chars from the front. The next few lines in the rule will eliminate any other invalid characters prior to the `':'` separating the email address from `datahere2`. It is just a matter of adding the checks needed for you logs. Post 20 or so actual lines and I will look in the morning. – David C. Rankin Jun 20 '20 at 07:01
  • @DavidC.Rankin the reason the existing code is doing `awk | sort | awk` is that if the input isn't sorted before the final `awk` splits it into output files, then it'd be opening/closing output files line by line or relying on GNU awk to manage many output files simultaneously, either of which is time consuming, and it can't sort the input until after all undesirable characters have been removed/converted and invalid input has been removed. – Ed Morton Jun 20 '20 at 16:52