Editing large files efficiently

Question

I have some large log files that use the old syslog date format from RFC3164 (MMM dd HH:mm:ss), and I want to convert them to the new syslog date format from RFC5424 (YYYY-mm-ddTHH:mm:ss +TMZ). I have created the following bash script:

#!/bin/bash

#Loop over directories
for i in $1
do
    echo "Processing directory $i"
    if [ -d $i ]
    then
        cd $i
        #Loop over log files inside the directory
        for j in *.2021
        do
            echo "Processing file $j"
            #Read line by line and perform transformation on dates and append to new file
            cat $j | \
                while read CMD; do
                    tmpdate=$(printf '%s\n' "$CMD" | awk -F" $i" 'BEGIN {ORS=""}; {print $1}')
                    newdate=$(date +'%Y-%m-%dT%H:%M:%S+02:00' -d "$tmpdate")

                    printf '%s\n' "$CMD" | sed 's/'"$tmpdate"'/'"$newdate"'/g' >> $j.new
                done
            mv $j.new $j
        done
        cd ..
    fi
done

But this is taking a looooong time to execute since I have files with several million lines (logs dating back over one year on a mail server for example). So far this has been running for days and still a lot of lines to parse :-)

So two questions.

Why is this script taking such a long time to execute?
Is there a faster way to do this? Using one of GNU utils (sed, awk etc), bash or python.

======== EDIT =======

Here are examples of the old format:

Feb  1 21:59:44 calendar os-prober: debug: running /usr/lib/os-probes/50mounted-tests on /dev/sda2
Feb  1 21:59:44 calendar 50mounted-tests: debug: /dev/sda2 type not recognised; skipping
Feb  1 21:59:44 calendar os-prober: debug: os detected by /usr/lib/os-probes/50mounted-tests

Note that there are 2 spaces between Feb and 1; if the day is 10 or higher there is only 1 space, as in

Feb 10 10:39:53 calendar os-prober: debug: running /usr/lib/os-probes/50mounted-tests on /dev/sda2

In the new format it would look like this:

2021-02-01T21:59:44+02:00 calendar os-prober: debug: running /usr/lib/os-probes/50mounted-tests on /dev/sda2
2021-02-01T21:59:44+02:00 calendar 50mounted-tests: debug: /dev/sda2 type not recognised; skipping
2021-02-01T21:59:44+02:00 calendar os-prober: debug: os detected by /usr/lib/os-probes/50mounted-tests

TIA.

You probably want `for i in "$@"` rather than `for i in $1` - by definition, `$1` can only contain a single item. — tripleee, Aug 24 '21 at 09:13
Well, technically the unquoted `$1` undergoes word splitting and glob expansions, so `for i in $1` would *work* (or *break*, depending on your view) if you called `script.sh "dir1 dir2 dir3"` or `script.sh "*"`. Yeah, but `for i in "$@"; do` or just `for i; do` would be the sane way to handle multiple arguments. — Socowi, Aug 24 '21 at 09:47
Ah, yes, the `$1` is there because I was feeding the script one file at a time. I have to do this on my laptop because of $things, and since it's taking so long to parse some files and I need to power off my laptop when I finish work, I did this while trying to figure out a better way. — proxymoxy, Aug 24 '21 at 10:21

score 2 · Answer 1 · answered Aug 24 '21 at 09:05

You are rewriting the entire file with sed as many times as you have lines in the file. This is a huge but unfortunately fairly common beginner antipattern.

The pipeline to create the sed command is also quite overcomplicated and inefficient.

You don't really need date to convert between date formats when the result will contain exactly the same information in a different order. Try something like

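awk -v yyyy="$(date +%Y)" 'BEGIN {
    # Map month abbreviations to their numbers: m["Jan"] = 1, ...
    split("Jan:Feb:Mar:Apr:May:Jun:Jul:Aug:Sep:Oct:Nov:Dec", _m, ":");
    for(i=1; i<=12; ++i) m[_m[i]] = i }
# The old timestamp occupies the first 16 characters, including the trailing space
{ printf "%04i-%02i-%02iT%s+02:00 %s\n",
    yyyy, m[$1], $2, $3, substr($0, 17) }' "$j" >"$j.new"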

Demo: https://ideone.com/VBDqB8
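To tie this back to the loop in the question, here is a rough sketch of how the awk call might be wrapped so that each file costs one awk process instead of several processes per line. The function name `convert_log`, the `.bak` backup, and the hard-coded year and `+02:00` offset are my assumptions, not part of the answer above. It also assumes every line starts with an old-style timestamp.

```shell
#!/bin/bash
# Hypothetical helper: convert one log file, keeping the original as FILE.bak.
# Assumes every line starts with "MMM dd HH:mm:ss " and all entries are from $2.
convert_log() {
    local file=$1 year=$2
    awk -v yyyy="$year" 'BEGIN {
        split("Jan:Feb:Mar:Apr:May:Jun:Jul:Aug:Sep:Oct:Nov:Dec", names, ":")
        for (i = 1; i <= 12; ++i) m[names[i]] = i }
    { printf "%04d-%02d-%02dT%s+02:00 %s\n",
        yyyy, m[$1], $2, $3, substr($0, 17) }' "$file" >"$file.new" &&
    mv -- "$file" "$file.bak" &&
    mv -- "$file.new" "$file"
}

# Each directory given on the command line is scanned for *.2021 files.
for dir in "$@"; do
    for f in "$dir"/*.2021; do
        if [ -f "$f" ]; then
            convert_log "$f" 2021
        fi
    done
done
```

Passing the year explicitly avoids stamping old logs with the current year if the script is run later.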

tripleee

Thanks for this, it works perfectly. Just one correction: `sed` never touches the file, it gets a pipe from `stdin` and pipes it out to the end of a new file (or is it perhaps the appending to the new file that you are referring to?) – proxymoxy Aug 24 '21 at 09:33
Yeah, you are reading the whole file as many times as there are input lines from the [(useless!) `cat`](https://stackoverflow.com/questions/11710552/useless-use-of-cat). Related: https://stackoverflow.com/questions/65538947/counting-lines-or-enumerating-line-numbers-so-i-can-loop-over-them-why-is-this – tripleee Aug 24 '21 at 09:46

Socowi · Accepted Answer · 2021-08-24T10:10:07.347

Why is this script taking such a long time to execute?

Bash is a scripting language and intended to run other programs. Therefore, bash itself as a language isn't very fast. But it gets even worse if you repeatedly start other processes. Starting a process is very costly. Every time you execute something like sed, awk, date, or even just $(...) or ... | ... you start a process. In a loop, this adds up.

Compare `time for ((i=0; i<1000; ++i)); do true; done` vs. `time for ((i=0; i<1000; ++i)); do /bin/true; done`. The former uses bash's built-in command and therefore does not start other processes; it finishes immediately. The latter uses an external program and therefore repeatedly starts a process; it takes 4.5 seconds on my system.
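To make the same point with the actual task, here is a minimal sketch of the per-line conversion done with bash builtins only, so no process is started per line. The function name `convert_line`, the `months` lookup string, and the fixed year and offset are illustrative assumptions.

```shell
#!/bin/bash
# Month lookup without external tools: the position of the abbreviation
# inside this string, divided by 3, gives the month number minus 1.
months=JanFebMarAprMayJunJulAugSepOctNovDec

convert_line() {
    local mon day clock rest prefix
    # read splits on runs of whitespace, so "Feb  1" and "Feb 10" both parse.
    # Trailing whitespace at the end of the line is dropped.
    read -r mon day clock rest <<<"$1"
    prefix=${months%%"$mon"*}   # everything before the month name
    printf '%04d-%02d-%02dT%s+02:00 %s\n' \
        2021 $(( ${#prefix} / 3 + 1 )) "$day" "$clock" "$rest"
}
```

It could be driven with `while IFS= read -r line; do convert_line "$line"; done <"$j" >"$j.new"`. This removes the process startup cost, but since bash itself is not fast, a single awk or sed pass over the file is likely still the better choice for millions of lines.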

Is there a faster way to do this? Using one of GNU utils (sed, awk etc), bash or python.

Yes. If you rewrite your script in Python it will run tremendously faster, assuming you use Python's built-in functions instead of repeatedly calling `sp = subprocess.run(["date", ...], stdout=subprocess.PIPE)` and `newDate = sp.stdout` and so on :)
(When writing it that way, you would immediately notice that this cannot be efficient. bash makes it so easy to run other programs that you often forget all the work that is done behind the scenes.)

But since you tagged your question as bash, let's stick to a script solution.

The transformation of MMM to MM (e.g. Jan to 01) is a bit tricky for sed. We have to use a separate replacement for each month. Luckily, the month is always at the beginning, so we can replace it separately from the rest of the date.
To add a leading zero to single-digit days, we use an additional replacement.

sed -i.bak -E -e's/^Jan/01/;s/^Feb/02/;s/^Mar/03/;...' \
  -e's/^(..)  /\1 0/' \
  -e's/^([0-9]+)  ?([0-9]+) ([0-9]+:[0-9]+:[0-9]+)/2021-\1-\2T\3+02:00/' */*.2021

The first expression can be automatically generated:

monthNameToNumber=$(
   printf %s\\n Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec |
   awk '{printf "s/^%s/%02d/;", $0, NR}'
)
sed -i.bak -E -e"$monthNameToNumber" \
  -e's/^(..)  /\1 0/' \
  -e's/^([0-9]+)  ?([0-9]+) ([0-9]+:[0-9]+:[0-9]+)/2021-\1-\2T\3+02:00/' */*.2021

This replaces all dates at the start of your log lines, in all log files one directory under the current one. The logs will be modified in-place. A backup of each log is created with the suffix .bak.
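Before modifying real logs in place, it may be worth trying the expressions on a few sample lines first. The sketch below feeds lines through the same three expressions on stdin, without `-i`, so no file is touched. The wrapper name `convert_stream` and the sample lines are made up for illustration.

```shell
#!/bin/bash
# Build "s/^Jan/01/;s/^Feb/02/;..." exactly as in the answer above.
monthNameToNumber=$(
   printf %s\\n Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec |
   awk '{printf "s/^%s/%02d/;", $0, NR}'
)

# Hypothetical wrapper: same expressions, reading stdin and writing stdout.
convert_stream() {
    sed -E -e "$monthNameToNumber" \
      -e 's/^(..)  /\1 0/' \
      -e 's/^([0-9]+)  ?([0-9]+) ([0-9]+:[0-9]+:[0-9]+)/2021-\1-\2T\3+02:00/'
}

# The message parts contain date-like text that must stay untouched.
printf '%s\n' \
  'Feb  1 21:59:44 calendar os-prober: something 1 something else' \
  'Feb 10 10:39:53 calendar os-prober: debug: Jan  5 in the message' |
  convert_stream
```

Because every expression is anchored with `^`, only the leading timestamp should change; `something 1 something else` and `Jan  5` inside the messages come through as they were.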

The `-E` option to `sed` is not portable. The OP tagged this [tag:linux] so it will probably work for them (though it's not uncommon to see beginners tag Mac and Windows questions incorrectly as Linux, for whatever reason) but it won't work e.g. on a Mac out of the box - though `sed -r` would work much the same there. — tripleee, Aug 24 '21 at 09:19
@tripleee And neither is `-i`. But these days, every `sed` implementation I know supports `-E` and `-i suffix`. In fact `-E` came from BSD while GNU used `-r` and added the `-E` as a synonym later (in the [commit message](http://git.savannah.gnu.org/gitweb/?p=sed.git;a=commit;h=8b65e07904384b529a464c89f3739d2e7e4d5135), the author even claimed that `-E` was added to POSIX, but I couldn't find it there). Since macOS uses `sed` from BSD, [macOS supports `-E` and `-i suffix`](https://www.unix.com/man-page/osx/1/sed/). — Socowi, Aug 24 '21 at 09:39
This looks more promising, one question though. Is there any danger that, if the message part of the log (that is, the text after the time stamp) contains something like `something 1 something else`, the new regex would replace that? — proxymoxy, Aug 24 '21 at 09:39
@proxymoxy No, at least not if every line starts with a date. I used `^` (beginning of the line) to prevent that. `s/^(..)  /\1 0/` only processes the first four characters of each line. — Socowi, Aug 24 '21 at 09:42
@socowi Excellent, this is then just what the doctor ordered :-) — proxymoxy, Aug 24 '21 at 09:46
