
Does anyone know a faster way to do this? The script is currently slow when a high number of lines per second is pushed to it:


#!/bin/bash

declare -A clientarray
file=$1
timer=$2
e=$(date --date "now +$timer second" +%s)

while read line
do
    if [ -n "${clientarray[$line]}" ]; then
        let "clientarray[$line]=clientarray[$line]+1"
        echo "$line: ${clientarray[$line]}"
    elif [ -z "${clientarray[$line]}" ]; then
        clientarray[$line]=1
        echo "$line: ${clientarray[$line]}"
    fi

    if [ $(date +%s) -gt $e ]; then
        e=$(date --date "now +$timer second" +%s)
    fi
done < <(tail -F $file | gawk -F"]" '/]/ {print $1}')

Here is an example of the lines:

someline]
someline2]
somethingidontwant
someline3]
somethingelseidontwant
someline4]

and to call the script:

bash script.sh somelogfile.log 1

If I comment out the if logic at the very end it goes really fast, but with it the throughput drops by about two thirds. I tested it with pv:

(this is with the if logic):

ubuntu@myhost:~/graphs$ tail -F somelogfile.log | pv -N RAW -lc >/dev/null | 
                      > bash script.sh somelogfile.log 1 | pv -N SCP -lc >/dev/null

  RAW: 2.18k 0:00:16 [ 493/s ] [                 <=>                             ]
  SCP:  593 0:00:16 [ 150/s ] [             <=>                                  ]

(this is without)

ubuntu@myhost:~/graphs$ tail -F somelogfile.log | pv -N RAW -lc >/dev/null |
                      > bash script.sh somelogfile.log 1 | pv -N SCP -lc >/dev/null

  RAW: 7.69k 0:00:15 [512/s] [                                     <=>           ]
  SCP:  7.6k 0:00:15 [503/s] [                              <=>                  ]

Let me know if I am missing something on my script or testing side, especially any "DOH!"'s. I think at this point I would love one =)

patch
  • You are reading `$line` but not using it in the code shown. Is that an artefact of stripping the code down to a minimal reproduction? Similarly, you aren't doing anything with `$e` except updating it periodically. The fact that you have to run an external command (`date`) on each iteration will always make the process slower than when you have only internal commands to execute. Ultimately, you may be better off using Python or Perl or something similar; it can avoid the new process overhead while doing the date calculations more simply in the first place. – Jonathan Leffler Apr 26 '12 at 18:46
  • Fixed it, sorry about that. I generally use `i` in place of `line`; I only put `line` because that is the common example used for `while read` loops, and I missed changing the body of the script. – patch Apr 27 '12 at 14:01
  • It's a good idea to practice writing code so you won't want to modify it before you post it in public. Then you don't get into problems like that. – Jonathan Leffler Apr 27 '12 at 14:26
  • I agree, but it's more of a lazy thing than anything... `i` is 3 letters shorter than `line` :) – patch Apr 27 '12 at 14:39
  • The alternative is not to be worried about your coding standards and just publish with `i`. I wouldn't have commented on it. Modifying your code before publishing it is the problem; it's hard to get the changes right consistently if you've not run the modified code. (I know; I make the mistake sometimes too. We're all human. Well, all except the cyborgs lurking on the 'net!) – Jonathan Leffler Apr 27 '12 at 14:46
  • I completely agree, except I work with sensitive info I can't post, so I always have to go over it or be worried about losing my job if I accidentally post the wrong thing, like a client name or a specific piece of code. – patch Apr 27 '12 at 15:04
  • Run `date -f - +%s` as a *background process*, then interact with it using `echo` and `read`! See [this answer](https://stackoverflow.com/a/49195703/1765658)! Try it and comment! – F. Hauri - Give Up GitHub Aug 03 '19 at 16:37

2 Answers


As a guess, I'd say the last if...fi block is the culprit: it adds an external (non-builtin) command to every iteration. Everything else in the loop is a bash builtin, which executes much faster. With the block in place, date runs inside the test on every line, and again in the body of the if whenever the deadline has passed; each call forks a new process. In addition, date --date has to parse and evaluate the "now +$timer second" expression each time it's called, which probably isn't very speedy, given --date's generality. If I were you, I'd try reimplementing this in a scripting language with more native handling of dates/times: Perl, Ruby, Python, whatever you're comfortable with.

You also appear to have a bug:

if [ `date +%s` > $e ] ...

This says: execute the command date +%s and interpolate its output (say 12345) into another command [ 12345 > $e ] (so far so good). That command says: run the [ builtin with two arguments (12345 and ]), and redirect its standard output stream to a file named by the value of $e (uh-oh). You probably want to use -gt instead of > here.
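The redirection pitfall described above is easy to reproduce in isolation. The following is a minimal sketch, not the asker's script: the values of `now` and `e` are hypothetical stand-ins for the two timestamps, and the scratch directory exists only so that the stray file lands somewhere harmless.

```bash
#!/bin/bash
# Sketch: inside [ ... ], an unquoted > is a redirection, not a comparison.

tmpdir=$(mktemp -d)          # scratch directory for the stray file
cd "$tmpdir" || exit 1

now=5                        # hypothetical "current time"
e=9                          # hypothetical deadline that has NOT passed yet

# Runs `[ 5 ]`, which is true because "5" is a non-empty string, and
# redirects its (empty) output to a file named "9".
if [ $now > $e ]; then redirect_result=true; else redirect_result=false; fi

# Numeric comparison: 5 is not greater than 9, so this is false.
if [ "$now" -gt "$e" ]; then gt_result=true; else gt_result=false; fi

# Record whether the redirection really created a file named after $e.
if [ -e "./$e" ]; then file_created=yes; else file_created=no; fi

echo "redirect form: $redirect_result (file created: $file_created)"
echo "-gt form:      $gt_result"

cd / && rm -rf "$tmpdir"     # clean up the scratch directory
```

With the `>` form the test succeeds regardless of the two values, so in the original loop the deadline would have been reset on every line.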

wdebeaum
  • Or use `[[ $(date +%s) > $e ]]` (where the double brackets are the significant part, though avoiding backquotes and using `$(...)` is also good advice, doubly so when writing shell script in SO comments). – Jonathan Leffler Apr 26 '12 at 18:48
  • ...except that in that case `>` sorts lexicographically instead of numerically. So if the number of seconds since the epoch gains a digit while you're running this program (admittedly very unlikely) you're hosed ;-) – wdebeaum Apr 26 '12 at 18:55
  • Might using `time`, which is a builtin, instead of `date` help? I don't really understand what the script does, but it seems to do some sort of timing. :) – user unknown Apr 26 '12 at 18:55
  • `time` doesn't give you what time it is, but rather how much time a command takes to execute (in various senses). For example, `time sleep 1` should tell you that `sleep 1` takes about 1 second of real time to complete, and almost no system or user-mode time. – wdebeaum Apr 26 '12 at 19:01
  • Yes, my mistake: I actually meant `-gt`, not `>`. I also always use `$()`; I was just being lazy :)! I've made the changes to the question. The script goes through a log file and counts a specific filtered item. The array is there because a few of them occur within the same period of time. It's really a bummer that this can't be done in bash. Which language would give the fastest results for this sort of thing? – patch Apr 27 '12 at 13:58
  • I thought there was some way to get the machine time quickly within bash. The type of time does not matter, as long as it is 1 second later to compare against. – patch Apr 27 '12 at 14:06
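The lexicographic caveat raised in the comments about `[[ ... > ... ]]` can be checked directly. This is a small sketch with made-up values chosen so that the numerically larger number has one more digit, which is the "gains a digit" case mentioned above.

```bash
#!/bin/bash
# Sketch: [[ a > b ]] compares strings; (( a > b )) compares integers.

a=1000000000   # ten digits (hypothetical later timestamp)
b=999999999    # nine digits, numerically smaller

# String comparison: "1..." sorts before "9...", so this reports a <= b.
if [[ $a > $b ]]; then string_cmp="a > b"; else string_cmp="a <= b"; fi

# Arithmetic comparison: evaluates both sides as integers.
if (( a > b )); then numeric_cmp="a > b"; else numeric_cmp="a <= b"; fi

echo "string compare:  $string_cmp"
echo "numeric compare: $numeric_cmp"
```

For timestamps, `(( ... ))` or `[ ... -gt ... ]` is therefore the safer choice; both are evaluated inside the shell, so neither adds a process.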

I'm not sure what you are doing with $e, but you can print the current date using the shell builtin printf (the %(...)T format needs bash 4.2 or later) much faster than you can by calling out to date. Subprocess calls tend to be expensive. For example, if you are not on glibc2 you can do:

printf '%(%+)T\n' -1

to get exactly the output of the date command. %+ is not supported on glibc2 so you can construct something identical with other parameters, or something similar with:

printf '%(%c %Z)T\n' -1

If you need to capture and process the date, printf -v can assign the result to a variable directly, which avoids even the $() subshell; if you do fall back to $(), there's a decent chance it's still faster than date.
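Applied to the counting loop from the question, the builtin approach might look like the sketch below. It is an illustration under assumptions rather than a drop-in replacement: the helper names `now_epoch` and `count_line` are hypothetical, the input comes from a here-document instead of `tail -F`, and it relies on bash 4.2 or later plus a C library whose `strftime` understands `%s` (glibc and the BSDs do).

```bash
#!/bin/bash
# Sketch: per-line counting with a deadline check that forks no processes.

declare -A clientarray
timer=1   # hypothetical window length in seconds

# Store the current epoch seconds in the variable named by $1.
# printf -v assigns directly, so no $() subshell is needed.
now_epoch() {
    printf -v "$1" '%(%s)T' -1
}

now_epoch now
e=$(( now + timer ))

# Count one line and roll the deadline forward once it has passed.
count_line() {
    local line=$1
    clientarray[$line]=$(( ${clientarray[$line]:-0} + 1 ))
    now_epoch now
    if (( now > e )); then
        e=$(( now + timer ))
    fi
}

# Demo input; in the real script this would be the tail | gawk pipeline.
while IFS= read -r line; do
    count_line "$line"
done <<'EOF'
someline
someline2
someline
EOF

for key in "${!clientarray[@]}"; do
    echo "$key: ${clientarray[$key]}"
done
```

On bash 5.0 or newer, the `$EPOCHSECONDS` variable gives the same value without calling printf at all, so `now_epoch` could be reduced to a plain assignment there.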

Sam Brightman