I have a large terminal output from a tshark filter and I want to check whether the number of lines (the number of packets in this example) reaches a threshold of X.

The operation is done in a loop over many big files, so I want to boost performance as much as possible here.

What I think I know is that wc -l is the fastest way to count the lines of output from a terminal command.

My line looks like this (the exact tshark command does not matter here, so I replaced it for readability):

THRESHOLD=100
[[ $(tshark -r "$file" -Y "tcp.stream==${streamID}" | wc -l) -gt $THRESHOLD ]] || echo "not enough"

While this works fine, I wonder if there is a way to stop after the threshold is reached. The exact number does not matter, as long as I know whether it reaches the threshold or not.

A guess would be:

HEAD=$((THRESHOLD+1))
[[ $(tshark -r "$file" -Y "tcp.stream==${streamID}" | head -n $HEAD | wc -l) -gt $THRESHOLD ]] || echo "not enough"

But piping through an additional process and incrementing the threshold could be slower, couldn't it?

EDIT: Changing the example code to a working tshark snippet

Michael P
  • Have you actually tried timing it? I don't think I can simulate your platform – Mad Physicist May 15 '20 at 07:44
  • You are thinking correctly. Avoid spawning additional processes. Your problem is a chicken-or-the-egg problem. You have to read at least `$THRESHOLD` lines (or reach `EOF`) before you have a valid comparison. Since you are piping `tshark`'s output, that process will complete before being passed to `wc` or `head`. Unless there is a difference of hundreds of thousands of lines or more between `$THRESHOLD` and the file size, I don't know that you save any time between `wc` and `head`. You would just have to time a worst case and see. – David C. Rankin May 15 '20 at 07:44
  • In my scenario it is hard to isolate and time just this step (it requires a bit more coding), so my idea was to think about it first before possibly working for nothing. Thanks David for the answer... – Michael P May 15 '20 at 07:55
  • You could pipe the output of `tshark` into a program which not only verifies it, as you described, but then **closes its stdin**. The writer (`tshark`) should then abort with a _broken pipe_. – user1934428 May 15 '20 at 08:36
  • Sounds interesting and could be exactly what I have been looking for. Could you post an example as an answer? Is there maybe a more friendly way to break the pipe :) ? – Michael P May 15 '20 at 08:43
  • @user1934428 `head` closes stdin already. – Socowi May 15 '20 at 10:19
  • @MichaelP Your example code looks odd. `$($(tsharks ...) | ...` executes the output (!) of `tsharks`. Did you mean `$(tsharks ... | ...` instead? The same goes for `$(THRESHOLD+1)`. Did you mean `$((THRESHOLD+1))`? – Socowi May 15 '20 at 10:22

2 Answers

3

Benchmark

Only one way to find out: Benchmark it yourself. Here are some implementations that come to mind.

gen() { seq "$max"; }
# functions returning 0 (success) iff `gen` prints less than `$thold` lines
a() { [ "$(gen | head -n"$thold" | wc -l)" != "$thold" ]; }
b() { [ -z "$(gen | tail -n+"$thold" | head -c1)" ]; }
c() { [ "$(gen | grep -cm"$thold" ^)" != "$thold" ]; }
d() { [ "$(gen | grep -Fcm"$thold" '')" != "$thold" ]; }
e() { gen | awk "NR >= $thold{exit 1}"; }
f() { gen | awk -F^ "NR >= $thold{exit 1}"; }
g() { gen | sed -n "$thold"q1; }
h() { mapfile -n1 -s"$thold" < <(gen); [ -z "$MAPFILE" ]; }

max=1000000000    # 1e9
for fn in {a..h}; do
  printf '%s: ' "$fn"
  for ((thold=1000000; thold<=max; thold*=10)); do
    printf '%.0e=%2.1fs, ' "$thold" "$({ time -p "$fn"; } 2>&1 | grep -Eom1 '[0-9.]+')"
  done
  echo
done

In the script above, gen is a placeholder for your actual command, i.e. the tshark invocation whose output lines you want to count. The functions a to h succeed (return 0) if tshark's output has fewer than $thold lines. You can use them like

a && echo "tshark printed fewer than $thold lines"

Results

These are the results on my system:

a: 1e+06=0.0s, 1e+07=0.1s, 1e+08=0.8s, 1e+09=8.9s,
b: 1e+06=0.0s, 1e+07=0.1s, 1e+08=0.9s, 1e+09=8.4s,
c: 1e+06=0.0s, 1e+07=0.2s, 1e+08=1.6s, 1e+09=16.1s,
d: 1e+06=0.0s, 1e+07=0.2s, 1e+08=1.6s, 1e+09=15.7s,
e: 1e+06=0.1s, 1e+07=0.8s, 1e+08=8.2s, 1e+09=83.2s,
f: 1e+06=0.1s, 1e+07=0.8s, 1e+08=8.2s, 1e+09=84.6s,
g: 1e+06=0.0s, 1e+07=0.3s, 1e+08=3.0s, 1e+09=31.6s,
h: 1e+06=7.7s, 1e+07=90.0s, ... (manually aborted)

b: ... 1e+08=0.9s ... means that approach b took 0.9 seconds to find out that the output of seq 1000000000 had at least 1e+08 (= 100'000'000) lines.

Conclusion

Of the approaches presented in this answer, b is clearly the fastest. However, the actual results might differ from system to system (there are different implementations and versions of head, grep, ...) and for your actual use case. I recommend benchmarking with your actual data, that is, replace the seq in gen() with your tshark command and set thold to the values you actually use.
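
For instance, a minimal sketch of that adaptation (untested; it assumes the tshark invocation from the question, that $file contains no spaces, and that "reaching" the threshold means strictly more than THRESHOLD lines, as in the original -gt check):

# Sketch only: print "not enough" when tshark emits THRESHOLD lines or fewer.
THRESHOLD=100
thold=$((THRESHOLD + 1))                                  # need strictly more than THRESHOLD lines
gen() { tshark -r "$file" -Y "tcp.stream==${streamID}"; }
b() { [ -z "$(gen | tail -n+"$thold" | head -c1)" ]; }    # true iff fewer than $thold lines
b && echo "not enough"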

If you need an even faster approach you can experiment more with stdbuf and LC_ALL=C.

Socowi
  • Surprisingly `b` was the fastest for me too. – KamilCuk May 15 '20 at 10:28
  • @KamilCuk Thank you for sharing your results. Why were you surprised? – Socowi May 15 '20 at 10:31
  • I think I was expecting a single process, like `awk` to win, instead of two processes `tail | head`. – KamilCuk May 15 '20 at 10:34
  • On my system it is method b) too. Thanks for this nice benchmark example ! – Michael P May 15 '20 at 10:41
  • I would have expected `sed` to win here. `head` and `tail` are indeed fast ... but not that fast. I believe the sample data is also a player. What if you replace `gen` by `printf -- "x%.0s" $(seq $((max*70))) | fold -w70` and `printf -- "x%.0s" $(seq $((max*6000))) | fold -w6000` (the latter is chosen to be bigger than `PIPE_BUF`)? – kvantour May 15 '20 at 10:46
  • @kvantour I don't know and would be happy if you could test it for us :) – Socowi May 15 '20 at 10:46
  • The reason why I expected `sed` is mainly this related question: https://stackoverflow.com/q/6022384/8344060 – kvantour May 15 '20 at 10:52
  • Good idea, albeit a bit misleading. That question seems to be about brevity not efficiency for large inputs. In the accepted answer, there is even [a comment](https://stackoverflow.com/questions/6022384/bash-tool-to-get-nth-line-from-a-file#comment34453410_6022431) pointing out that `head | tail` is faster. In that answer there is a huge discussion in the comments on why that may be. – Socowi May 15 '20 at 10:56
  • @Socowi: surprisingly, the `head` and `tail` combination is hands down the fastest, with `grep` a close second. Anything else is orders of magnitude slower. I believe `sed` and `awk` are slower because they have to process and parse the _code_ per processed record; `head` and `tail` just search for newline characters. – kvantour May 15 '20 at 13:02
  • The combination `head` and `wc` becomes much slower when the lines are longer. Most likely because of the character counting. – kvantour May 15 '20 at 13:05
1

Start tshark (or tail -f -n +1 file) from a wrapper which checks the output line count and exits once the threshold is reached. Here is a sample in awk, using seq to mimic tshark:

$ awk '
BEGIN {
    cmd="seq 1 100"                        # command to execute, outputs 100 lines
    while((cmd|getline res)>0 && ++c<50);  # count to 50 lines and exit
    print res                              # test to show last line of input
    exit
}'

Output:

50

seq keeps running for a while after line 50, though, but quits eventually. Changing to cmd="seq 1 10000000 | tee foo" and running tail foo afterwards, I got:

...
11407
11408
11
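
As a rough sketch of how this wrapper idea could be applied to the original threshold check (untested; it assumes the tshark command from the question, a $file path without spaces, and uses hypothetical awk variable names f, sid and max):

THRESHOLD=100
# awk launches tshark itself, stops reading after THRESHOLD+1 lines and
# reports via its exit status whether the threshold was exceeded.
awk -v f="$file" -v sid="$streamID" -v max="$THRESHOLD" '
BEGIN {
    cmd = "tshark -r " f " -Y \"tcp.stream==" sid "\""
    while ((cmd | getline line) > 0 && ++n <= max)
        ;                      # read at most max+1 lines, then stop
    close(cmd)                 # tshark may run on briefly, then dies from the broken pipe
    exit (n > max ? 0 : 1)     # success iff more than max lines were produced
}' || echo "not enough"
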
James Brown