
I have a script for daily monitoring of my system that works by reading a log file. The command I use to read and parse the relevant string from the log file with sed is as follows:

lastline= `cat logs1/$file | sed '/.\{100\}/!d' | sed -n '$p'`

Although this command gives the correct result, it takes a long time to execute and I need to reduce its execution time. I am unable to reduce the size of the log file. Can you suggest a better solution or an alternative to this command?

The log file has 2-3 million lines and its data looks like this:

21/11/02 10:05:53.906 | OUT   | OUT | [.0772230000340600720676E00000003406              100210055390                                  121676570608000000NOH1N1AFRN00AFRN136220211102100553254IRT1AFRN000100676            20211102000000029700000003581320000001463900070  1    1      120211102100553                                        H110B                0300000000                    184     202111020000000041        184980011  1849800118480208316         0000000000000000001               184-IR98001 080210     20211102085506 LJA1TSEDRHAHUB220000001463900 0000000000000                                                                                                                                                                                                                    0000000000000000000000000.]
21/11/02 10:05:55.607 | OUT   | IN  | [.000899.]
21/11/02 10:06:00.711 | OUT   | IN  | [.000899.]
21/11/02 10:06:05.714 | OUT   | IN  | [.000899.]
21/11/02 10:06:06.014 | OUT   | OUT | [.0772230000340700720676E00000003407              100210060601                                  121676574028000000NOH1N1SARV00SARV136220211102100605261IRT1SARV000100676            20211102000000100400000000992620000007140000070  1    1      120211102100605                                        H110B                0300000000                                                                                                                                                                                                                           120     202111020000002132        120980011  1209800112080208316         0000000000000000001               120-IR98001            20211102100448 LJA1TSEDRHFHUB220000007140000 0000000000000             0000000000000000000000000.]

In some lines (like lines 2, 3 and 4) the data is incomplete, so we should look for the last line with complete data. There is no set rule I can use to determine how far back the last complete line will be: there may be no complete data in the last 1000 lines, and a fixed-length tail would then not return the correct output. (This is why tail does not work.)

P.S. Part of the code can be seen at this link: here

Fatemeh Abdollahei
  • First, check your script with shellcheck. Do not use backticks. Do not use `cat file | cmd ..` when you can `cmd .. file`. And `| '/.\{100\}/!d'` is probably invalid. `to this command?` What does the command do exactly? – KamilCuk Nov 02 '21 at 06:27
  • The log file data is as follows (https://paste.debian.net/1217728/), but it has 2-3 million lines. Finally, by executing this command (lastline=`cat logs1/$file | sed '/.\{100\}/!d' | sed -n '$p'`), we want to find the last line that contents are complete and do some operation on it. – Fatemeh Abdollahei Nov 02 '21 at 06:40
    "_this command works well_": this is very unlikely. There are several errors in your script (a space after `=`, a missing `sed` in your pipeline...). Please fix it and test it. – Renaud Pacalet Nov 02 '21 at 06:44
  • Although I meant the logic of the program, but you can see part of the codes here: https://paste.debian.net/1217732/ – Fatemeh Abdollahei Nov 02 '21 at 06:54
  • @FatemehAbdollahei What's wrong with `tail`, as in `lastline=$(tail -1 "logs/$file")` ? – Ljm Dullaart Nov 02 '21 at 07:43
  • @FatemehAbdollahei : If time is an issue, I would not use a `sed|sed` solution (and an unnecessary `cat`), but write a script/program which does a **single** pass through the huge file and collects the necessary information. Also, what is a **long** time and how short do you want it to be? The more you need to optimize, the more complex the implementation. – user1934428 Nov 02 '21 at 07:50
  • @LjmDullaart I tried `tail` as in your comment before, but the issue is that the log file contains some incomplete lines (see here: paste.debian.net/1217728, lines 2, 3, 4), and it may return nothing useful because of those incomplete lines. So tail does not work properly. – Fatemeh Abdollahei Nov 02 '21 at 08:02
  • So, operate on a `tail -100` instead of the `cat` of the file. `lastline=$(tail -100 logs1/$file | sed '/.\{100\}/!d' | sed -n '$p')` – Ljm Dullaart Nov 02 '21 at 08:04
  • @user1934428 I think searching through such a huge file is what takes the time. This script is for monitoring; it currently takes more than 2 minutes to run, and I want it to run in less than a minute. – Fatemeh Abdollahei Nov 02 '21 at 08:04
  • @LjmDullaart Unfortunately, there is no set rule that I can use to determine the exact number of lines in which complete data exist. There may be no complete data in the last 1000 lines, in which case that would not return the correct output. So I should use `sed`. – Fatemeh Abdollahei Nov 02 '21 at 08:06
  • What about `lastline=$(tac "logs1/$file" | sed '/.\{100\}/!d' | head -1)` ? – Ljm Dullaart Nov 02 '21 at 08:11
  • @FatemehAbdollahei : Unless you have time-intensive calculations to do on each line, the optimal running time will probably be I/O bound, and a running time of well under one minute does not look unreasonable to me. Hence, writing a custom program (in, say, Ruby or Perl or Python) would be advisable (instead of gluing together prefabricated tools). – user1934428 Nov 02 '21 at 08:21
  • @LjmDullaart Thanks, but it doesn't work properly. – Fatemeh Abdollahei Nov 02 '21 at 08:32
  • `So we should look for the last line with complete data` Write a program that reads lines from the end of the file and checks whether they are complete; as soon as it finds one that is, it should output that line. Write it in a real programming language; it's not possible in `sed` to read a file from the end. – KamilCuk Nov 02 '21 at 08:51
  • https://stackoverflow.com/a/38289604/6908895 has an example of such a program that might serve as a basis for this. – Ljm Dullaart Nov 02 '21 at 12:17

5 Answers


This might work for you (GNU tac and sed):

tac file | sed -E '/.{100}/!d;q'

Read backwards through the file, quitting on the first line of 100 or more characters.

potong

With sed in one single invocation:

lastline=$(sed -n '/^.\{100\}/h;${g;p}' "logs1/$file")

Each line with at least 100 characters is copied to the hold space. At the end of the log file we copy the hold space to the pattern space and we print the pattern space.

If this is not fast enough you'll probably need to use something else than sed.

Renaud Pacalet

Try this:

lastline=$(awk '(length>=100) {last=$0}; END {print last}' "logs1/$file")

Explanation: awk can do all of this itself, looking at each line only once. It just records the latest line of 100 or more characters in the `last` variable and prints it at the end. It also reads directly from the file, avoiding the overhead of `cat`.

I don't know for certain if this'll be faster or how much so; it may depend on what version of awk you happen to have. But in principle it should be faster since it does less work on each line as it goes through the file.

If you really want it to be fast, I think you'd need to write something like a C program that seeks a ways before the end of file -- maybe a couple of thousand bytes -- and looks for a long line in just that last part of the file. If it doesn't find one, seek back a ways further, and try again.
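
For illustration, here is a minimal sketch of that idea, written in Python rather than C for brevity. The chunk size, the 100-character completeness test, and all the names are assumptions made for this example, not part of any existing script:

#!/usr/bin/env python3
# Sketch: read only the tail of the file, growing the window until a
# "complete" (>= 100 character) line is found.
import os
import sys

MIN_LEN = 100     # assumed definition of a "complete" line
CHUNK = 4096      # initial number of bytes to read from the end

def last_complete_line(path):
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        chunk = CHUNK
        while True:
            start = max(0, size - chunk)
            f.seek(start)
            lines = f.read(size - start).split(b"\n")
            if start > 0:
                lines = lines[1:]          # first line may be truncated; drop it
            for line in reversed(lines):
                if len(line) >= MIN_LEN:
                    return line.decode(errors="replace")
            if start == 0:
                return None                # no complete line in the whole file
            chunk *= 2                     # seek back further and try again

if __name__ == "__main__":
    result = last_complete_line(sys.argv[1])
    if result is not None:
        print(result)

Because it only ever reads a small window at the end of the file, its running time is essentially independent of how many millions of lines the log contains.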

Gordon Davisson
  • Isn't that what `tac` does? – tripleee Nov 02 '21 at 18:00
  • @tripleee I'm not familiar with how `tac` is implemented, but if that's the case something like potong's suggestion may work well -- as long as what's after `tac` in the pipeline exits as soon as it finds a valid line, `tac` will get a SIGPIPE that should stop it from trying to process the entire file. – Gordon Davisson Nov 03 '21 at 02:50

Solution: print the last line containing at least 100 characters:

grep '.\{100\}' "logs1/$file" | tail -n 1

It can also be done with a single sed:

sed -ne '/.\{100\}/h' -e '${x;p}' "logs1/$file"

But grep will usually be faster than sed, especially if using GNU grep. It really depends on the grep implementation, though.

These rough benchmarks can illustrate the point:

GNU:

$ time grep '.\{100\}' /tmp/rand-lines | tail -n 1 >/dev/null

real    0m0.278s
user    0m0.345s
sys 0m0.000s

$ time sed -ne '/.\{100\}/h' -e '${x;p}' /tmp/rand-lines >/dev/null

real    0m0.818s
user    0m0.811s
sys 0m0.000s

GNU grep, piped to tail -n 1, is significantly faster than GNU sed.

Busybox:

$ time busybox grep '.\{100\}' /tmp/rand-lines | tail -n 1 >/dev/null

real    0m10.340s
user    0m10.413s
sys 0m0.000s

$ time busybox sed -ne '/.\{100\}/h' -e '${x;p}' /tmp/rand-lines >/dev/null

real    0m10.588s
user    0m10.583s
sys 0m0.000s

On Busybox, which has a simpler grep implementation, grep still wins, but by a much smaller margin.

The test file was 20,000 lines of random characters (printable ASCII + spaces), containing 7058 lines that have at least 100 characters:

$ wc -l /tmp/rand-lines
20000 /tmp/rand-lines
$ grep -c '.\{100\}' /tmp/rand-lines
7058
$ head -n 1 /tmp/rand-lines
zJ_u)k_#+K!-ZjR#x2{?>Xw3%xOx|):L^SV|=z&fEUJgn;oO9@[Wq[8I^UniwZ0q&CpL,n7]NI^WK7ke{t).=LFHXyI'Z$Dn!g+^ _,Hq<3X*f=>fm8=qYyh!WQUMo_,GLDPPy*N^.(G0!$;+O9WcsSY

Edit: I updated the first sed benchmark. I thought I had GNU sed on that system, but it was running busybox both times. I re-did it with GNU sed (on the same system). The difference between grep and sed is still significant, but less dramatic than I originally wrote.

dan
  • Interesting. I just did a similar test (20480 lines based on the repetition of the provided example) but with very different results. With GNU grep 3.3 and GNU sed 4.7 the real times are 0m0.023s and 0m0.034s, respectively. I used `grep '^.\{100\}' | tail -n 1` (the `^` anchor seems to provide a small benefit) and `sed -n '/^.\{100\}/h;${g;p}'` (and also with your sed command, which is marginally slower than mine, again probably because of the anchor). grep is faster but the ratio is not that large. Could it come from the different data? – Renaud Pacalet Nov 02 '21 at 10:30
  • @RenaudPacalet Thanks for pointing this out, the benchmark was wrong and I’ve fixed it. Given the magnitude of the difference I should have scrutinised it further. I’ve seen GNU grep do some incredibly fast things though, so I just accepted the difference. – dan Nov 02 '21 at 12:18
  • Thanks for the benchmark! Could you include @potong's `tac` answer for comparison, too? – tripleee Nov 02 '21 at 18:01
  • @tripleee With a 78125-line file generated by repeating the 5-line example in the OP, and GNU grep 3.7, GNU sed 4.8, GNU tac 8.32: `time grep '^.\{100\}' file.log | tail -n 1`: real 0m0.059s, user 0m0.042s, sys 0m0.024s. `time sed -n '/^.\{100\}/h;${g;p}' file.log`: real 0m0.108s, user 0m0.085s, sys 0m0.009s. `time tac file.log | sed -E '/.{100}/!d;q'`: real 0m0.017s, user 0m0.005s, sys 0m0.006s. `tac` wins, `sed` loses. – Renaud Pacalet Nov 14 '21 at 06:53

"we should look for the last line with complete data"

Write a program that opens the file, seeks to the end, reads lines from the back of the file (by searching backwards for newlines), checks whether each line is "complete", and, as soon as it finds one that is, outputs that line and terminates.

sed can't read a file from the end. Gluing commands together in a pipeline means that the command on the left reads through the whole file and pushes the data into the pipe, which causes a lot of unnecessary I/O.
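
A rough sketch of such a program, in Python, assuming (as in the question) that a line counts as "complete" when it is at least 100 characters long; the block size and the helper name are illustrative, not an existing API:

#!/usr/bin/env python3
# Sketch: walk the file backwards in fixed-size blocks, splitting out lines
# from the end, and stop at the first line that looks complete.
import os
import sys

BLOCK = 8192      # how many bytes to read per backward step

def read_lines_backwards(path):
    """Yield the lines of the file one by one, starting from the last line."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""                        # bytes after the last newline seen so far
        while pos > 0:
            step = min(BLOCK, pos)
            pos -= step
            f.seek(pos)
            parts = (f.read(step) + tail).split(b"\n")
            tail = parts[0]               # may still be a partial line
            for part in reversed(parts[1:]):
                yield part
        yield tail                        # the very first line of the file

def is_complete(line):
    return len(line) >= 100               # assumed completeness rule

if __name__ == "__main__":
    for line in read_lines_backwards(sys.argv[1]):
        if is_complete(line):
            print(line.decode(errors="replace"))
            break

Invoked as, say, `python3 lastline.py "logs1/$file"` (the script name is hypothetical), it normally reads only the last few kilobytes of the log, however large the file is.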

KamilCuk