
I have several large files in which I need to find a specific string and take everything between the line that contains the string and the next line that starts with a date. The files look like this:

20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo

The output I need is this:

20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: bla this_i_need bla

Now I'm using `sed '/'"$string"'/,/'"$date"'/!d'`, which works as intended except that it also includes the next line with the date even when that line doesn't contain the string, but that's not a big problem.

The problem is that it takes a really long time searching the files. Is it possible to edit the sed command so it will run faster or is there any other option to get a better runtime? Maybe using awk or grep?

EDIT: I forgot to add that the expected results occur multiple times in one file, so exiting after one match is not suitable. I am looping through multiple files in a for loop with the same $string and the same $date. There are a lot of factors slowing the script down that I can't change (extracting files one by one from a 7z archive, searching them and removing them after the search in one loop).
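
For illustration, a minimal sketch of the kind of loop described above (file names are placeholders; the 7z extraction and removal steps are only indicated by comments):

string='this_i_need'
date='20220520'

for f in file1.log file2.log; do                            # placeholder file names
    # ... extract "$f" from the 7z archive here ...
    sed '/'"$string"'/,/'"$date"'/!d' "$f" >> results.txt   # the current search
    # ... remove "$f" after searching ...
done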

qwedcxzas
  • Maybe [this answer](https://stackoverflow.com/questions/19257597/find-specific-pattern-and-print-complete-text-block-using-awk-or-sed) will help you ? – MyICQ May 25 '22 at 08:59
  • The overhead of reading the data in the first place is what is taking time; you can't really get much faster than `sed` for a simple application like this. If you need to analyze the same data files multiple times, reading them into a database and indexing on the interesting fields could speed things up. Creating the database will take more time than just reading the files with `sed`, but you get that time back as you run multiple analyses on the database. – tripleee May 25 '22 at 09:31
  • I bet what you're REALLY using is a shell loop calling sed multiple times for different values of `$string` and/or `$date`. If so it's the shell loop that's slowing you down, not the sed command, but we can't help you with that real problem as the surrounding code is missing from your question. – Ed Morton May 25 '22 at 12:46

4 Answers


Using sed, you might try:

sed -n '/this_i_need/{:a;N;/\n20220520/!ba;p;q}' file

Explanation

  • `-n` Prevent default printing of a line
  • `/this_i_need/` When matching this_i_need
  • `:a` Set a label a to be able to jump back to
  • `N` Pull the next line into the pattern space
  • `/\n20220520/!` If not matching a newline followed by the date
  • `ba` Jump back to the label (like a loop, processing what comes after the label again)
  • `p` When we do match a newline and the date, print the pattern space
  • `q` Exit sed

Output

20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: blablabla
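
Since the question's edit mentions that the string occurs multiple times per file, a variant without the final q keeps sed scanning after the first block; a sketch of that (GNU sed syntax, untested on Solaris sed):

sed -n '/this_i_need/{:a;N;/\n20220520/!ba;p}' file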
The fourth bird
  • Thank you for your response. I've tried your solution with data like I will be using and got an error - `-bash-3.2$ sed -n '/010001226208B974/{:a;N;/\n"20220520"/!ba;p}' ftm_ain_MC.log.203012.20220520 > test.010001226208B974 Label too long: /010001226208B974/{:a;N;/\n"20220520"/!ba;p}` Is the string I'm looking for too long? – qwedcxzas May 25 '22 at 13:20
  • @qwedcxzas What is the system that you are running the code on? – The fourth bird May 25 '22 at 13:23
  • SunOS hostname 5.10 Generic_153154-01 i86pc i386 i86pc – qwedcxzas May 25 '22 at 13:24
  • @qwedcxzas Did you see this post? https://stackoverflow.com/questions/20840661/solaris-sed-label-too-long – The fourth bird May 25 '22 at 13:24
  • Yes, that was the first one I checked, but I don't really know which part I should take out instead of the ':1'. I'm sorry, I don't really understand sed, that's why I'm asking for help. – qwedcxzas May 25 '22 at 13:29
  • @qwedcxzas Does it work [like this](https://tio.run/##fY7NCsIwEITveYqxl55qfmxRQp7BJxBCQiMpSFPbggfx2WNMEXqoLsPuYb5ZxprJxzi5FlXvUNLZd5PudO9cS58EaaTJ55w3vfSCCcEawegOFos35H0nrxJKKRRfpOJcNgfJ@F7UQuIaQpI1I9kC6gzoZMPejJdYVSEPb2aQOYC0YSvdHOUntegPsH6KX@gpoWZMSn1CEeMb) ? – The fourth bird May 25 '22 at 13:33
  • Not really, I get an `Unrecognized command: /\n20220520/! b a` error. I don't understand why though, it works in your link. – qwedcxzas May 25 '22 at 13:48
  • @qwedcxzas And if you just write `sed -ne '/this_i_need/{:a;N;/\n20220520/!ba;p;q}' file` – The fourth bird May 25 '22 at 13:52

With sed, all the lines outside the matching ranges have to be deleted from the buffer, which is inefficient when the file is large.

You can instead use awk to output the desired lines directly: set a flag upon matching the specific string, clear it when matching a date pattern, and output the line whenever the flag is set:

awk '/[0-9]{8}/{f=0}/this_i_need/{f=1}f' file

Demo: https://ideone.com/J2ISVD
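
Since the question specifies that the date appears at the beginning of a line, anchoring the date pattern avoids clearing the flag on digits elsewhere in a line; a possible variant of the above (a sketch, requires an awk with ERE interval support such as gawk):

awk '/^[0-9]{8}-/{f=0} /this_i_need/{f=1} f' file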

blhsing
  • You might consider quitting as soon as you have found what you are looking for - so that if you find it on line 3 you don't read the remaining many gigabytes of data... – Mark Setchell May 25 '22 at 09:29
  • This isn't really correct regarding how `sed` works. It simply reads a line at a time and keeps an internal flag which tells it whether the first pattern has been seen on a previous line, exactly like your Awk script does it. – tripleee May 25 '22 at 09:30

You might use the exit statement to instruct GNU AWK to stop processing, which should give a speed gain if the lines you are looking for end well before the end of the file. Let file.txt content be

20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what 
to 
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo

then

awk 's&&/^[[:digit:]]{8}.*this_i_need/{print;exit}/this_i_need/{p=1;s=1;next}p&&/^[[:digit:]]{8}/{p=0}p{print}' file.txt

gives output

what 
to 
do
20220520-11:53:01.257: bla this_i_need bla

Explanation: I use two flag variables, p as printing and s as seen. I inform GNU AWK to

  • print the current line and exit if s is set and the line starts with 8 digits followed by zero or more characters followed by this_i_need
  • set the p flag to 1 (true) and the s flag to 1 (true) and go to the next line if this_i_need was found in the line
  • set the p flag to 0 (false) if the p flag is 1 and the line starts with 8 digits
  • print the current line if the p flag is set to 1

Note that order of actions is crucial.

Disclaimer: this solution assumes that if a line starts with 8 digits, then it is a line beginning with a date; if that is not the case, adjust the regular expression according to your needs.

(tested in gawk 4.2.1)

Daweo
  • Thank you for your response. This actually works and is faster than my sed, but I also need the first line with the date containing `this_i_need` along with all the others following it. The matches may occur multiple times in one file. Is there a possibility to also include lines containing `this_i_need`? – qwedcxzas May 25 '22 at 13:42
  • @qwedcxzas I edited my answer to comply with that requirement – Daweo May 25 '22 at 13:59

Assumptions:

  • start printing when we find the desired string
  • stop printing when we read a line that starts with any date (ie, any 8-digit string)

One awk idea:

string='this_i_need'

awk -v ptn="${string}" '         # pass bash variable "$string" in as awk variable "ptn"
/^[0-9]{8}/ { printme=0 }        # clear printme flag if line starts with 8-digit string
$0 ~ ptn    { printme=1 }        # set printme flag if we find "ptn" in the current line
printme                          # only print current line if printme==1
' foo.dat

Or as a one-liner sans comments:

awk -v ptn="${string}" '/^[0-9]{8}/ {printme=0} $0~ptn {printme=1} printme' foo.dat

NOTE: OP can rename the awk variables (ptn, printme) as desired as long as they are not reserved keywords (see 'Keyword' in the awk glossary)

This generates:

20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: bla this_i_need bla
markp-fuso
  • Is it possible to break out of the awk if the first n characters match a string? The matches I need from the file occur during one, max two seconds, and the file has 50+ GB of data from all day. So let's say I need data of $string that happened at 01:02:03, can the awk be written in a way that it ends its search in a file if a line starts with 20230228-01:02:03? Or if the line contains 01:02:03? – qwedcxzas Feb 28 '23 at 12:04
  • @qwedcxzas sure, when you find a match for the desired pattern you can run the `awk` command `exit`, eg, `/01:02:03/ { exit }` – markp-fuso Feb 28 '23 at 13:43
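
Putting that comment together with the answer's script, a sketch of such an early exit (the cutoff pattern -01:02:04 is a hypothetical placeholder for a timestamp just past the window of interest; the exit rule comes last so the cutoff line itself is still considered for printing):

awk -v ptn="${string}" '
/^[0-9]{8}/ { printme=0 }          # clear printme flag at the start of a dated line
$0 ~ ptn    { printme=1 }          # set printme flag when the string is found
printme                            # print current line if printme==1
/-01:02:04/ { exit }               # hypothetical cutoff: stop reading the rest of the file
' foo.dat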