
I am trying to parse a large log file by extracting the lines between two different patterns.

Example:

line1
line2
...
lineN
pattern1
line4
line6
pattern2
....
other lines
pattern1
line8
line9
pattern2
...

The lines I need to capture are those between pattern1 and pattern2 (so, line4 through line6 and line8 through line9).

I am using

sed -n '/pattern1/,/pattern2/p'

to search the file, but it takes a really long time to complete (yes, my log file is large).

Is there a more efficient way to speed up the search? Ideally a single-line command (awk, grep, etc.) or Python.

tripleee
Jia
  • 1
    I doubt you'd find any solution with awk/python to be faster than `sed`.. you can use `LC_ALL=C sed -n '/pattern1/,/pattern2/p'` to speed up if input is all ASCII... see also https://stackoverflow.com/a/38978201/4082052 if you do not want the starting/ending lines in output – Sundeep Mar 04 '18 at 09:58
  • You could test if this is going a little faster: `sed -n '/^pattern1$/,/^pattern2$/p'` – Cyrus Mar 04 '18 at 10:08
  • 1
    If the real patterns are static strings, it's not impossible that Awk with `$0 == "pattern1"` could be faster than `sed -n '/^pattern1$/` but in the grand scheme of things, I/O buffering overhead will massively dominate over and shadow any code performance differences. – tripleee Mar 04 '18 at 10:16
  • I tried suggestions from Sundeep and Cyrus, no obvious speed up observed. – Jia Mar 04 '18 at 10:29
  • @Sundeep the link you offered does help, thank you! – Jia Mar 04 '18 at 10:37
  • Possible duplicate of [How to select lines between two patterns?](https://stackoverflow.com/questions/38972736/how-to-select-lines-between-two-patterns) – Cyrus Mar 04 '18 at 11:14

2 Answers


You can try:

awk '/pattern1/,/pattern2/'

In my experience mawk can be significantly faster than sed with this kind of operation and is usually the fastest. Alternatively gawk4 can be much faster than gawk3, so you could try that too.
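
For readers more comfortable with Python, the logic behind the range pattern above can be sketched as a simple on/off flag. This is only an illustration of the semantics, not how any awk implementation works internally; the function name `lines_in_ranges` is made up for this sketch, and plain substring tests stand in for awk's regex matching.

```python
def lines_in_ranges(lines, start, end):
    """Yield lines from each start..end range, marker lines included.

    Substring tests stand in for awk's regex matches (an assumption made
    to keep the sketch short).
    """
    inside = False
    for line in lines:
        # A line containing the start marker opens a range.
        if not inside and start in line:
            inside = True
        if inside:
            yield line
            # The line containing the end marker is printed, then the range closes.
            if end in line:
                inside = False


sample = ["line1", "pattern1", "line4", "line6", "pattern2", "other",
          "pattern1", "line8", "line9", "pattern2", "tail"]
selected = list(lines_in_ranges(sample, "pattern1", "pattern2"))
print(selected)
```

As with the awk one-liner, both marker lines appear in the output along with everything between them.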

--edit--

FWIW, just did a small test on a file with 4 million lines

On MacOS 10.13:

sed  :         1.62 real         1.61 user         0.00 sys
gsed :         1.31 real         1.30 user         0.00 sys
awk  :         2.14 real         2.12 user         0.00 sys
gawk3:         5.05 real         3.90 user         1.13 sys
gawk4:         0.61 real         0.60 user         0.00 sys
mawk :         0.42 real         0.40 user         0.00 sys

On Centos 7.4:

gsed :         1.56 real         1.54 user         0.01 sys
gawk4:         1.31 real         1.29 user         0.01 sys
mawk :         0.56 real         0.54 user         0.01 sys
Scrutinizer
  • What if I want to print both pattern1 and pattern2 as well as the lines in between? – Jia Mar 04 '18 at 10:27
  • Use Scrutinizer's answer? – Cyrus Mar 04 '18 at 10:32
  • Like Cyrus says this answer includes both pattern1 and pattern2. Also added some test results... – Scrutinizer Mar 04 '18 at 10:53
  • 1
    Yes, Cyrus and Scrutinizer, it does include pattern1/pattern2 (sorry, I made a mistake in my test, so pattern1/pattern2 were not displayed). As for execution time, mine improved noticeably with the `awk` command (Ubuntu16.6). Time using `sed -n '/pattern1/,/pattern2/p'`: real 7m14.025s, user 6m56.217s, sys 0m17.552s. Time using `awk '/pattern1/,/pattern2/'`: real 1m46.925s, user 1m34.533s, sys 0m13.066s – Jia Mar 04 '18 at 12:39
  • @Jia, if `mawk` is not present on your system, you could try and install it (it should be available for Ubuntu), to see if it renders further improvements. – Scrutinizer Mar 04 '18 at 14:47
  • Sure, I will try. Thank you :) One more question if you don't mind: do you know why awk is several times faster than sed in this case? Just curious, I am very new to sed and awk. – Jia Mar 04 '18 at 14:52

You can try this if you use Python:

import re
m = re.findall(r'(?<=pattern1).*?(?=pattern2)', log_file, re.DOTALL)  # log_file: the file contents as one string; DOTALL lets . match newlines
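
If the log is too large to load into a single string for a regular expression, a line-by-line pass may be worth trying instead. The following is a minimal sketch under a few assumptions: the markers are treated as plain substrings rather than regexes, the marker lines themselves are excluded (as with the lookarounds above), and `blocks_between` is a hypothetical helper name. An `io.StringIO` object stands in for the real file.

```python
import io


def blocks_between(lines, start, end):
    """Yield one list per block: the lines strictly between start and end."""
    block = None  # None means we are currently outside a block
    for line in lines:
        text = line.rstrip("\n")
        if block is None:
            if start in text:
                block = []  # start marker seen; begin collecting
        elif end in text:
            yield block  # end marker seen; emit the block without the markers
            block = None
        else:
            block.append(text)
    # A block that is still open at end of input is dropped.


# In-memory stand-in for the log; a real run would pass open(path) instead.
log = io.StringIO("line1\npattern1\nline4\nline6\npattern2\nother\n"
                  "pattern1\nline8\nline9\npattern2\ntail\n")
blocks = list(blocks_between(log, "pattern1", "pattern2"))
print(blocks)
```

Because the function consumes any iterable of lines, a file object can be passed directly and is read lazily, so memory use stays proportional to the largest block rather than to the whole file. Whether this beats sed or awk on a given machine would need to be measured.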
nyr1o
  • Louis, seems like I cannot match my expected strings out using this command , I am still trying to figure out, will let you know if I find working pattern using re module , thank you ! – Jia Mar 04 '18 at 12:44