1

I've run into the following problem and haven't found a solution, nor an explanation of why awk behaves this way.

So let's say I have the following text in a file:

startcue
This shouldn't be found.

startcue
This is the text I want to find.
endcue

startcue
This shouldn't be found either.

And I want to find the lines "startcue", "This is the text I want to find.", and "endcue".

I naively assumed that a simple range search with awk '/startcue/,/endcue/' would do it, but this prints out the whole file. I guess awk somehow finds the first range, but because the third startcue switches printing back on, it prints all the lines until the end of the file (still, this all seems a bit strange to me).

Now to the question: How can I get awk to print out just the lines I want? And maybe as an extra question: Can anybody explain awk's behaviour?

Thanks

  • That range matches as many times as it can. The first match runs from line 1 to `endcue`; the second match runs from the last `startcue` to the end of the file. So it should not print that second blank line. How do you expect awk to know which startcue to use (for your suggested usage)? You can do what you want by manually keeping the lines (and dropping previously saved lines when you hit a new start line). – Etan Reisner Apr 30 '15 at 21:23
  • Never use range expressions, always use a flag instead, e.g. `/start/{f=1} f; /end/{f=0}`. Range expressions make scripts to solve trivial jobs very slightly briefer but then require a complete rewrite and/or duplicate conditions when even the tiniest morsel of complexity is introduced, as you are discovering. – Ed Morton Apr 30 '15 at 21:24

4 Answers

3
$ awk '/startcue/{f=1; buf=""} f{buf = buf $0 RS} /endcue/{printf "%s",buf; f=0}' file
startcue
This is the text I want to find.
endcue
Ed Morton
  • sure, it inits a buffer and sets a flag when it finds the first regexp, adds to the buffer every line while the flag is set, and then prints the buffer and resets the flag when it finds the last regexp. – Ed Morton May 01 '15 at 04:36
  • I'm going to link to your [fantastic answer](http://stackoverflow.com/a/17914105/258523) again as seeing other uses of this pattern might help people understand it. – Etan Reisner May 01 '15 at 13:22
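
The flag-and-buffer logic described in the comments above can be spelled out as a commented, multi-line version of the same one-liner. This is only a sketch: the temporary file stands in for `file`, and the variable names `input` and `result` are made up for the illustration.

```shell
# Recreate the sample input from the question in a temporary file.
input=$(mktemp)
printf '%s\n' \
  'startcue' \
  "This shouldn't be found." \
  '' \
  'startcue' \
  'This is the text I want to find.' \
  'endcue' \
  '' \
  'startcue' \
  "This shouldn't be found either." > "$input"

result=$(awk '
  /startcue/ { f = 1; buf = "" }          # start line: set the flag and drop anything buffered so far
  f          { buf = buf $0 RS }          # while the flag is set, append the current line to the buffer
  /endcue/   { printf "%s", buf; f = 0 }  # end line: print the buffer and clear the flag
' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Because every new startcue empties the buffer, a block that never reaches an endcue is silently discarded instead of being printed.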
2

Here is a simple way to do it.
Since the data is separated by blank lines, I set RS to the empty string.
This makes awk process the data in blocks (paragraph mode), one blank-line-separated block per record.
Then find all blocks starting with startcue and ending with endcue:

awk -v RS="" '/^startcue/ && /endcue$/' file
startcue
This is the text I want to find.
endcue
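
To see what awk actually treats as a record once RS is empty, a quick sketch like the following may help. It assumes the sample input from the question; the `sample` function and the `records` variable are names invented for this illustration, and FS is set to a newline only so that each line of a block counts as one field.

```shell
# Emit the sample input from the question.
sample() {
  printf '%s\n' \
    'startcue' \
    "This shouldn't be found." \
    '' \
    'startcue' \
    'This is the text I want to find.' \
    'endcue' \
    '' \
    'startcue' \
    "This shouldn't be found either."
}

# RS="" switches awk to paragraph mode: each blank-line-separated block is one record.
# FS="\n" makes every line of the block a separate field, so NF is the line count.
records=$(sample | awk -v RS="" -v FS='\n' '
  { print "record " NR ": " NF " lines, last line: " $NF }
')

printf '%s\n' "$records"
```

Only the second record both starts with startcue and ends with endcue, which is why the patterns in this answer select just that block.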

If startcue and endcue are always the first and last lines and each appears only once in the block, this should do. (PS: testing shows that it does not matter how many times they occur in the block. This always prints the block if both startcue and endcue are found.)

awk -v RS="" '/startcue/ && /endcue/' file
startcue
This is the text I want to find.
endcue

And this should work too:

awk -v RS="" '/startcue.*endcue/' file
startcue
This is the text I want to find.
endcue
Jotne
1

To summarize the problem, you want to print the lines from startcue to endcue, but not if the endcue is missing. Ed Morton's approach is good. Here is yet another approach:

$ tac file | awk '/endcue/,/startcue/' | tac
startcue
This is the text I want to find.
endcue

How it works

  • tac file

    This prints the lines in reverse order. tac is just like cat except that the lines come out in reverse order.

  • awk '/endcue/,/startcue/'

    This prints all lines starting from endcue and finishing with startcue. When done this way, passages with missing endcues are not printed.

  • tac

    This reverses the lines once again so that they are back in the correct order.

How awk ranges work

Consider:

 awk '/startcue/,/endcue/' file

This tells awk to start printing when it finds startcue and continue printing until it finds endcue. This is exactly what it does on your file.

There is no implied rule that the range /startcue/,/endcue/ cannot itself contain multiple instances of startcue. awk simply starts printing when it sees the first occurrence of startcue and continues until it finds endcue. If no endcue follows, the range stays open until the end of the input.
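
As a small illustration of the behaviour just described, the sketch below records which input line numbers the range selects on the question's sample data. The `sample` function and the `printed` variable are names made up for this example.

```shell
# Emit the sample input from the question.
sample() {
  printf '%s\n' \
    'startcue' \
    "This shouldn't be found." \
    '' \
    'startcue' \
    'This is the text I want to find.' \
    'endcue' \
    '' \
    'startcue' \
    "This shouldn't be found either."
}

printed=$(sample | awk '
  /startcue/,/endcue/ { nums = nums " " NR }   # remember every line number the range selects
  END { print "lines selected:" nums }
')

printf '%s\n' "$printed"
# prints: lines selected: 1 2 3 4 5 6 8 9
```

Line 7 (the blank line after endcue) is the only line left out: the first range covers lines 1 to 6, and the third startcue on line 8 opens a second range that never sees an endcue.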

John1024
  • This will just invert the problem though. This will print sections that have an `endcue` but no `startcue`. This is also significantly less efficient than the more straightforward (no pun intended) approach that Ed uses. – Etan Reisner Apr 30 '15 at 22:53
  • Thanks! This works as well, but as Ed provided an awk-only solution, I'll go with his. Nice explanation, though! – Tryst Morer Apr 30 '15 at 22:55
  • @EtanReisner (1) The OP showed only missing endcues, so, yes, _as stated in the first sentence of this answer_, this answer was only about missing endcues. (2) Sometimes computer "efficiency" is important. Oftentimes, it is the efficient use of the programmer's time that is more important. Since this code is short and does not require grokking code that defines and updates variables, I believe it fits the latter meaning of efficiency. – John1024 May 01 '15 at 06:50
  • Question posters *routinely* leave out **critical** details of their issue. Tailoring a solution to the exact specifics (when a generic solution is available) is often not the best way to answer a question. That all being said I was not attacking your answer as much as pointing out a detail that the OP (and later people looking at this answer) might not immediately realize exists here. – Etan Reisner May 01 '15 at 13:20
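
The caveat raised in these comments can be checked with a small experiment. The input below is hypothetical (it is not the question's data): it contains a stray endcue with no preceding startcue. The sketch assumes `tac` is available (it ships with GNU coreutils and may be missing elsewhere); `sample2`, `reversed` and `buffered` are names invented for the illustration.

```shell
# Hypothetical input: an endcue with no matching startcue, followed by a complete block.
sample2() {
  printf '%s\n' 'stray text' 'endcue' '' 'startcue' 'wanted' 'endcue'
}

# Reversed-range approach from this answer.
reversed=$(sample2 | tac | awk '/endcue/,/startcue/' | tac)

# Flag-and-buffer approach from Ed Morton's answer, for comparison.
buffered=$(sample2 | awk '/startcue/{f=1; buf=""} f{buf = buf $0 RS} /endcue/{printf "%s",buf; f=0}')

printf 'reversed range:\n%s\n\nflag and buffer:\n%s\n' "$reversed" "$buffered"
```

On this input the reversed range also prints "stray text" and the unmatched endcue, while the flag-and-buffer version prints only the complete block.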
0

No buffering needed:

{m,n,g}awk 'BEGIN { _ +=_ ^= ORS = FS = RS = "\nendcue\n"
                   sub("end", "?start", RS)
                   __= substr(RS, _+--_) } (NF=_<NF) && $!_=__$_'
startcue
This is the text I want to find.
endcue
RARE Kpop Manifesto