Is a /start/,/end/ range expression ever useful in awk?

Question

I've always contended that you should never use a range expression like:

/start/,/end/

in awk because although it makes the trivial case where you only want to print matching text including the start and end lines slightly briefer than the alternative*:

/start/{f=1} f{print; if (/end/) f=0}

when you want to tweak it even slightly to do anything else, it requires a complete re-write or results in duplicated or otherwise undesirable code. e.g. if you want to print the matching text excluding the range delimiters using the second form above you'd just tweak it to move the components around:

f{if (/end/) f=0; else print} /start/{f=1}

but if you started with /start/,/end/ you'd need to abandon that approach in favor of what I just posted or you'd have to write something like:

/start/,/end/{ if (!/start|end/) print }

i.e. duplicate the conditions which is undesirable.
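
To make the comparison concrete, here is a small sketch that runs the scripts discussed above against a made-up sample. The input lines and the function names are assumptions for illustration only; the awk programs are the ones quoted in the question.

```shell
# Hypothetical sample input: one start/end block surrounded by other lines.
sample() {
    printf '%s\n' before start one two end after
}

# Range expression: prints the block including its delimiters.
range_inclusive() { sample | awk '/start/,/end/'; }

# Flag variable: same output as the range expression.
flag_inclusive() { sample | awk '/start/{f=1} f{print; if (/end/) f=0}'; }

# Flag variable with the rules rearranged to drop the delimiters.
flag_exclusive() { sample | awk 'f{if (/end/) f=0; else print} /start/{f=1}'; }

# Range expression with the start/end conditions repeated in the action.
range_exclusive() { sample | awk '/start/,/end/{ if (!/start|end/) print }'; }
```

Calling `range_inclusive` and `flag_inclusive` should give identical output, as should `range_exclusive` and `flag_exclusive`; only the range version needs the patterns written twice to get there.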

Then I saw a question posted that required identifying the LAST end in a file and where a range expression was used in the solution and I thought it seemed like that might have some value (see https://stackoverflow.com/a/21145009/1745001).

Now, though, I'm back to thinking that it's just not worth bothering with range expressions at all and a solution that doesn't use range expressions would have worked just as well for that case.

So - does anyone have an example where a range expression actually adds noticeable value to a solution?

*I used to use:

/start/{f=1} f; /end/{f=0}

but too many times I found I had to do something additional when f is true and /end/ is found (or to put it another way ONLY do something when /end/ is found IF f were true) so now I just try to stick to the slightly less brief but much more robust and extensible:

/start/{f=1} f{print; if (/end/) f=0}
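
A sketch of the kind of extension this footnote has in mind, assuming a made-up requirement (count the completed blocks) and a made-up input that contains a stray `end` line before the real block. The counter `n` and the function names are hypothetical.

```shell
# Hypothetical input: a stray "end" followed by one real start/end block.
sample() {
    printf '%s\n' end start body end
}

# Terse form: the /end/ rule fires on every matching line, so the
# stray "end" outside a block is counted as well.
count_terse() {
    sample | awk '/start/{f=1} f; /end/{f=0; n++} END{print n+0}' | tail -n 1
}

# Robust form: /end/ is only tested while f is set, so only the block
# that was actually opened is counted.
count_robust() {
    sample | awk '/start/{f=1} f{print; if (/end/) {f=0; n++}} END{print n+0}' | tail -n 1
}
```

Both functions print the block itself and then the count; `tail -n 1` keeps only the count so the two results are easy to compare.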

I only recently learned about range expressions and I like them! Of course, there's nothing they can do that a "flag" variable couldn't but I would argue that they _are_ useful. Granted, in using `f` you're cutting down on (one kind of) repetition but in doing so you're taking on the responsibility of keeping track of `f` between records. This effectively means that in order to understand the script, you have to read it (at least) twice, rather than once. — Tom Fenech, May 29 '14 at 18:16
@TomFenech how would you deal with enhancing the script to, say, not print the start/end lines? Throw away the original and start again with a variable, or introduce duplication of the start/end conditions with an `if` in the action block or something else? My concern with the range expression is just there's IMHO no reasonable way to build upon it if/when your requirements change. — Ed Morton, May 29 '14 at 18:23
I don't think there's anything wrong with the `if`. It's a simple combination of two regexes and neither approach scales particularly well with multiple conditions anyway. I guess you _could_ do `/start/ {getline; do { print; getline } while (!/end/)}` if you really wanted ;) — Tom Fenech, May 29 '14 at 18:45
The issue with the `if` is that you're duplicating code so if you had to test for a different condition later then you'd need to make the same change in 2 places which is generally undesirable in software. wrt the `getline` suggestion - that is fraught with issues and should not be implemented, make sure you read and fully understand http://awk.info/?tip/getline if you're considering using `getline`. — Ed Morton, May 30 '14 at 14:14
I was only joking about using `getline` but thanks for the link anyway :) In terms of avoiding repetition, you could always set the patterns to variables and use the `~` operator: `$0~s, $0~e {if(!($0~s||$0~e)) print}`. Anyway, all of the approaches are hacky in my opinion, so to each his own. — Tom Fenech, May 30 '14 at 14:43
vi supports addresses with an offset: `:/{/+1,/}/-1 >`. Something similar might be an improvement for awk; as it stands, ranges are only easy for the simple cases. — Walter A, Feb 19 '15 at 15:23
That'd just make something that's trivial and rarely needed just slightly briefer at the cost of increased language bloat and down that path lies the dreaded p*** syntax. — Ed Morton, Feb 19 '15 at 16:25

Scrutinizer · Accepted Answer · 2014-05-29T17:41:33.297

15

Interesting. I also often start with a range expression and then later on switch to using a variable..

I think a situation where this could be useful, aside from the pure range-only situations, is if you want to print a match, but only if it lies in a certain range. It is also immediately obvious what it does. For example:

awk '/start/,/end/{if(/ppp/)print}' file

with this input:

start
dfgd gd
ppp 1
gfdg
fd gfd
end
ppp 2 
ppp 3
start
ppp 4
ppp 5
end
ppp 6
ppp 7
gfdgdgd

will produce:

ppp 1
ppp 4
ppp 5

One could of course also use:

awk '/start/{f=1} /ppp/ && f; /end/{f=0}' file

But it is longer and somewhat less readable..

edited May 29 '14 at 17:41

answered May 29 '14 at 17:28

Scrutinizer


OK, I'll think about that, thanks for the response. By default I would use `/start/{f=1} f{if (/ppp/) print; if (/end/) f=0}` for that since that's the obvious enhancement to my base solution of `/start/{f=1} f{print; if (/end/) f=0}`. – Ed Morton May 29 '14 at 17:54
+1: I also found the simple things we could do for instance print from a pattern to the end of the file by saying `awk '/patt/,0' file` instead of doing `awk '/patt/{p=1}p' file` – jaypal singh May 30 '14 at 01:26
I marked this answer accepted because I think at the end of the day it's just not a big deal and if there are times people prefer to use a range expression as a starting point, at least if/when the requirements change such that that no longer makes sense, it happens right away so they won't have that much code to re-write. It also means you can write an awk solution that looks like an equivalent sed solution and so it might help people not to be tempted to enhance a sed solution to do something complicated. Thanks all for the responses. – Ed Morton May 30 '14 at 13:52
@EdMorton. Thank you, and thank you for the discussion, I find it interesting. Your suggested standard approach `/start/{f=1} f{print; if (/end/) f=0}` perfectly mimics `/start/,/end/` while other approaches are perhaps more of an approximation. So I think it figures that this is good code to use if you want to be able to extend it later without a rewrite... – Scrutinizer May 30 '15 at 15:19

score 5 · Answer 2 · answered Jul 10 '15 at 14:46

While you are right that the /start/,/end/ range expression can easily be reimplemented with a conditional, it has many interesting use cases where it is used on its own. As you observe, it might have little value for processing tabular data, which is the main but not the only use case of awk.

So - does anyone have an example where a range expression actually adds noticeable value to a solution?

In the use cases mentioned, the range expression improves legibility. Here are a few examples where the range expression accurately selects the text to be processed. These are only a handful of examples, but there are countless similar applications, demonstrating the versatility of awk.

Filter logs within a time range

Assuming each log line starts with an ISO timestamp, the filter below selects all events in a given range of 1 hour:

awk '/^2015-06-30T12:00:00Z/,/^2015-06-30T13:00:00Z/'

Extract a document from a file

awk '/---- begin file.data ----/,/---- end file.data ----/'

This can be used to bundle resources with shell scripts (with cat), to extract parts of GPG-signed messages (prepared with --clearsign) or more generally of MIME-messages.
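
A minimal sketch of the bundling idea, assuming a temporary file stands in for the script that carries the embedded resource. The marker lines follow the command above; the payload lines and the function name `extract_demo` are made up for illustration.

```shell
# Hypothetical bundle: ordinary script text with an embedded "file.data" section.
extract_demo() {
    bundle=$(mktemp) || return 1
    cat > "$bundle" <<'EOF'
echo "this line belongs to the surrounding script"
---- begin file.data ----
alpha
beta
---- end file.data ----
echo "so does this one"
EOF
    # The range expression from the answer; the two marker lines are
    # part of the selected range and are printed as well.
    awk '/---- begin file.data ----/,/---- end file.data ----/' "$bundle"
    rm -f "$bundle"
}
```

Since the marker lines are included in the output, a real extraction would probably need one more step to strip them, which is where the flag-variable form from the question comes back into play.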

Process LaTeX files

The range pattern can be used to match LaTeX environments, so for instance we can select the abstracts of all articles in our directory:

awk '/begin{abstract}/,/end{abstract}/' *.tex

or all the theorems, to prepare a theorem database!

awk '/begin{theorem}/,/end{theorem}/' *.tex

or write a linter ensuring that theorems do not contain citations (if we regard this as bad style):

awk '
  /begin{theorem}/,/end{theorem}/ { if(/\\cite{/) { c+= 1 } }
  END { printf("There were %d bad-style citations.\n", c) }
'

or preprocess tables, etc.

The point is though that if you needed to do ANYTHING slightly more interesting then you'd need a complete rewrite or duplicate conditions. For example, try enhancing `awk '/begin{theorem}/,/end{theorem}/'` to simply not print the start and end lines of each block and you'll find you need to either duplicate the start and end conditions inside an action section and add an explicit print (`awk '/begin{theorem}/,/end{theorem}/{if (!(/begin{theorem}|end{theorem}/)) print}'`) or you'll need to redesign it to use a flag (`awk '/end{theorem}/{f=0} f; /begin{theorem}/{f=1}'`), so why not always use a flag? — Ed Morton, Jul 10 '15 at 15:07
The method to filter log-files is flawed as it requires that those two dates are in the log-file. If, for whatever reason they are not, the filtering will not work. — kvantour, Jul 10 '19 at 14:39
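
One way around the limitation raised in that comment, sketched under two assumptions: the timestamp is the first whitespace-separated field, and every timestamp uses the same fixed-width UTC format, so that plain string comparison agrees with chronological order. The log lines and function names are made up for illustration.

```shell
# Hypothetical log: no line carries exactly 12:00:00Z or 13:00:00Z.
log() {
    printf '%s\n' \
        '2015-06-30T11:59:58Z boot' \
        '2015-06-30T12:00:03Z login' \
        '2015-06-30T12:47:10Z logout' \
        '2015-06-30T13:00:01Z shutdown'
}

# Range expression from the answer: the start pattern never matches
# this input, so nothing is printed.
by_range() { log | awk '/^2015-06-30T12:00:00Z/,/^2015-06-30T13:00:00Z/'; }

# String comparison on the first field: selects 12:00:00Z <= t < 13:00:00Z
# whether or not the boundary timestamps appear in the log.
by_compare() {
    log | awk '$1 >= "2015-06-30T12:00:00Z" && $1 < "2015-06-30T13:00:00Z"'
}
```

With this input `by_range` prints nothing, while `by_compare` prints the `login` and `logout` lines.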
