3

A simple sed expression to extract a block of lines delimited by regular expressions from a text file looks like this:

$ sed -n -e '/start-regex/,/end-regex/ p' input_file

This selects lines from and including the line matching start-regex up to and including the line matching end-regex.

The line matching end-regex may be excluded like this:

$ sed -n -e '/start-regex/,/end-regex/ {/end-regex/d;p}' input_file
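
To make the difference between the two forms concrete, here is a minimal sketch. It assumes `seq 1 10` as a stand-in for input_file, with `/3/` and `/7/` standing in for start-regex and end-regex; none of these come from a real use case.

```shell
# Illustrative input: the numbers 1..10, one per line.
# /3/ and /7/ are stand-ins for start-regex and end-regex.

# Inclusive range: prints 3 through 7.
inclusive=$(seq 1 10 | sed -n -e '/3/,/7/ p')
printf '%s\n' "$inclusive"

# End line excluded by repeating the end regex: prints 3 through 6.
# The trailing semicolon before } keeps BSD sed happy as well.
no_end=$(seq 1 10 | sed -n -e '/3/,/7/ { /7/d; p; }')
printf '%s\n' "$no_end"
```

The second command is the form the question wants to avoid, because `/7/` has to be written twice.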

Is it possible to do this without repeating end-regex?

If it's possible to omit the last line, would it follow that it's also possible to omit the first and/or last line without repeating the regexes?

The reason for this question is to find a more efficient way of solving the problem than repeating expressions which can be complex and hard to read.

This question is about sed, and a single instance thereof, specifically. There may be ways to do this with pipelines of head, tail, awk, etc, but the question asks if this is possible using sed only.

There are a number of similar questions but they ask for solutions to specific use-cases rather than dealing with the generic problem at source.

Any solution should work with GNU sed.

starfry
  • See [How to select lines between two patterns?](http://stackoverflow.com/q/38972736/1983854) with a sed and awk solution on this. – fedorqui Aug 19 '16 at 10:38
  • that's interesting but it repeats the regex. – starfry Aug 19 '16 at 10:50
  • So with _without repeating `end-regex`_ you mean to have the `sed` command written in a way that the `end-regex` is just written once, right? – fedorqui Aug 19 '16 at 10:52
  • Yes, that is what I mean. – starfry Aug 19 '16 at 10:58
  • vim allows using `/pat1/+1,/pat2/-1` which works to an extent depending on where the cursor is and pattern used.. would certainly be nice to have similar in sed – Sundeep Aug 19 '16 at 11:22

3 Answers

3

Never use ranges for exactly this reason (they need a rewrite or duplicate conditions given the slightest requirements change). Use a flag instead:

awk '/start/{f=1} /end/{f=0} f' file

That means you cannot do this in any concise, portable way with sed; you need a tool like awk that supports variables. (There MAY be some bizarre combination of single-character runes that will do what you want in GNU sed, but if you think repeating the condition is complex and hard to read, wait till you see that!) With the above approach you can print anything from both delimiters to neither just by rearranging the 3 parts of the script (I added the {print} just for clarity rather than relying on the default behavior):

$ seq 1 10 | awk '/3/{f=1} f{print} /7/{f=0}'
3
4
5
6
7

$ seq 1 10 | awk 'f{print} /3/{f=1} /7/{f=0}'
4
5
6
7

$ seq 1 10 | awk '/3/{f=1} /7/{f=0} f{print}'
3
4
5
6

$ seq 1 10 | awk '/7/{f=0} f{print} /3/{f=1}'
4
5
6
Ed Morton
  • Using `awk` hardly meets the requirements of the question. – Jonathan Leffler Aug 20 '16 at 04:06
  • This breaks if the second expression also occurs before the first and/or it is a substring of the first. The key difference between this and the `sed` example that Jonathan Leffler provided is that the sed patterns define a range whereas the awk ones do not. (Although OT for the answer, I do think your contribution is interesting). – starfry Aug 20 '16 at 12:31
  • Note that Jonathan's answer doesn't do what you asked, i.e. to omit the end pattern. The 2nd expression occurring before the first has zero effect (try it) and it doesn't break if the 2nd expression is a substring of the first, you just have to write the correct expression, e.g. `/^7$/`. If you post an example that you think this pattern does not work for, I'll show you how to write it correctly. – Ed Morton Aug 20 '16 at 16:06
  • @JonathanLeffler - the OP asked if this can be done with sed and you and I both answered "no". Given the answer is "no you can't do what you wanted to do with sed", how do you suggest that answer could be improved? You went on to provide additional information which was how to do something she didn't want with sed while I went on to show how to do exactly what she DID want using awk. How is what I did less appropriate than what you did? If I were reading this in future I'd much rather see how to do what the subject asks for in a similarly available tool than just "can't be done". – Ed Morton Aug 20 '16 at 16:31
  • FWIW before @JonathanLeffler answered, I actually answered the question with sed. Yeah, it's not exactly intuitive to make it robust, but it starts out with a simple `{x;p}` which isn't all that hard to grok. – stevesliva Aug 20 '16 at 18:07
  • But your script doesn't do what's desired which is to just not print the last line of the range. Once you solve the problems of the first `{x;p}` script related to added blank lines and failing for multiple ranges, even the very last, arcane script in your answer `{x;/^$/! p;d};x'` doesn't do what the OP wants as it would delete any blank lines that were present within the range. Try `printf 'a\n\nb\nc\n' | sed -n -e '/a/,/c/{x;/^$/! p;d};x'` and you'll see the output is `a\nb` instead of `a\n\nb`. – Ed Morton Aug 20 '16 at 18:32
  • @EdMorton - you're right. Have to eat my words. Been enjoying the opportunity to try to MacGyver something with sed, but .... just can't quite do it. – stevesliva Aug 21 '16 at 13:55
  • Oh, I take that back too. I, as you well anticipated, came up with a completely stupid way to force sed to do it. Ha! – stevesliva Aug 21 '16 at 14:39
1

BSD and GNU sed both agree that you can omit both the first and the last line in the range without repeating either regex, but it is a tad quirky.

sed -n -e '/first-regex/,/second-pattern/ { //!p; }'

(BSD sed requires the semicolon; GNU sed doesn't mind whether it is there or not.)

The empty regex // reuses the last regular expression that was applied, and in this context that is either the first pattern (on the line that opens the range) or the second pattern (on every later line in the range, where it matches only on the closing line). Note that the ranges should be disjoint if there is more than one such range.

Given an input file called data (I happened to have this around from playing with another question):

0x0  = 0
0x1  = 1
0x2  = 2
0x3  = 3
0x4  = 4
0x5  = 5
0x6  = 6
0x7  = 7
0x8  = 8
0x9  = 9
0xA  = 0
0xB  = 11
0xC  = 12
0xD  = 13
0xE  = 14
0xF  = 15

you can run:

$ sed -n -e '/0x4/,/0xC/ { //!p; }' data
0x5  = 5
0x6  = 6
0x7  = 7
0x8  = 8
0x9  = 9
0xA  = 0
0xB  = 11
$

I've not yet found a way to omit one of the two patterns (the start or the end pattern) rather than both. My suspicion is that it cannot be done in sed without repeating one or the other regex.
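
One workaround when only one delimiter should go is to hold each pattern in a shell variable, so the regex text is written once even though sed still sees it twice. The sketch below is an illustration only: the variable names `start_re` and `end_re` are hypothetical, the sample lines merely imitate the data file above, and it assumes the patterns contain no `/` or characters the shell would expand inside double quotes.

```shell
# Hypothetical variable names; each regex is written once and reused via the shell.
start_re='0x4'
end_re='0xC'

# A few sample lines in the same style as the data file: 0x3 0x4 0x5 0xB 0xC 0xD.
sample=$(printf '0x%X\n' 3 4 5 11 12 13)

# Keep the start line, drop the end line; only the variable reference is repeated.
result=$(printf '%s\n' "$sample" |
  sed -n -e "/$start_re/,/$end_re/ { /$end_re/d; p; }")
printf '%s\n' "$result"
```

Double quotes are used here only because the script has to interpolate the variables; the script contains no `!`, so history expansion is not a concern.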

Jonathan Leffler
  • +1 for sticking to `sed`. Do you know of a clean way to present the `!` if the expression needs to be expressed using double quotes? I had to do this: `sed -n -e "/first-regex/,/second-pattern/ { //"'!'"p; }"` (I know going back to double quotes at the end isn't necessary but I wanted to illustrate the issue). I like your answer though - it's clean and simple. – starfry Aug 20 '16 at 10:21
  • For me, `set +H` works wonders, disabling the history I never use. If I wanted C shell, I'd use it! It isn't a problem in scripts (non-interactive shells). At the command line, I'd probably use a variant of what you're using, with double quotes around the range part and single quotes around the rest. You can also use a backslash before the exclamation mark to inhibit history expansion. – Jonathan Leffler Aug 20 '16 at 14:42
  • If I have variables holding the patterns, then repeating the pattern isn't a problem; I simply reference the variable again. – Jonathan Leffler Aug 20 '16 at 14:45
  • @starfry never enclose any script in double quotes, always single. If you need to access the value of a shell variable in any script then the syntax is `cmd 'foo '"$var"' bar'`, not `cmd "foo $var bar"`. – Ed Morton Aug 20 '16 at 16:12
  • @edmorton I had never heard that advice before, is that a generally accepted idiom? Are there any references to it? I am very guilty of `cmd "foo $var bar"` and the lazy-coder's preference for double quotes! – starfry Aug 21 '16 at 21:36
  • @starfry: There's a lot of justification for what Ed says, except when you need to interpolate shell variables into the `sed` script. – Jonathan Leffler Aug 21 '16 at 21:39
  • @starfry I'd have to google for a reference but I THINK it's common knowledge (and common sense) in shell programming to use single quotes unless you have a specific reason to use double quotes (e.g. to expand a shell variable) and fully understand all of the caveats, and to use double quotes unless you have a specific reason to use no quotes (e.g. globbing) and fully understand all of those caveats. In both of the latter 2 cases, do it only in the absolute minimum section of your code that's necessary to achieve your goal, to avoid undesirable/dangerous side effects. – Ed Morton Aug 22 '16 at 00:11
  • thanks @edmorton for that advice, I'll adjust my behaviour :) – starfry Aug 22 '16 at 07:10
0

The second example below is a sed-only answer that pads the output with blank lines. The third example gives exactly what has been asked for, provided you can choose a pattern that's never in the range that should be kept.

If, within your input file, the range matches only once, this works. It prints what you want starting with a blank line.

sed -n -e '/start-regex/,/end-regex/{x;p}' input-file

For each line in the range, x exchanges the line in the pattern space with the line in the hold space, and p prints the line pulled from the hold space. In effect, each line in the range prints the line that preceded it.

But, as said, that only works if the range occurs once. If the range occurs more than once, the line matching end-regex is still in the hold space.
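
Both behaviours can be seen in a small sketch. The inputs are made up for illustration: `seq 1 10` with `/3/` and `/7/` for the single-range case, and a hand-written list with START and END markers for the two-range case.

```shell
# Single range: each line in the range prints the previous one, so the
# output starts with a blank line (the initially empty hold space) and
# the line matching the end regex is never printed.
single=$(seq 1 10 | sed -n -e '/3/,/7/{x;p;}')
printf '%s\n' "$single"

# Two ranges: the first END line is still in the hold space when the
# second range begins, so it leaks into the output ahead of the second START.
twice=$(printf '%s\n' START a END skip START b END |
  sed -n -e '/START/,/END/{x;p;}')
printf '%s\n' "$twice"
```

The first command prints a blank line followed by 3 through 6. The second prints a blank line, then START, a, END, START, b, which shows the leftover END line described above.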

So instead, the script below empties out the lines outside the range, stuffs that empty line in the hold space with h, and then runs the x;p which will print a blank line for start-regex and nothing for end-regex:

sed -n -e '/start-regex/,/end-regex/! {s/.//g;h;};x;p' input-file

The above is the most general I can give. It retains blank lines within the range, but it is not a perfect solution because it inserts blank lines before the range:


start-regex line 1
  next line is blank...

  etc1

start-regex line 2
  etc2

To delete blank lines, you can change the final p to /^$/! p, but that will omit blank lines within the input-file range as well as the padding lines added before each range by the script. If you really can't stomach the added blank lines, you could always stick in a placeholder on the non-matching lines:

sed -n -e '/start-regex/,/end-regex/! {s/.*/OMITME/;h;};x;/OMITME/! p' input-file

And that still depends on OMITME not being a pattern in the range you want to keep. But you get the desired result:

start-regex line 1
  next line is blank...

  etc1
start-regex line 2
  etc2
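
As a quick check of the placeholder approach, here is a sketch on made-up input with two ranges, a blank line inside the first range, and lines outside both. The START and END markers and the surrounding lines are assumptions for illustration, and OMITME must not occur in any line you want to keep.

```shell
# Lines outside a range are overwritten with OMITME and copied to the hold
# space; x then swaps in the previous line, which is printed unless it is
# the placeholder. The END lines are never printed because the following
# outside line replaces them in the hold space (or the input simply ends).
kept=$(printf '%s\n' before START a '' b END after START c END |
  sed -n -e '/START/,/END/!{s/.*/OMITME/;h;};x;/OMITME/!p')
printf '%s\n' "$kept"
```

This prints START, a, a blank line, b, START, c: both start lines and the blank line inside the range are kept, and both END lines are dropped. Note that it relies on at least one line separating consecutive ranges.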
stevesliva
  • idk if this really applies to what you are doing with OMITME (too many mystic runes for my tastes) but FYI here is how you can **create** a placeholder string `aB` that is guaranteed to not exist in the input idiomatically using sed: `sed 's/a/aA/g; s/something/aB/g; do_stuff; s/aB/something/g; s/aA/a/g'`. See http://stackoverflow.com/a/38153467/1745001 for an application and explanation. – Ed Morton Aug 22 '16 at 00:27
  • @EdMorton - sure... just gets even more convoluted. Oh, and I forgot to mention that, since I'm deleting the content of every line outside the range, that this of course doesn't work for scripts that have multiple `/start/,/end/` ranges. – stevesliva Aug 22 '16 at 01:53