Deleting the block between two regex markers when a pattern is matched inside the block

Question

Let's suppose the following structure:

  -   key1: value11
      key2:
      - value21
      - value22
      - value23
      key3: value31
      key4:
      - value41
      - value42
      key5: value51
  -   key1: value12
      key2:
      - value24
      - value25
      key3: value32
      key5: value52
  -   key1: value13
      key2:
      - value26
      key3: value33
      key4:
      - value43
      - value44
      - value45
      key5: value53

Is it possible to remove all the blocks between (and including) the begin and end marker regexes:

 - begin marker: '^[[:blank:]]{2}-[[:blank:]]{3}key1:[[:blank:]].+$'
 - end marker:   '^[[:blank:]]{6}key5:[[:blank:]].+$'

when the following regex is matched inside the block(s):

matching pattern: '^[[:blank:]]{6}key3:[[:blank:]]value32$'?

The goal is to obtain:

  -   key1: value11
      key2:
      - value21
      - value22
      - value23
      key3: value31
      key4:
      - value41
      - value42
      key5: value51
  -   key1: value13
      key2:
      - value26
      key3: value33
      key4:
      - value43
      - value44
      - value45
      key5: value53

The begin marker could also serve as an end marker if the second marker occurrence is not deleted during the block removal(s).

I have tried multiple approaches with sed/awk without success, such as this one, inspired by paragraph 4.21 of this post:

sed ':t
/^[[:blank:]]{2}-[[:blank:]]{3}key1:[[:blank:]].+$/,/^[[:blank:]]{6}key5:[[:blank:]].+$/ {      # For each line between these block markers
        /^[[:blank:]]{6}key5:[[:blank:]].+$/!{                                                  # If we are not at the /end/ marker
                $!{                                                                             # nor the last line of the file
                        N;                                                                      # add the Next line to the pattern space
                        bt
                }                                                                               # and branch (loop back) to the :t label
        }                                                                                       # This line matches the /end/ marker
        /^[[:blank:]]{6}key3:[[:blank:]]value32$/d;                                             # If /regex/ matches, delete the block
}' file

score 2 · Accepted Answer · answered Jan 30 '20 at 00:53

The file looks like YAML, so why not use yq to filter it? Then you can just say:

yq -y '[ .[] | select (.key3 != "value32") ]' file

which results in:

- key1: value11
  key2:
  - value21
  - value22
  - value23
  key3: value31
  key4:
  - value41
  - value42
  key5: value51
- key1: value13
  key2:
  - value26
  key3: value33
  key4:
  - value43
  - value44
  - value45
  key5: value53

You may need to install yq with pip install yq or something similar.

answered Jan 30 '20 at 00:53

tshiono

You're right, this is a YAML file and yq is potentially a perfect candidate for that job. I previously dismissed Mike Farah's yq due to too many currently unsolved issues (https://github.com/mikefarah/yq/issues), but the yq you're pointing to (Andrey Kislyuk's tool) seems to do a better job and deserves to be considered. Its usage seems very simple; I just need to find a way to use it with a variable value (value32 in the example). – jean-christophe manciot Feb 01 '20 at 10:36

Ed Morton · Answer 2 · 2020-01-29T20:35:20.397

sed is the right tool for doing s/old/new/ on individual strings, that is all. For anything more interesting you should be using awk for clarity, portability, robustness, efficiency, etc.

You don't actually need the first regexp you specify given the sample input/output you posted, e.g. with GNU awk for multi-char RS and RT:

awk -v RS='[[:blank:]]{6}key5:[[:blank:]][^\n]+\n' -v ORS= '
    !/\n[[:blank:]]{6}key3:[[:blank:]]value32\n/{ print $0 RT }
' file
  -   key1: value11
      key2:
      - value21
      - value22
      - value23
      key3: value31
      key4:
      - value41
      - value42
      key5: value51
  -   key1: value13
      key2:
      - value26
      key3: value33
      key4:
      - value43
      - value44
      - value45
      key5: value53

or with any awk:

awk '
{ rec = rec $0 ORS }
/^[[:blank:]]{6}key5:[[:blank:]].+$/ {
    if ( rec !~ /\n[[:blank:]]{6}key3:[[:blank:]]value32\n/ ) {
        printf "%s", rec
    }
    rec=""
}
' file
  -   key1: value11
      key2:
      - value21
      - value22
      - value23
      key3: value31
      key4:
      - value41
      - value42
      key5: value51
  -   key1: value13
      key2:
      - value26
      key3: value33
      key4:
      - value43
      - value44
      - value45
      key5: value53

but you can use that first regexp too if you like, e.g.:

awk '
/^[[:blank:]]{2}-[[:blank:]]{3}key1:[[:blank:]].+$/ { inBlock=1 }
inBlock { rec = rec $0 ORS }
/^[[:blank:]]{6}key5:[[:blank:]].+$/ {
    if ( rec !~ /\n[[:blank:]]{6}key3:[[:blank:]]value32\n/ ) {
        printf "%s", rec
    }
    rec=""
    inBlock=0
}
' file
  -   key1: value11
      key2:
      - value21
      - value22
      - value23
      key3: value31
      key4:
      - value41
      - value42
      key5: value51
  -   key1: value13
      key2:
      - value26
      key3: value33
      key4:
      - value43
      - value44
      - value45
      key5: value53

I really like the simplicity of the first awk solution; could you explain why the '^' has been removed from the regex and the format of the ORS awk variable? — jean-christophe manciot, Jan 31 '20 at 14:40
Also, why can't I replace value32 in the matching pattern with ${var} and the simple quotes by double quotes? — jean-christophe manciot, Jan 31 '20 at 15:01
`^` means "start of string" (sometimes mis-stated as "start of line" because the string being processed is often a single line, just as people often say that `$` means "end of line" when it actually means "end of string"). The string in question is a multi-line block of text starting at the `- key1:` line, so it would be wrong to look for the `key3:` line at the start of it; it is in the middle of it. With `ORS=` I'm setting `ORS` to the null string so awk doesn't add a newline after my `print` statement, since the newline is already printed as part of `RT`. — Ed Morton, Jan 31 '20 at 15:07
Awk is not shell, it's a completely separate tool with its own syntax, semantics, and context. `${var}` is how in shell you'd get the value of a shell variable - you can no more do that in an awk script you call from shell than you could in a C program you call from shell. You should always enclose shell scripts and strings in single quotes unless you **need** double, see https://mywiki.wooledge.org/Quotes for how quotes work in shell and see https://stackoverflow.com/q/19075671/1745001 for how to use the value of a shell variable inside an awk script. — Ed Morton, Jan 31 '20 at 15:11
`awk -v val="$var" '... if (rec !~ ("\n[[:blank:]]{6}key3:[[:blank:]]" val "\n") ) ...` — Ed Morton, Jan 31 '20 at 15:13
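
Putting the comments above together, a complete version of the "any awk" script that takes the value from a shell variable might look like the sketch below. The function name `drop_blocks` is hypothetical. Two things differ from the answer and are assumptions: the six-character indentation is written as literal spaces rather than `[[:blank:]]{6}` (so the sketch does not depend on interval-expression support, and assumes the file is indented with spaces), and the value is compared literally with `index()` rather than as a regex, so regex metacharacters in the value are not interpreted.

```shell
# Hypothetical helper: drop every block whose key3 line equals the given value.
# $1: key3 value that marks a block for deletion, $2: input file
drop_blocks() {
    awk -v val="$1" '
    { rec = rec $0 ORS }                 # accumulate the current block
    /^      key5: ./ {                   # end marker: the block is complete
        # The key3 line sits in the middle of rec, so it is delimited by
        # newlines on both sides; "^" would only match at the start of rec.
        if ( index(rec, "\n      key3: " val "\n") == 0 ) {
            printf "%s", rec
        }
        rec = ""
    }
    ' "$2"
}
```

It would be called as `drop_blocks "$var" file`. Note that awk still processes backslash escape sequences in a `-v` assignment, so a value containing backslashes would need another way in (for example `ENVIRON`).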

potong · Answer 3 · score 1 · 2020-01-31T15:34:40.613

This might work for you (GNU sed):

sed -E '/^\s{2}-\s{3}key1:\s/{:a;N;/^\s{6}key5:\s/M!ba;/^\s{6}key3:\svalue32$/Md}' file

Gather up a group of lines between key1 and key5 and if the group contains the desired string, delete the entire group.

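N.B. The M flag enables multi-line matching: `^` and `$` then match at the start and end of each line within the pattern space, not only at the start and end of the whole pattern space.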

In essence:

sed '/key1/{:a;N;/key5/!ba;/key3.*value32$/Md}' file
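
For readers who find the one-liner dense, the same commands can be laid out one per line with comments. This is only a sketch of the command above, wrapped in a hypothetical function named `drop_value32_blocks`; like the original it assumes GNU sed (for `\s` and the `M` flag).

```shell
# Hypothetical wrapper around the GNU sed one-liner, one command per line.
# $1: input file
drop_value32_blocks() {
    sed -E '
    # A key1 line opens a block.
    /^\s{2}-\s{3}key1:\s/ {
        :a
        # Append the next input line to the pattern space.
        N
        # Loop until some line in the pattern space is a key5 line.
        /^\s{6}key5:\s/M!ba
        # The whole block is now buffered: delete it if it holds the key3 line.
        /^\s{6}key3:\svalue32$/Md
    }
    ' "$1"
}
```

Blocks that survive the last test reach the end of the script and are printed by sed's normal auto-print.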

edited Jan 31 '20 at 15:34

answered Jan 30 '20 at 08:36

potong

score 0 · Answer 4 · answered Jan 29 '20 at 22:15

If you really want sed, you can store the range in the hold space, then print the hold space if and only if it doesn't contain the string that should exclude the whole range:

/^[[:blank:]]{2}-[[:blank:]]{3}key1:[[:blank:]].+$/,/^[[:blank:]]{6}key5:[[:blank:]].+$/{
   /^[[:blank:]]{2}-[[:blank:]]{3}key1:[[:blank:]].+$/h
   //!H
   /^[[:blank:]]{6}key5:[[:blank:]].+$/{
     g
     /\n[[:blank:]]{6}key3:[[:blank:]]value32\n/!p
   }
   d
}

The above must be run with sed -Ef cmdfile file.

One of several annoyances with this is having to repeat the patterns.
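
One possible way around the repetition is to keep each regex in a shell variable and assemble the sed script from them. The sketch below does that; the function name `drop_blocks_hold` and the variable names are hypothetical, and it assumes a sed that accepts `-E` and `\n` inside a regex (GNU sed does). The value is spliced into a regex, so it must not contain `/` or regex metacharacters.

```shell
# Hypothetical helper: same hold-space logic, with each regex written once.
# $1: key3 value that marks a block for deletion, $2: input file
drop_blocks_hold() {
    begin='^[[:blank:]]{2}-[[:blank:]]{3}key1:[[:blank:]].+$'
    end='^[[:blank:]]{6}key5:[[:blank:]].+$'
    # Single quotes keep \n as backslash-n, which sed reads as a newline.
    match='\n[[:blank:]]{6}key3:[[:blank:]]'"$1"'\n'
    # printf turns \n in its format into real newlines; the %s arguments
    # are inserted verbatim, so the regexes reach sed unchanged.
    script=$(printf '/%s/,/%s/{\n/%s/h\n//!H\n/%s/{\ng\n/%s/!p\n}\nd\n}\n' \
        "$begin" "$end" "$begin" "$end" "$match")
    sed -E "$script" "$2"
}
```

It would be called as `drop_blocks_hold value32 file`. Building the script with a single-quoted `printf` format also keeps the `!` characters away from interactive history expansion.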
