Why is this sed command only working on every other match?

Question

Here's a sed command that works, but only on every other match (simplified for your convenience):

cat testfile.txt | sed -E "/PATTERN/,/^>/{//!d;}"

If my testfile.txt is:

>PATTERN
1 
2
3

>PATTERN
a
b
c

>PATTERN
1 
2
3

>PATTERN
a
b
c

>asdf
1
2
3

>asdf
a
b
c

Expected output:

>PATTERN
>PATTERN
>PATTERN
>PATTERN
>asdf
1
2
3

>asdf
a
b
c

Actual output:

>PATTERN
>PATTERN
a
b
c

>PATTERN
>PATTERN
a
b
c

>asdf
1
2
3

>asdf
a
b
c

-An aside-

(The actual goal is to find any one of a group of patterns, then delete everything that comes after it until the next occurrence of a ">" symbol. I also want to delete the matching line itself, which I can do by piping to grep -v.)

I more or less got guidance by following what I found here, and I've had this work for me. Here's the exact code I'm using (not that you have the files to run it):

for line in $(cat bad_results.txt)
do
       echo "removing $line"
       cat 16S.fasta | sed  "/$line/,/^>/{//!d;}" | grep $line -v > temp_stor.fasta
done

[How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) — glenn jackman, Jan 30 '21 at 04:31
No answers in the question, please. I have rolled back/edited your question and removed the answer. Add the answer in the answer section only. — Sabito stands with Ukraine, Jan 30 '21 at 06:44

Sundeep · Answer 1 · 2021-01-30T06:08:12.510

2

/PATTERN/,/^>/ will match from a line containing PATTERN to a line starting with > (which can be a line containing PATTERN). You should instead match an empty line, like so:

$ sed '/PATTERN/,/^$/{/PATTERN/!d}' ip.txt
>PATTERN
>PATTERN
>PATTERN
>PATTERN
>asdf
1
2
3

>asdf
a
b
c

Your aside isn't very clear to me, but if you want to delete the line with PATTERN as well, you can simplify it to:

$ sed '/PATTERN/,/^$/d' ip.txt
>asdf
1
2
3

>asdf
a
b
c

You can also use:

awk -v RS= -v ORS='\n\n' '!/PATTERN/'

but it will have an extra empty line at the end of the output. The advantage is that instead of your for loop, you can do this:

awk 'BEGIN{FS="\n"; ORS="\n\n"}
     NR==FNR{a[">" $0]; next}
     !($1 in a)' bad_results.txt RS= 16S.fasta

The above code stores each line of bad_results.txt as a key in an associative array, with a > character prefixed. A record from 16S.fasta is then printed only if its first line (the header starting with >) is not one of those keys, so the whole header line has to match.
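A minimal way to try the exact-match version on throwaway data. The sequence IDs and file contents here are made up for illustration; only the awk program itself comes from the answer above:

```shell
# Hypothetical sample data in a scratch directory
tmp=$(mktemp -d)
printf '%s\n' 'seqB' > "$tmp/bad_results.txt"
printf '%s\n' '>seqA' 'AAAA' '' '>seqB' 'CCCC' '' '>seqC' 'GGGG' > "$tmp/16S.fasta"

# Same awk program as above: drop every blank-line-separated record
# whose header line is ">" followed by a line from bad_results.txt
kept=$(awk 'BEGIN{FS="\n"; ORS="\n\n"}
     NR==FNR{a[">" $0]; next}
     !($1 in a)' "$tmp/bad_results.txt" RS= "$tmp/16S.fasta")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

Only the seqA and seqC records should remain. Capturing the output with $(...) also strips the trailing empty lines mentioned earlier.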

If you want a partial match:

awk 'BEGIN{FS="\n"; ORS="\n\n"}
     NR==FNR{a[$0]; next}
     {for (k in a) if(index($1, k)) next; print}' bad_results.txt RS= 16S.fasta
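The comments below ask what to do when records are not separated by blank lines. Paragraph mode (RS=) does not apply in that case. One possible sketch, which is not part of the original answer, keeps a skip flag that is recomputed at every header line. The sample IDs and file contents are hypothetical:

```shell
# Hypothetical sample data: no blank lines, two bad records in a row
tmp=$(mktemp -d)
printf '%s\n' 'seqB' > "$tmp/bad_results.txt"
printf '%s\n' '>seqA' 'AAAA' '>seqB_x' 'CCCC' '>seqB_y' 'TTTT' '>seqC' 'GGGG' > "$tmp/16S.fasta"

filtered=$(awk '
    NR==FNR { if (length($0)) bad[$0]; next }   # first file: remember non-empty IDs
    /^>/    { skip = 0                          # header line: decide for this record
              for (k in bad) if (index($0, k)) { skip = 1; break } }
    !skip                                       # print lines of records we keep
' "$tmp/bad_results.txt" "$tmp/16S.fasta")

printf '%s\n' "$filtered"
rm -rf "$tmp"
```

Because the decision is made afresh on each header, consecutive matching records are both dropped, which is the case the range-based sed command misses. This is a partial match via index(); for whole-header matching the loop could be replaced with a lookup such as (substr($0, 2) in bad).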

edited Jan 30 '21 at 06:08

answered Jan 30 '21 at 05:37

Sundeep


I'm sorry about the confusing aside. I guess I wanted to add background which may help. It's interesting that it matches to the empty line. Why does my original "skip" and only work on every other occurrence of `>` ? – Tclack88 Jan 30 '21 at 05:49
I'd edited with `awk` solution (last one) which is what I *think* you want, can you check and confirm? – Sundeep Jan 30 '21 at 05:51
I'm not sure I fully understand. It ought to match to any line that just _starts_ with a `>`. Unless you're suggesting that it matches up to *and including* that line and then sed moves on. e.g.: ``` >PATTERN_A ------- | a ------ sed | a ------ sees this. | >PATTERN_B ----- as one chunk | b b >PATTERN_C ---- sed sees c ---- this as the next occurrence c ``` – Tclack88 Jan 30 '21 at 05:59
yeah, the whole line matches since `sed` works line by line basis by default – Sundeep Jan 30 '21 at 06:01
one final question: If there's not a blank line (imagine the original provided example but without spacing). Is there a way to match up to the first occurrence of a `>` then delete what occurs between and also continue (so if >PATTERN occurs twice in a row it can be removed?) – Tclack88 Jan 30 '21 at 06:07
Yeah, you can do it, but you need to change the question or ask another one... also, describe only your full usecase (the one with bad_results.txt and for loop) AND clearly mention if you want to match whole line or partial line... (example: if bad_results.txt has `xyz`, should you match only `>xyz` or `>ABCDxyz123` can also match) – Sundeep Jan 30 '21 at 06:16
I appreciate it. You've solved my particular dilemma and have explained why it appeared to be skipping, I was just hoping to find a more general solution in the instance that I didn't "luck out" with a blank line – Tclack88 Jan 30 '21 at 06:25
@Tclack88 here's an example of what you are looking for: https://stackoverflow.com/questions/63826761/how-to-do-multiple-match-and-print-different-number-of-lines-after-each-pattern ... – Sundeep Jan 30 '21 at 06:33

potong · Answer 2 · 2021-01-31T11:43:01.073

This might work for you (GNU sed):

sed -E '/PATTERN/{p;:a;$!{N;/\n>/!s/\n//;ta};D}' file

As has already been stated, the range operator matches from a line containing PATTERN to the next line beginning with >. That closing line may itself contain PATTERN, but because it has already been used to end the range it cannot also start a new one, hence the alternating pattern.

The solution above does not use the range operator; instead it gathers the lines from the first line containing PATTERN up to the line before the next line beginning with >.

If a line contains PATTERN it is printed, then subsequent lines are collected until the end-of-file or a line begins >.

Within this collection, newlines are removed - essentially making the first line in the pattern space the concatenation of one or more lines.

On a match (or end-of-file) this long line is removed and any line still in the pattern space is processed as if it had been read in as part of the normal sed cycle.

N.B. The difference between the d and D commands: d deletes the whole pattern space and immediately begins the next sed cycle, which involves reading in the next line of input. D removes everything up to and including the first newline in the pattern space and then begins the next cycle; if the pattern space is still not empty, sed skips reading the next input line and restarts the cycle with what remains.
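A small way to see the difference, assuming GNU sed. With -n and the l command, the pattern space is shown once per cycle, so d yields disjoint pairs of lines while D yields a sliding window. The four-line input is made up for illustration:

```shell
# d: the pattern space is discarded, so the next cycle reads a fresh line
with_d=$(printf '%s\n' 1 2 3 4 | sed -n 'N;l;d')

# D: only the first line is dropped; the remainder starts the next cycle
with_D=$(printf '%s\n' 1 2 3 4 | sed -n 'N;l;D')

printf 'd:\n%s\nD:\n%s\n' "$with_d" "$with_D"
```

With d, the l command sees lines 1+2 and then 3+4. With D, it sees 1+2, 2+3 and 3+4, because each cycle restarts with the line that was left over.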

An alternative:

sed '/^>/{h;/^>PATTERN/p};G;/\n>PATTERN/!P;d' file

score 1 · Answer 3 · answered Jan 30 '21 at 16:25

In your range pattern match, the second element 'consumes' the line so that the start of the range no longer sees that block as a match. This is why you apparently have 'skipping.' This can be fixed by using a lookahead that does not consume characters to match. Unfortunately, sed lacks lookaheads.

Perl is really a better choice than sed for complex multi-line matches involving lookaheads.

Here is a Perl one-liner that reads the whole file and applies the regex /(?:^>PATTERN)|(?:^>[\s\S]*?)(?=\v?^>|\z)/ (Demo) to it:

$ perl -0777 -lnE 'while(/(?:^>PATTERN)|(?:^>[\s\S]*?)(?=\v?^>|\z)/gm) { say $& }' file
>PATTERN
>PATTERN
>PATTERN
>PATTERN
>asdf
1
2
3

>asdf
a
b
c

Aside: Please read Looping through the content of a file in Bash. The way you are doing it is not ideal. Specifically, read here on the side effects of using cat in a Bash loop.
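As a rough illustration of that aside (the file contents here are hypothetical), a for loop over $(cat ...) splits on any whitespace, whereas while read takes one whole line per iteration:

```shell
# Hypothetical input: the first line contains a space
tmp=$(mktemp -d)
printf '%s\n' 'id one' 'id_two' > "$tmp/bad_results.txt"

# for + $(cat ...): word splitting turns "id one" into two separate items
for_count=0
for line in $(cat "$tmp/bad_results.txt"); do
    for_count=$((for_count + 1))
done

# while + read -r: one iteration per line, whitespace and backslashes preserved
read_count=0
while IFS= read -r line; do
    read_count=$((read_count + 1))
done < "$tmp/bad_results.txt"

echo "for: $for_count items, while read: $read_count lines"
rm -rf "$tmp"
```

For a file whose lines never contain whitespace or glob characters the two loops behave the same, so this matters mainly as a safer default.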

score 0 · Accepted Answer · answered Jan 30 '21 at 06:59

To answer the question of why it seemed to be skipping every other occurrence (as fleshed out in the comments on Sundeep's answer; see his answer for a workaround):

The apparent skipping was just an illusion. sed finds the first occurrence of PATTERN and treats everything up to and including the next line starting with > as one range, then deletes everything in between (as instructed). It then continues from the line after the range, so it never tests that closing line as a new occurrence of PATTERN.

To be clear:

>PATTERN     <--- sed sees the first occurrence here-------------------|
a                                                                      |(this whole
a                                                                      |chunk is
a                                                                      |considered
                                                                       |by sed)
>PATTERN     <--- then matches up to here (the next occurrence of ">")-|
b            <--- then continues from here "missing" the match of PATTERN above
b
b

>PATTERN
c
c
c
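The same behaviour can be checked directly. This sketch tags every line that sed treats as part of a /PATTERN/,/^>/ range, using a made-up six-line input without blank lines:

```shell
# Prefix each line inside a /PATTERN/,/^>/ range with a marker
marked=$(printf '%s\n' '>PATTERN' a '>PATTERN' b '>PATTERN' c |
    sed '/PATTERN/,/^>/s/^/[in range] /')

printf '%s\n' "$marked"
```

The second >PATTERN closes the first range, so the b after it is the only line left unmarked; the third >PATTERN then opens a new range that runs to the end of the input.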
