extract data between similar patterns

Question

I am trying to use sed to print the contents between two patterns including the first one. I was using this answer as a source.

My file looks like this:

>item_1
abcabcabacabcabcabcabcabacabcabcabcabcabacabcabc
>item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
>item_3
cdecde
>item_4
defdefdefdefdefdefdef

I want the output to start at item_2 (inclusive) and stop at the next occurrence of > (exclusive). So my code is sed -n '/item_2/,/>/{/>/!p;}'.

The result wanted is:

item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb

but I get it without item_2.

Any ideas?

oguz ismail · Answer 1 · 2020-03-23T13:49:31.500

score 4

Using awk, split the input on > and print the record(s) matching item_2.

$ awk 'BEGIN{RS=">";ORS=""} /item_2/' file
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
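
One caveat worth checking: /item_2/ is a substring match, so it would presumably also select a record named item_20. The sketch below is only an illustration under that assumption. It uses a shortened copy of the sample with a hypothetical item_20 record added, and the variable names (sample, loose, strict) are made up for the example. Since RS=">" makes each header the first whitespace-separated field of its record, comparing $1 gives an exact match.

```shell
# Shortened copy of the question's sample; the item_20 record is a hypothetical addition.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>item_1
abcabcabacabcabc
>item_2
bcdbcdbcdbcdb
>item_3
cdecde
>item_20
defdef
EOF

# Loose match: every record containing "item_2" is printed, so item_20 comes along.
loose=$(awk 'BEGIN{RS=">";ORS=""} /item_2/' "$sample")

# Exact match: with RS=">" the header is the first field of the record.
strict=$(awk 'BEGIN{RS=">";ORS=""} $1=="item_2"' "$sample")

printf '%s\n' "$strict"
rm -f "$sample"
```

If the headers in the real file can contain spaces, $1 would only hold the first word, so the comparison would need adjusting.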

edited Mar 23 '20 at 13:49

answered Mar 23 '20 at 13:41


ok this worked well, thanks. But do you know why `sed` didn't work? just in principle or it's just wrong? – plnnvkv Mar 23 '20 at 13:52

@Polina `/>/` matches the line containing `item_2` as well as the last line in the range (`>item_3`), so `/>/!p` skips both and prints only the line between them. – oguz ismail Mar 23 '20 at 13:54
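
This can be checked by printing what the range selects on its own and comparing it with the output of the original command. A minimal sketch, assuming a shortened copy of the sample file (the variable names sample, range and original are only for illustration):

```shell
# Shortened copy of the question's sample.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>item_1
abcabc
>item_2
bcdbcd
>item_3
cdecde
EOF

# Lines selected by the range alone: both ">" header lines are inside it.
range=$(sed -n '/item_2/,/>/p' "$sample")

# The original command: the inner filter drops every line containing ">",
# which removes the >item_2 header together with >item_3.
original=$(sed -n '/item_2/,/>/{/>/!p;}' "$sample")

printf 'range:\n%s\noriginal:\n%s\n' "$range" "$original"
rm -f "$sample"
```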

luciole75w · Accepted Answer · 2020-03-24T23:48:55.917

I would go for the awk method suggested by oguz for its simplicity. If you are curious about a sed way, though, you could fix what you have already tried with a minor change:

sed -n '/^>item_2/ s/.// ; //,/>/ { />/! p }' input_file

The empty regex // recalls the previous regex, which is handy here to avoid duplicating /^>item_2/. But keep in mind that // is actually dynamic: it recalls the latest regex evaluated at runtime, which is not necessarily the closest regex on its left (although it often is). Depending on the program flow (branching, address range), the content of the same // can change and... actually here we have an interesting example! (and I'm not saying that because it's my baby ^^)

On a line where /^>item_2/ matches, the s/.// command is executed and the latest regex before // becomes /./, so the following address range is equivalent to /./,/>/.

On a line where /^>item_2/ does not match, the latest regex before // is /^>item_2/ so the range is equivalent to /^>item_2/,/>/.

To avoid confusion here as the effect of // changes during execution, it's important to note that an address range evaluates only its left side when not triggered and only its right side when triggered.
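
The dynamic behaviour of // can be seen in isolation with a toy input. This is a minimal sketch that assumes GNU sed. The two input lines and the variable name dynamic are made up for the illustration.

```shell
# "foo" matches /foo/, so s/o/0/ runs and // then means /o/:
#   the line is now "f0o", which still contains "o", so it is printed.
# "bob" does not match /foo/, so the s command is skipped and // still means /foo/:
#   nothing is printed, although "bob" contains "o".
dynamic=$(printf 'foo\nbob\n' | sed -n '/foo/ s/o/0/; //p')

printf '%s\n' "$dynamic"
```

If // were simply a static copy of /o/, the second line would be printed as well.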

Does the OP require the removal of the `>` before `item_2`? What if there were 2 successive `item_2`'s? — potong, Mar 24 '20 at 12:02
@potong Ah that's right, I focused on the original sed command and did not notice that the leading `>` was not in the expected output. Well, as the answer is accepted, I guess that it was not a major concern but it's quite easy to fix so in doubt, and also to deal with possibly successive `item_2` patterns as you pointed out, I'll update the answer. Thanks for the comments. — luciole75w, Mar 24 '20 at 21:56

score 0 · Answer 3 · answered Mar 23 '20 at 14:28

This might work for you (GNU sed):

sed -n ':a;/^>item_2/{s/.//;:b;p;n;/^>/!bb;ba}' file

Turn off implicit printing with -n.

If a line begins with >item_2, remove the first character, print the line and fetch the next line.

If that line does not begin with >, repeat the last two instructions.

Otherwise, repeat the whole set of instructions.
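
The same script can be spread over several lines with one comment per step, which may make the two loops easier to follow. A sketch assuming GNU sed and a shortened sample in which a second, hypothetical >item_2 record directly follows the first, to exercise the outer loop. The variable names sample and result are only for illustration.

```shell
# Shortened sample; the second >item_2 record is a hypothetical addition.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>item_1
abcabc
>item_2
bcdbcd
>item_2
eeefff
>item_3
cdecde
EOF

result=$(sed -n '
# a: entry point, also reached again after a new ">" header has been fetched
:a
/^>item_2/{
  # strip the leading ">"
  s/.//
  # b: print the current line, then fetch the next one
  :b
  p
  n
  # not a header: go back to b and keep printing
  /^>/!bb
  # a header: go back to a and test it against /^>item_2/ again
  ba
}
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```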

If there will always be only one line following >item_2, then:

sed '/^>item_2/!d;s/.//;n' file
