How can I get "grep -zoP" to display every match separately?

Question

I have a file in this form:

X/this is the first match/blabla
X-this is
the second match-

and here we have some fluff.

And I want to extract everything that appears after "X" and between the same markers. So if I have "X+match+", I want to get "match", because it appears after "X" and between the marker "+".

So for the given sample file I would like to have this output:

this is the first match

and then

this is
the second match

I managed to get all the content between X followed by a marker by using:

grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file

That is:

grep -Po '(?<=X(.))(.|\n)+(?=\1)' to match X followed by (something) that gets captured and matched at the end with (?=\1) (I based the code on my answer here).
Note I use (.|\n) to match anything, including a new line, and that I also use -z in grep to match new lines as well.

So this works well, the only problem comes from the display of the output:

$ grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
this is the first matchthis is
the second match

As you can see, all the matches run together: "this is the first match" is followed directly by "this is the second match" with no visible separator. I know this comes from using "-z", which treats the whole file as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline (quoting "man grep").
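The NUL terminators are present in the output even though a terminal does not render them. A minimal sketch that counts them, assuming GNU grep built with PCRE support; the temporary file and the variable names (sample, nul_count) are only for illustration:

```shell
# Recreate the sample file from the question in a temporary location.
sample=$(mktemp)
printf '%s\n' \
  'X/this is the first match/blabla' \
  'X-this is' \
  'the second match-' \
  '' \
  'and here we have some fluff.' > "$sample"

# tr -cd '\0' deletes every byte except NUL, so wc -c counts the terminators.
nul_count=$(grep -zoP '(?<=X(.))(.|\n)+(?=\1)' "$sample" | tr -cd '\0' | wc -c)
echo "NUL terminators in output: $((nul_count))"

rm -f "$sample"
```

Each match printed by grep -zo ends with one NUL byte, so the count equals the number of matches.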

So: is there a way to get all these results separately?

I also tried GNU Awk:

awk 'match($0, /X(.)(\n|.*)\1/, a) {print a[1]}' file

but not even the (\n|.*) worked.

See [Reading output of a command into an array in Bash](https://stackoverflow.com/questions/11426529/reading-output-of-a-command-into-an-array-in-bash) — Wiktor Stribiżew, Nov 23 '20 at 13:00

Sundeep · Answer 1 · 2020-11-24T02:56:28.990

4

awk doesn't support backreferences within regexp definition.

Workarounds:

$ grep -zPo '(?s)(?<=X(.)).+(?=\1)' ip.txt | tr '\0' '\n'
this is the first match
this is
the second match

# with ripgrep, which supports multiline matching
$ rg -NoUP '(?s)(?<=X(.)).+(?=\1)' ip.txt
this is the first match
this is
the second match

You can also use (?s)X(.)\K.+(?=\1) instead of (?s)(?<=X(.)).+(?=\1). You may also want a non-greedy quantifier (.+? instead of .+): for an input such as X+match+xyz+foobaz+, the greedy version matches match+xyz+foobaz rather than match.
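The difference between the greedy and the non-greedy quantifier can be checked on that one-line input. This is a small sketch assuming GNU grep with -P support; the variable names are illustrative:

```shell
input='X+match+xyz+foobaz+'

# Greedy: .+ extends to the LAST occurrence of the captured marker.
greedy=$(printf '%s\n' "$input" | grep -oP '(?<=X(.)).+(?=\1)')

# Lazy: .+? stops at the FIRST occurrence of the captured marker.
lazy=$(printf '%s\n' "$input" | grep -oP '(?<=X(.)).+?(?=\1)')

printf 'greedy: %s\nlazy:   %s\n' "$greedy" "$lazy"
```

The greedy pattern yields match+xyz+foobaz, while the lazy one yields match.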

With perl

$ perl -0777 -nE 'say $& while(/X(.)\K.+(?=\1)/sg)' ip.txt
this is the first match
this is
the second match

edited Nov 24 '20 at 02:56

answered Nov 23 '20 at 13:16

Sundeep


Excellent, many thanks, the key was on replacing that `\0` when found, I didn't notice that character was being provided in the output. – fedorqui Nov 23 '20 at 13:52

Ed Morton · Answer 2 · 2020-11-23T18:03:52.283

With GNU awk for multi-char RS, RT, and gensub() and without having to read the whole file into memory:

$ awk -v RS='X.' 'NR>1{print "<" gensub(end".*","",1) ">"} {end=substr(RT,2,1)}' file
<this is the first match>
<this is
the second match>

Obviously I added the "<" and ">" so you could see where each output record starts/ends.

The above assumes that the character after X is not a regexp metacharacter other than a repetition operator (e.g. it is not ., ^ or [), because that character is used as part of a regexp in gensub(), so YMMV.

score 3 · Answer 3 · answered Nov 23 '20 at 15:21


Here is another gnu-awk solution making use of RS and RT:

awk -v RS='X.' 'ch != "" && (n = index($0, ch)) {
   print substr($0, 1, n-1)
}
RT {
   ch = substr(RT, 2, 1)
}' file

this is the first match
this is
the second match

answered Nov 23 '20 at 15:21

anubhava


It took reading this to realize I was making an assumption about the char not being a regexp metachar - this is more robust. – Ed Morton Nov 23 '20 at 18:04

score 2 · Accepted Answer · answered Nov 23 '20 at 13:23


The use case is kind of problematic, because as soon as you print the matches, you lose the information about where exactly the separator was. But if that's acceptable, try piping to xargs -r0.

grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file | xargs -r0

These options are GNU extensions, but then so is grep -z and (mostly) grep -P, so perhaps that's acceptable.
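With no command given, xargs runs echo once with all the items, so the matches end up joined by spaces. One possible variation, sketched here under the assumption that GNU grep, GNU xargs and an external printf are available, passes the items to printf so that each match is framed visibly (the angle brackets and variable names are illustrative):

```shell
# Recreate the sample file from the question in a temporary location.
sample=$(mktemp)
printf '%s\n' \
  'X/this is the first match/blabla' \
  'X-this is' \
  'the second match-' \
  '' \
  'and here we have some fluff.' > "$sample"

# printf reuses its format string for every argument xargs hands over,
# so each NUL-terminated match is printed as <match> on its own.
framed=$(grep -zPo '(?<=X(.))(.|\n)+(?=\1)' "$sample" | xargs -r0 printf '<%s>\n')
printf '%s\n' "$framed"

rm -f "$sample"
```

Because the matches are passed as arguments to %s rather than as the format itself, backslashes or percent signs inside a match are printed literally.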

answered Nov 23 '20 at 13:23

tripleee


Perfect, sometimes it is good to remember that piping is key: instead of doing everything on one baroque command, let's have every piece do its part. Thanks! – fedorqui Nov 23 '20 at 13:51

score 1 · Answer 5 · 2020-11-24T11:58:34.403


GNU grep -z terminates input/output records with null characters (useful in conjunction with other tools such as sort -z). pcregrep will not do that:

pcregrep -Mo2 '(?s)X(.)(.+?)\1' file

-o2 (the -onumber option) prints only the second capture group, which replaces the lookarounds. The lazy quantifier +? is used in case the marker captured by \1 occurs again later.

edited Nov 24 '20 at 11:58

answered Nov 24 '20 at 03:09
