I have a file on this form:
X/this is the first match/blabla
X-this is
the second match-
and here we have some fluff.
And I want to extract everything that appears after "X" and between the same markers. So if I have "X+match+", I want to get "match", because it appears after "X" and between the marker "+".
So for the given sample file I would like to have this output:
this is the first match
and then
this is
the second match
I managed to get all the content between X followed by a marker by using:
grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
That is:
grep -Po '(?<=X(.))(.|\n)+(?=\1)'
to match X followed by(something)
that gets captured and matched at the end with(?=\1)
(I based the code on my answer here).- Note I use
(.|\n)
to match anything, including a new line, and that I also use-z
in grep to match new lines as well.
So this works well, the only problem comes from the display of the output:
$ grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
this is the first matchthis is
the second match
As you can see, all the matches appear together, with "this is the first match" being followed by "this is the second match" with no separator at all. I know this comes from the usage of "-z", that treats all the file as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline (quoting "man grep").
So: is there a way to get all these results separately?
I tried also in GNU Awk:
awk 'match($0, /X(.)(\n|.*)\1/, a) {print a[1]}' file
but not even the (\n|.*)
worked.