This is a follow-up to my previous question, "How can I use a look after to match either a single or a double quote?". Say we have a Python file with this content:
hello.this_is("bla bla bla")
some random text
hello.this_is('hello hello') hello.this_is("bla")
hello.this_is("1hello") hello.this_is("2hello")
other stuff
I want to extract all the quoted strings inside hello.this_is(...), whether they use single or double quotes, so that my output looks like:
bla bla bla
hello hello
bla
1hello
2hello
From the answers to the other question, I ended up with this expression:
grep -Po '(?<=hello\.this_is\((["'\''])).*(?=\1)' file
This works well... unless there are two occurrences of hello.this_is() on the same line that use the same kind of quotes. In that case, the output is:
bla bla bla
hello hello
bla
1hello") hello.this_is("2hello # this is wrong
That is, .*(?=\1) is greedy, so it matches up to the last occurrence of the captured group on the line instead of stopping at the next one and producing two separate matches.
How can I change the regex in grep so that it matches everything up to the next occurrence of the captured group, in a non-greedy way?
As a very simplified example, suppose we want to get all the numbers between the "hello" markers in "hello23hello24hello25". Ideally I would get "23" and "24", but instead I get:
$ grep -Po '(?<=(hello)).+(?=\1)' <<< "hello23hello24hello25"
23hello24
As a more representative example, let's say the capture group matches either "hello" or "hillo":
$ grep -Po '(?<=(hello|hillo)).+(?=\1)' <<< "hello23hello24hello25hillo26hillo27"
23hello24
26
The desired output would instead be "23", "24" and "26".
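For reference, one direction that seems to fit the desired output is the lazy form of the quantifier (.+? instead of .+), which PCRE supports. The following is only a sketch, assuming GNU grep built with -P support; between_markers is a hypothetical helper name used here for illustration:

```shell
# Hypothetical helper: print each run of text that sits between two
# identical markers, where the marker is either "hello" or "hillo".
# Assumes GNU grep built with PCRE support (the -P option).
between_markers() {
    # .+? is the lazy form of .+ : it stops at the first position where
    # the lookahead (?=\1) succeeds, rather than at the last one.
    printf '%s\n' "$1" | grep -Po '(?<=(hello|hillo)).+?(?=\1)'
}

between_markers "hello23hello24hello25"
between_markers "hello23hello24hello25hillo26hillo27"
```

With this change the first call should print 23 and 24, and the second 23, 24 and 26, each on its own line. The same substitution (.*? for .*) would presumably apply to the quote-matching expression as well.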
I was thinking of replacing the .+ pattern with something like [^\1]+ so that it matches everything but the captured group... but it does not work.
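A likely reason is that a backreference has no meaning inside a character class: there PCRE reads \1 as an octal escape for the character with code 1, so [^\1]+ behaves almost like .+ and stays greedy. A pattern that does express "any character, as long as the captured text does not start here" is the tempered form (?:(?!\1).)+. The sketch below compares the two, assuming GNU grep built with -P support; tempered is a hypothetical helper name:

```shell
# Assumes GNU grep built with PCRE support (the -P option).
line="hello23hello24hello25hillo26hillo27"

# [^\1] is an ordinary negated class here (anything except the character
# with code 1), so the match is still greedy.
printf '%s\n' "$line" | grep -Po '(?<=(hello|hillo))[^\1]+(?=\1)'

# Tempered form: (?!\1). consumes one character only if the captured
# marker does not start at that position.
tempered() {
    printf '%s\n' "$1" | grep -Po '(?<=(hello|hillo))(?:(?!\1).)+(?=\1)'
}
tempered "$line"
```

The first command should reproduce the unwanted output (23hello24 and 26), while tempered should print 23, 24 and 26.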