
This is a follow-up to my previous question, "How can I use a look after to match either a single or a double quote?". Say we have a Python file with this content:

hello.this_is("bla bla bla")
some random text
hello.this_is('hello hello')  hello.this_is("bla")
hello.this_is("1hello")  hello.this_is("2hello")
other stuff

I want to extract every quoted string passed to hello.this_is(, whether it uses single or double quotes, so that my output is:

bla bla bla
hello hello
bla
1hello
2hello

From the answers to the other question, I ended up with this expression:

grep -Po '(?<=hello\.this_is\((["'\''])).*(?=\1)' file

This works well, unless hello.this_is() appears twice on the same line with the same kind of quotes. In that case, the output is:

bla bla bla
hello hello
bla
1hello")  hello.this_is("2hello    # this is wrong

That is, .* is greedy, so .*(?=\1) matches up to the last occurrence of the captured group on the line instead of stopping at the next one, and the two calls come back as a single match.

How can I change the regex in grep so that it matches everything up to the next occurrence of the captured group, in a non-greedy way?

As a very simplified example, suppose we want to get all the numbers between the "hello" text in "hello23hello24hello25". Ideally I would get "23" and "24", but instead I get:

$ grep -Po '(?<=(hello)).+(?=\1)' <<< "hello23hello24hello25"
23hello24

As a more representative example, say the capture group matches either "hello" or "hillo":

$ grep -Po '(?<=(hello|hillo)).+(?=\1)' <<< "hello23hello24hello25hillo26hillo27"
23hello24
26

While the desired output would be "23", "24", "26".

I was thinking of replacing the .+ pattern with something like [^\1]+ so it matches everything but the captured group... but it does not work.

fedorqui
  • https://regex101.com/r/NW98tx/1 – Wiktor Stribiżew Sep 06 '19 at 09:16
  • Oh man, all of this for just a `?`. Thanks @Wiktor, I thought I had already addressed the greediness while researching, but now I notice I hadn't entirely. – fedorqui Sep 06 '19 at 09:20
  • Yes, it is the most frequent regex question on SO. Also, `[^\1]+` does not negate the Group 1 value; to match any text that does not contain the start of the Group 1 value, you may use `(?:(?!\1).)+`. – Wiktor Stribiżew Sep 06 '19 at 09:21
  • @fedorqui You can parse the python file into an ast. This is a nice blog article about this: https://suhas.org/function-call-ast-python/ – hek2mgl Sep 06 '19 at 21:22
  • @hek2mgl that's a nice article! I have worked a bit with ast and it is great to go beyond `json.loads()` for strange parsing cases. Thanks for sharing. – fedorqui Sep 10 '19 at 14:58
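
For reference, here is a minimal sketch of the two suggestions from the comments: the lazy `?` added to the quantifier, and the `(?:(?!\1).)+` construct. It assumes GNU grep built with PCRE support (the `-P` option used in the question). The temporary file and the variable names (`sample`, `input`, `lazy_file`, `lazy_simple`, `tempered`) are only illustrative and are not part of the original question.

```shell
# Illustrative sample file holding the content from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
hello.this_is("bla bla bla")
some random text
hello.this_is('hello hello')  hello.this_is("bla")
hello.this_is("1hello")  hello.this_is("2hello")
other stuff
EOF

# Lazy quantifier: .*? stops at the first position where (?=\1) succeeds,
# so two calls on the same line are reported as two separate matches.
lazy_file=$(grep -Po '(?<=hello\.this_is\((["'\''])).*?(?=\1)' "$sample")

# The simplified hello/hillo example, with .+ changed to .+?
input='hello23hello24hello25hillo26hillo27'
lazy_simple=$(printf '%s\n' "$input" | grep -Po '(?<=(hello|hillo)).+?(?=\1)')

# Pattern from the comments: consume one character at a time, but only
# while the Group 1 text does not start at the current position.
tempered=$(printf '%s\n' "$input" | grep -Po '(?<=(hello|hillo))(?:(?!\1).)+(?=\1)')

printf '%s\n' "$lazy_file" "$lazy_simple" "$tempered"
rm -f "$sample"
```

On this sample data the first command should print the five desired strings, and both one-line variants should print 23, 24 and 26. The lazy form is the smaller change to the original expression. The `(?:(?!\1).)+` form states the "anything but the captured group" intent that `[^\1]+` was meant to express.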
