1

This question focuses on pcre-regular expression as used by grep -P.

Imagine I have a string abcRabcSxyxz and search for a substring which starts with abc and ends with x, but with the restriction that no shorter substring of this match would also also match.

My first attempt was a non-greedy regexp,

grep -Po 'abc.*?x' <<<abcRabcSxyxz

but this returns abcRabcSx, while I would like to find just abcSx. It is obvious why even my non-greedy attempt still provides a match which is too long; I need the regexp engine to try harder. My second attempt was

grep -Po '(?>abc.*?)x' <<<abcRabcSxyxz

which did not provide a match at all (maybe I don't really understand the usage of ($?...) explained here).

Any easy solution for my problem anyone?

UPDATE I see from the comments that my example does not precisely explain what i am searching for, so here a more general description:

I am searching for matches of the form PXQ, wher P, X and Q are arbitrary patterns, and X should not contain a match of P. Plus, I don't want to literally retype the pattern P inside X.

For instance

`[(][^(]*[)]`

would be a possible (but not satsifying) solution for the concrete case that I am searching for a parenthesized expression which does not contain another parenthesized (here, P is [(], X is an arbitrary string, and Q is [)]), but even this example shows that I have to literally repeat the information contained in P, when specifying the middle part ([^(]*), to make sure that my P is not contained there). I am looking for a way which makes this explicit repetition unnecessary.

user1934428
  • 19,864
  • 7
  • 42
  • 87
  • Maybe with a negative lookahead: `abc(.(?!abc))*?x` (although not sure how this will perform). – sp00m Jun 18 '20 at 11:42
  • This would work: `grep -Po 'abc.*\Kabc.*?x' << – Felix Kling Jun 18 '20 at 11:44
  • What if the string is something like `abc1xabc123x`? Or is that not possible? – Felix Kling Jun 18 '20 at 11:47
  • @FelixKling : This would work for the concrete exampl I gave, but would not work for instance for the slightly modified `grep -Po 'abc.*\Kabc.*?x' << – user1934428 Jun 18 '20 at 11:48
  • @FelixKling : Yes. I think I will update my question to make the matching criterion more clear. – user1934428 Jun 18 '20 at 11:49
  • The prefix would have to be made optional to match your new example: `(abc.*)?\Kabc.*?x` (still don't know whether that's the right way to go about it). – Felix Kling Jun 18 '20 at 11:49
  • See https://ideone.com/WYKWqa – Wiktor Stribiżew Jun 18 '20 at 11:54
  • @FelixKling : Your new solution, and also the one linked to by Wiktor Stribizew, and by the moderators you decided that my question is a duplicate (and this was indeed the case before I updated it and provided more details), let me conclude that there seems to be indeed no good solution for my problem. – user1934428 Jun 18 '20 at 12:02
  • 1
    Recursive patterns might be what you are looking for but how to integrate "not P" into "X" might depend on what "X" is in the end. https://www.pcre.org/current/doc/html/pcre2pattern.html#SEC25 . E.g. `([(])[^(?1)]+[)]` for your parenthesis example. – Felix Kling Jun 18 '20 at 12:11
  • @FelixKling : I like the idea, but it means that the middle part must say something like: _Zero or more items which do **not** match the first group, what ever it is_. I don't see how to express this general _do no match_. I tried `'(abc)(?!?1)*?x'`, on the grounds that `(?!...)` says "do not match this", but get an error '_nothing to repeat_'. It seems that the `?1` does not pick up the `(abc)`. – user1934428 Jun 18 '20 at 12:23
  • You need `()` around the reference: `(?1)`. This seems to work: `(abc).(?!(?1))*?x`. – Felix Kling Jun 18 '20 at 12:56
  • This questions seems to address your problem: https://stackoverflow.com/questions/27385942/why-does-a-simple-non-greedy-regex-greedily-include-additional-characters-be . Instead of repeating the pattern you can use the recursive reference. – Felix Kling Jun 18 '20 at 12:58
  • @FelixKling : The solution there is pretty much the same as I tried in my last comment (except the additonal use of a non-capturing group `(?:...)`), but anyway, adapting the example `HOHO(?:(?!HOHO).)*?_HO_` in that article to my case, would give `grep -Po '(abc)(?:(?!(?1).))*?x' << – user1934428 Jun 18 '20 at 14:32
  • Mmh, it works when I put the `.` before negative lookahead... strange. – Felix Kling Jun 18 '20 at 14:59
  • 1
    @user1934428: Your adaptation is wrong, the correct pattern will be: `grep -Po '(abc)(?:(?!(?1)).)*?x'`. Also, Javascript doesn't use the pcre regex engine and doesn't have the `(?n)` feature to refer to a subpattern. To do it in javascript, you have to rewrite the full subpattern: `/abc(?:(?!abc).)*?x/` (but nothing forbids to build your pattern dynamically using the `RegExp` constructor. – Casimir et Hippolyte Jun 18 '20 at 16:56
  • @CasimiretHippolyte : Thanks for explaining. Indeed, this works! I was refering to Javascript only because the thread refered to by Wiktor Stribitzew discussed a Javascript solution. In case enough moderators vote to reopen my question, could you propose your solution as an answer, and I will accept it. – user1934428 Jun 19 '20 at 06:05
  • @CasimiretHippolyte : As the question is open for answers, may I ask you to add your solution as an answer? – user1934428 Jun 19 '20 at 11:41

1 Answers1

1

Interesting question. Much of this having been worked out in comments, thanks Casimir et Hippolyte, Felix Kling, and user1934428.

The solution uses PCRE and is as follows:

grep -Po '(abc)(?:(?!(?1)).)*?x' <<< abcRabcSxyxz

We know the result will start with "abc" and end in "x". So let us wall through how this result works.

  • We group the expected output (abc) to start.
  • A ( followed by ?: prevents the subpattern from capturing or counted.
  • Next up is a negative look ahead assertions (?!.
  • The subject of the look ahead is matched pattern 1 (in this case abc).
  • The . matches any character, in this case matching the S.
  • Ending the group with )*?, an un-greedy, matching few as zero characters.
  • The final entry is the x, which the question designated as the ending character.
James Risner
  • 5,451
  • 11
  • 25
  • 47