1

I have a problem closely related to this question: Regex find match within a string

In that case, the goal is to match "Warner Music Group" rather than "XYZ becomes Chief Digital Officer and EVP, Business Development of Warner Music Group" in the string

Ole Abraham  of XYZ becomes Chief Digital Officer and EVP, Business Development of Warner Music Group.

which is solved using .*\bof\s+([^.]+)

Now I have a very similar problem, except that I want all matches, and the previous solution returns only one. Here is my basic setup with the solution above: https://regex101.com/r/bIbFaW/1

The problem is that for the string

This is a test with a string with punctuation, and an end. Then test words, and more text. And here whith more text with more punctuation, like that.

the pattern `.*\bwith(.*?),` only returns "more punctuation" (a good match) and misses the earlier candidate "punctuation" from the first sentence.

Is it possible to do this, or should I approach it differently? For example, `with(.*?),` gets all matches, but they are the longer options ("a string with punctuation" instead of "punctuation"). I could then search for matches within my matches, but in my current setup that adds unrelated overhead, which I would like to avoid if possible.
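
To make the two behaviours concrete, here is a minimal JavaScript sketch (an illustration only, not the regex101 setup linked above; the helper name `captures` is hypothetical). It runs both patterns on the test string and collects the first capture group of every match:

```javascript
// Test string from the question.
const text =
  "This is a test with a string with punctuation, and an end. " +
  "Then test words, and more text. " +
  "And here whith more text with more punctuation, like that.";

// Hypothetical helper: collect capture group 1 of every match, trimmed.
function captures(regex, input) {
  return Array.from(input.matchAll(regex), (m) => m[1].trim());
}

// Greedy prefix: `.*` consumes as much as possible, so only the last "with" is found.
const greedy = captures(/.*\bwith(.*?),/g, text);

// No prefix: each match starts at the leftmost available "with", giving longer captures.
const leftmost = captures(/with(.*?),/g, text);

console.log(greedy);   // [ 'more punctuation' ]
console.log(leftmost); // [ 'a string with punctuation', 'more punctuation' ]
```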

example text, with colours highlighting different parts of the string

Pablo
  • Perhaps like this `\bwith\b((?:(?!\bwith\b)[^,])*),` https://regex101.com/r/5Ydb6h/1 – The fourth bird Jul 07 '23 at 14:23
  • Your first `.*` is greedy. So you could just replace it with `/.*?\bwith (.*?),/gm` : https://regex101.com/r/bIbFaW/2 – Patrick Janser Jul 07 '23 at 14:24
  • The fourth bird this seems to work, it's a bit intimidating but so far it seems to hold to some quick testing. Feel free to write a solution :) – Pablo Jul 07 '23 at 14:29

2 Answers

2

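You can avoid matching a comma by using a negated character class [^,], and avoid running past a second "with" by using a tempered greedy token: after matching the word "with", match any character other than a comma, one at a time, as long as the word "with" does not start at that position.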

Then match the comma at the end.

\bwith\b((?:(?!\bwith\b)[^,])*),
  • \bwith\b Match the word with
  • ( Capture group 1
    • (?: Non capture group to repeat as a whole part
      • (?!\bwith\b)[^,] Match any char except a comma, provided the word "with" does not start at the current position
    • )* Close the non capture group and repeat it zero or more times
  • ) Close group 1
  • , Match a comma
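
As a quick check of the breakdown above, here is a minimal JavaScript sketch that applies the pattern to the test string from the question. The variable names are illustrative, and the `.trim()` call is an addition to drop the leading space that the capture group includes:

```javascript
// Test string from the question.
const text =
  "This is a test with a string with punctuation, and an end. " +
  "Then test words, and more text. " +
  "And here whith more text with more punctuation, like that.";

// Tempered greedy token: consume non-comma characters one at a time,
// but refuse to step onto the start of another word "with".
const tempered = /\bwith\b((?:(?!\bwith\b)[^,])*),/g;

// Collect capture group 1 of every match (trimmed for readability).
const found = Array.from(text.matchAll(tempered), (m) => m[1].trim());

console.log(found); // [ 'punctuation', 'more punctuation' ]
```

The attempt starting at the first "with" fails because the token stops before the second "with" and no comma follows at that point, so the engine moves on and matches from the second "with" instead.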

Regex demo

The fourth bird
1

What if your sentence ends with a dot, or with another punctuation character?

To keep it simple and easier to read, and to avoid regex lookaheads, which make the pattern run in more steps, I would propose this as a starting point:

/\bwith\s+([^,.]*)/g

V1: https://regex101.com/r/kUXu9z/1

  • \b matches a word boundary, to avoid matching the "with" at the end of a longer word such as "herewith".
  • with matches the word "with".
  • \s+ because we know that there should be at least one whitespace character, including possible line feeds.
  • ([^,.]*) or ([^,.]+) is the capturing group matching any chars which aren't commas or dots. But this list may not be enough as you could have "!", "?", ":", etc.
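
A minimal JavaScript sketch of V1, to illustrate the points above. The second sample sentence and the variable names are made up for this example:

```javascript
// V1 pattern: the word "with", whitespace, then anything up to a comma or dot.
const v1 = /\bwith\s+([^,.]*)/g;

// Test string from the question.
const text =
  "This is a test with a string with punctuation, and an end. " +
  "Then test words, and more text. " +
  "And here whith more text with more punctuation, like that.";
const v1Matches = Array.from(text.matchAll(v1), (m) => m[1]);
console.log(v1Matches); // [ 'a string with punctuation', 'more punctuation' ]

// Hypothetical sample showing what \b prevents: "herewith" ends in "with".
const sample = "I send herewith the form, with thanks.";
const withBoundary = Array.from(sample.matchAll(v1), (m) => m[1]);
const withoutBoundary = Array.from(
  sample.matchAll(/with\s+([^,.]*)/g),
  (m) => m[1]
);
console.log(withBoundary);    // [ 'thanks' ]
console.log(withoutBoundary); // [ 'the form', 'thanks' ]
```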

If we add some common punctuation, we get V2: https://regex101.com/r/kUXu9z/2

Using Unicode character classes (available in PHP and JavaScript; Python's built-in re module does not support \p{...}, but the third-party regex package does), we could use the punctuation class \p{P} or \p{Punctuation} and negate it with \P{P}+ in order to match all characters which are not punctuation:

/\bwith\s+(\P{P}+)/gu

V3: https://regex101.com/r/kUXu9z/3
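
To show what the Unicode class adds over a hand-written list, here is a small JavaScript sketch. The sample sentence and variable names are made up for illustration, and the `u` flag is required for `\P{P}` to be recognised:

```javascript
// Hypothetical sample: the phrase ends with "!" rather than a comma or dot.
const sample = "Play with fire! Then stop.";

// V1 only stops at commas and dots, so it runs past the "!".
const byList = Array.from(
  sample.matchAll(/\bwith\s+([^,.]*)/g),
  (m) => m[1]
);

// V3 stops at any Unicode punctuation character.
const byUnicode = Array.from(
  sample.matchAll(/\bwith\s+(\P{P}+)/gu),
  (m) => m[1]
);

console.log(byList);    // [ 'fire! Then stop' ]
console.log(byUnicode); // [ 'fire' ]
```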

Edit (as I had missed the problem of the multiple "with")

Sorry, I didn't properly read the question and missed the problem of a first occurrence of "with" that is followed by a second "with" before the comma.

In this case, we do indeed need a negative lookahead to avoid matching a string that contains "with":

/\bwith\s+(?!\P{P}*\bwith\b)(\P{P}+)/gu

I added (?!\P{P}*\bwith\b) after the space chars to check that we don't have some non-punctuation chars followed by the word "with".

V4: https://regex101.com/r/kUXu9z/4
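
A minimal JavaScript sketch of V4 on the test string from the question (variable names are illustrative):

```javascript
// Test string from the question.
const text =
  "This is a test with a string with punctuation, and an end. " +
  "Then test words, and more text. " +
  "And here whith more text with more punctuation, like that.";

// V4: after "with" and whitespace, reject the position if another word "with"
// appears before the next punctuation character; otherwise capture up to it.
const v4 = /\bwith\s+(?!\P{P}*\bwith\b)(\P{P}+)/gu;

const v4Matches = Array.from(text.matchAll(v4), (m) => m[1]);

console.log(v4Matches); // [ 'punctuation', 'more punctuation' ]
```

Unlike the tempered-token approach, the lookahead here runs once per candidate "with" rather than once per consumed character.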

Patrick Janser
  • Note this is not exactly what I need, since the first match is "with a string with punctuation" instead of "punctuation", the shorter option. – Pablo Jul 07 '23 at 15:27
  • @Pablo Ah, I see! Sorry, I didn't notice that the word "with" appeared twice and that the last one was the one to match. Indeed, in that case, we need to use a negative lookahead. – Patrick Janser Jul 07 '23 at 15:36