
Say I have the regex:

(CC|NP)*

As written, it causes problems inside lookbehind expressions in Java. How should I write it to avoid those problems? I thought of rewriting it as:

(CC|NP){1,9}

Testing on regexr, it seems the upper bound is ignored completely. In Java, those {} quantifiers seem to work only on non-group regex elements, as in:

\w+\[\S{1,9}\]
simpatico
  • The upper bound is not ignored. The `global` switch is turned on in your regexr. Turn it off, and you'll see that it works: http://regexr.com?31fak – Joseph Silber Jul 06 '12 at 15:30
  • Does this global option correspond to anything in Java? – simpatico Jul 06 '12 at 17:45
  • @simpatico The Java Matcher class does not offer an option to get all matches as an array. If you want to match globally, you have to iterate yourself. Besides, the String and Matcher classes offer replaceAll methods that use a pattern for global replacement. – Arne Jul 06 '12 at 20:10
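
The iteration mentioned in the last comment could look roughly like the following. This is only a sketch: the class name `FindAllDemo` and the helper `findAll` are made up for illustration, and the sample input is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Collects every non-overlapping match, which is roughly what a
    // "global" switch does in other regex tools.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {          // each call advances to the next match
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The spaces are not part of the pattern, so each tag matches on its own.
        System.out.println(findAll("(CC|NP){1,9}", "CC CC NP")); // prints [CC, CC, NP]
    }
}
```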

2 Answers


Sorry, lookbehind patterns usually have restrictions on the subpattern. See, for example, "Why doesn't finite repetition in lookbehind work in some flavors?", or search the web for "lookbehind pattern restrictions".

You could try writing out all fixed-length variants of the lookbehind pattern as alternatives, but there may be many of them.
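
As a rough illustration, the sketch below compares a bounded quantifier inside a Java lookbehind with the same lookbehind spelled out as alternatives. The target token `VB`, the bound `{1,2}`, and the names `BoundedLookbehindDemo` and `matchStarts` are assumptions chosen to keep the expansion short. They are not taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehindDemo {
    // Bounded repetition inside the lookbehind: one or two tags before "VB".
    static final String BOUNDED = "(?<=(?:CC|NP){1,2})VB";
    // The same lookbehind with every fixed-length variant written out.
    static final String EXPANDED = "(?<=CC|NP|CCCC|CCNP|NPCC|NPNP)VB";

    // Returns the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String input = "CCNPVB VB NPVB";
        // The middle "VB" follows a space, so the lookbehind rejects it.
        System.out.println(matchStarts(BOUNDED, input));  // prints [4, 12]
        System.out.println(matchStarts(EXPANDED, input)); // prints [4, 12]
    }
}
```

With a bound of 9 instead of 2, the expanded form grows far too large to write by hand, which is the "many" the answer warns about.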

You can also simulate the lookbehind by matching the inner pattern normally and capturing your actual target in a group: (?:CC|NP)*(.*)
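
A minimal sketch of that idea in Java, using the pattern above. The class name `PrefixGroupDemo`, the helper `afterPrefix`, and the sample inputs are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixGroupDemo {
    // Consume the (CC|NP)* prefix as ordinary matched text, then capture
    // the part we actually care about in group 1.
    private static final Pattern PREFIX_THEN_TARGET =
            Pattern.compile("(?:CC|NP)*(.*)");

    // Returns whatever follows the leading CC/NP run, or null if the input
    // does not match as a whole (for example, if it contains a line break).
    static String afterPrefix(String input) {
        Matcher m = PREFIX_THEN_TARGET.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(afterPrefix("CCNPCCrest")); // prints rest
        System.out.println(afterPrefix("rest"));       // prints rest
    }
}
```

Unlike a real lookbehind, the prefix is part of the overall match here, so only group 1 should be used as the result.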

Arne

I'm not sure where you perceive the problem. Quantifiers act on groups just as they do on any other element.

So, \w+\[\S{1,9}\] could have been written \w+\[(\S){1,9}\] with the same result.
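
A quick way to check this in Java is sketched below. The class name `GroupQuantifierDemo`, the helper `bothAgree`, and the sample inputs are made up for illustration.

```java
import java.util.regex.Pattern;

class GroupQuantifierDemo {
    // Quantifier applied directly to \S.
    static final String PLAIN = "\\w+\\[\\S{1,9}\\]";
    // Same quantifier applied to a group that wraps \S.
    static final String GROUPED = "\\w+\\[(\\S){1,9}\\]";

    // True when both patterns give the same whole-string verdict for input.
    static boolean bothAgree(String input) {
        return Pattern.matches(PLAIN, input) == Pattern.matches(GROUPED, input);
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches(PLAIN, "foo[abc]"));          // prints true
        System.out.println(Pattern.matches(GROUPED, "foo[abc]"));        // prints true
        // Ten characters between the brackets exceed the upper bound of 9.
        System.out.println(Pattern.matches(PLAIN, "foo[abcdefghij]"));   // prints false
        System.out.println(Pattern.matches(GROUPED, "foo[abcdefghij]")); // prints false
    }
}
```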

As for your example on regexr, nothing is broken there. It matches what it's supposed to.

(PUN|CC|NP){1,3} will greedily try to match any of the alternatives (in left-to-right priority). There will be no gaps in what it matches: it matches 1-3 consecutive occurrences of PUN, CC, or NP.

The sample string you provided has a space between the CCs. Since the regex contains no space, the space is not matched, and the only thing that matches is a single CC.

If you want to account for a space, it can be added to the grouping like this:
(?:(?:PUN|CC|NP)\s*){1,3}

If you want to allow spaces only between the alternatives, it can be done like this:
(?:PUN|CC|NP)(?:\s*(?:PUN|CC|NP)){0,2}
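
The difference between these variants can be seen in a small Java sketch. The class name `SpacedTagsDemo`, the helper `firstMatch`, and the sample input (with a trailing space) are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpacedTagsDemo {
    // Original form: no whitespace allowed anywhere.
    static final String NO_SPACES = "(PUN|CC|NP){1,3}";
    // Whitespace allowed after every tag, including the last one.
    static final String TRAILING_SPACES = "(?:(?:PUN|CC|NP)\\s*){1,3}";
    // Whitespace allowed only between tags.
    static final String INNER_SPACES = "(?:PUN|CC|NP)(?:\\s*(?:PUN|CC|NP)){0,2}";

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "CC CC NP ";
        // Brackets make the trailing space visible in the output.
        System.out.println("[" + firstMatch(NO_SPACES, input) + "]");       // prints [CC]
        System.out.println("[" + firstMatch(TRAILING_SPACES, input) + "]"); // prints [CC CC NP ]
        System.out.println("[" + firstMatch(INNER_SPACES, input) + "]");    // prints [CC CC NP]
    }
}
```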