3

I have a regular expression that captures a pattern A only if it the string contains a pattern B somewhere before A.

Let's say, for the sake of simplicity, that A is \b\d{3}\b (i.e. three digits) and B is the word "foo".

Therefore the Regex I have is (?<=\b(?:foo)\b.*?)(?<A>\b\d{3}\b).

(?<=               # look-behind
    \b(?:foo)\b    # pattern B
    .*?            # variable length
)
(?<A>\b\d{3}\b)    # pattern A

For example, for the string

"foo text 111, 222 and not bar something 333 but foo 444 and better 555"

it captures

(111, 222, 333, 444, 555)

I got a new requirement and now I have to exclude the captures that are preceded by pattern C, lets say that C is the word "bar". What I want to build is a regex that expresses

(?<=               # look-behind
    \b(?:foo)\b    # pattern B
    ???????????    # anything that does not contains pattern C
)
(?<A>\b\d{3}\b)    # pattern A

So, in the example string I will have to capture

(111, 222, 444, 555)

Of course something like (?<=\b(?:foo)\b.*?)(?<!\b(?:bar)\b.*?)(?<A>\b\d{3}\b)

(?<=               # look-behind
    \b(?:foo)\b    # pattern B
    .*?
)
(?<!               # negative look-behind
    \b(?:bar)\b    # pattern C
    .*?
)
(?<A>\b\d{3}\b)    # pattern A

will not work as it will exclude everything after the first appearance of "bar" and the capture will be

(111, 222)

The regex (?<=\b(?:foo)\b(?!.*?(?:\bbar\b)).*?)(?<A>\b\d{3}\b)

(?<=                     # look-behind
    \b(?:foo)\b          # pattern B
    (?!                  # negative lookahead
        .*?              # variable lenght
        (?:\bbar\b)      # pattern C
    )
    .*?                  # variable lenght
)
(?<A>\b\d{3}\b)          # pattern A

will not work either because for the first "foo" in my test string, it will always find the "bar" as a suffix and it will only capture

(444, 55)

So far, using Conditional Matching of Expressions and (now) knowing that while inside a lookbehind, .net matches and captures from the right to the left, I was able to create the following regex (?<=(?(C)(?!)| (?:\bfoo\b))(?:(?<!\bbar)\s|(?<C>\bbar\s)|[^\s])*)(?<A>\b\d{3}\b)

(?<=                     # look-behind
    (?(C)                # if capture group C is not empty
        (?!)             # fail (pattern C was found)
        |                # else
        (?:\bfoo\b)      # pattern B
    )
    (?:
        (?<!\bbar)\s     # space not preceeded by pattern C (consume the space)
        |
        (?<C>\bbar\s)    # pattern C followed by space (capture in capture group C)
        |
        [^\s]            # anything but space (just consume)
    )*                   # repeat as needed
)
(?<A>\b\d{3}\b)          # pattern A

which works but is too complex as the patters A, B and C are a lot more complex that the examples I have posted here.

Is it possible to simplify this regex? Maybe using balancing groups?

Community
  • 1
  • 1
Edgar Hernandez
  • 4,020
  • 1
  • 24
  • 27
  • So you don't want the neg. lookbehind to skip over digits [like this](http://goo.gl/ppsp2T)? Also would trigger the lookbehinds at `\b` – bobble bubble Jan 15 '16 at 15:06
  • In this case the digits negation works as the example pattern A is composed only of digits, but my real pattern is more complex than that. For example, changing the pattern A to "abc" like [this](http://goo.gl/bu909s) does not work with your solution. – Edgar Hernandez Jan 15 '16 at 15:15
  • 1
    You could avoid skipping over by use of an [additional lookahead like this](http://goo.gl/JOOUb2) (bad performance). – bobble bubble Jan 15 '16 at 15:38

3 Answers3

3

You can use a pattern based on the \G anchor that matches the position after the previous match:

(?:\G(?!\A)|\bfoo\b)(?:(?!\b(?:bar|\d{3})\b).)*(\d{3})

demo

details:

(?:
    \G(?!\A) # contiguous to a previous match and not at the start of the string
  |        # OR
    \bfoo\b  # foo: the condition for the first match
)
(?:(?!\b(?:bar|\d{3})\b).)* # all that is not "bar" or a 3 digit number (*)
(\d{3})

(*) Note that if you can use a better subpattern (i.e. that doesn't test each characters with a lookahead containing an alternation) for your real situation, don't hesitate to change it. (for example, something based on character classes: [^b\d]*(?>(?:\B[b\d]+|b(?!ar\b)|\d(?!\d\d\b))[^b\d]*)*)


An other way: Since .net regex engine is able to store repeated captures, you can write this too:

\bfoo\b(?:(?:(?!\b(?:bar|\d{3})\b).)*(\d{3}))+

But this time, you need to loop over each occurrence of foo to extract results in group 1. It's less handy but the pattern is faster since it doesn't start with an alternation.

Note that if "bar" and "\d{3}" starts and ends with word characters, you can write the pattern in a more efficient way:

\bfoo(?:\W+(?>(?!bar\b)\w+\W+)*?(\d{3}))+\b

Other way: split your string on "foo" and "bar" (preserve the delimiter), loop over each part. When the part is "foo" set a flag to true, when the part is "bar" set it to false, and when it isn't "foo" or "bar" extract the numbers if the flag is true.

Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
  • All options are great but the lookbehind version of Kobi is very similar to the current one I have and will not require that the code handles multiple captures for the same match. – Edgar Hernandez Jan 18 '16 at 10:05
  • 1
    @LimoWanKenobi: I think that using a lookbehind is a wrong approach since it needs to parse several times the same substrings. This approach is probably the slowest one of all the answers. *But if you find it eye candy...* – Casimir et Hippolyte Jan 18 '16 at 12:45
  • 1
    @LimoWanKenobi: If I were you, I will try the first pattern (the `\G` one) with a modified second branch able to *jump* quickly to *(ie: that matches all characters greedily until)* the next "foo". Balancing group way is interesting too, but need to be improved to not advance character by character. – Casimir et Hippolyte Jan 18 '16 at 12:50
  • 1
    In this case it was not the _eye candidness_ of the approach what sold me, but the similarity to the current regex (which is using a look-behind already and the code using it is expecting a single capture for each match). Definitely the performance of the look-behind/look-ahead approach is a concern and I will try to work on it using your suggestions in the near future. – Edgar Hernandez Jan 18 '16 at 12:58
  • As for performance, I don't know if it is better to have a lot of short matches. `[^b\d]*` and unfolding-the-loop tend to be faster, but I don't know if you can generalize it when it comes to lookaheads. There are also obvious trade-offs between performance, readability, length, duplicated groups inside the pattern, etc. If performance is important a complicated regex usually isn't the right tool: just like you've suggested in the last paragraph. – Kobi Jan 18 '16 at 13:26
  • By the way, your first pattern can be shorten to `foo|\G(?!\A)(?:(\b\d{3}\b)|(?!bar).)` , though you probably don't like matching character by character. – Kobi Jan 18 '16 at 13:27
  • @Kobi: I don't believe theories about *"simple"*/*"complicated"* patterns, *"two short patterns are better than one long pattern"*, *"complicated patterns are difficult to maintain"* (<- wrong since when the needs change, the pattern must be entirely rewritten and not corrected), etc.. About unfolding-the-loop with lookahead tests, it stays performant as long as lookaheads fail quickly. The main weakness of this subpattern isn't the use of lookaheads but the number of alternatives that may slow down the research in a string with a lot of "b" and digits. – Casimir et Hippolyte Jan 18 '16 at 14:15
  • @Kobi: no need to shorten a pattern if you obtain something slower. The original version of the `\G` pattern advances character by character too with an horrible lookahead before, it isn't obviously the best way, but it's a general answer (OP doesn't search "foo" or "bar", it's an example) that can be easily adapted to other needs. To search "foo" and avoid "bar" I will use something like: `\b(?:\G(?!\A)|foo\b)\W+(?:(?!bar\b)\w+\W+)*?(\d{3})\b` that is faster *(and why not with something to reach the next "foo")* – Casimir et Hippolyte Jan 18 '16 at 14:29
2

One simple option is very similar to Casimir et Hippolyte's second pattern:

foo(?>(?<A>\b\d{3}\b)|(?!bar).)+
  • Start with foo
  • (?>|(?!bar).)+ - Stop matching if you've seen bar.
  • (?<A>\b\d{3}\b) and capture all A's that you see along the way.
  • Atomic group (?>) isn't necessary in this case, backtracking wouldn't mess this up either way.

Working example

Similarly, it can be converted to a lookbehind:

(?<=foo(?:(?!bar).)*?)(?<A>\b\d{3}\b)

This has the benefit of matching only the numbers. The lookbehind asserts there is a foo before A, but there isn't an bar.
Working example

Both of these assume B and C are somewhat simple.

Kobi
  • 135,331
  • 41
  • 252
  • 292
2

Since you've asked, it is possible with balancing groups, but probably not needed.

\A                    # Match from the start of the string
(?>                   # Atomic group. no backsies.
    (?<B>(?<-B>)?foo)            # If we see "foo", push it to stack B.
                                 # (?<-B>)? ensures B only has one item - if there are two,
                                 # one is popped.
    |(?<-B>bar)                  # When we see a bar, reset the foo.
    |(?(B)(?<A>\b\d{3}\b)|(?!))  # If foo is set, we are allowed to capture A.
    |.                           # Else, just advance by one character.
)+
\z                    # Match until the end of the string.

Working example

If we wanted to be extra clever (which we probably don't), we can combine most branches into the conditional:

\A
(?>
  (?(B)
    (?:(?<A>\b\d{3}\b)|(?<-B>bar))
    | # else
    (?<B>foo)
  )
  |.
)+
\z

Working example

Again, it is possible, but balancing groups are not the best option here, mainly because we are not balancing anything, just checking if a flag is set or not.

Kobi
  • 135,331
  • 41
  • 252
  • 292