
I'm trying to create a regex that matches [[xyz|asd]] but not [[xyz]] in the following text:

'''Diversity Day'''" is the second episode of the [[The Office (U.S. season 1)]|first season]] of the American [[comedy]] [[television program|television series]] ''[[The Office (U.S. TV series)|The Office]]'', and the show's second episode overall. Written by [[B. J. Novak]] and directed by [[Ken Kwapis]], it first aired in the United States on March 29, 2005, on [[NBC]]. The episode guest stars ''Office'' consulting producer [[Larry Wilmore]] as [[List_of_characters_from_The_Office_(US)#Mr._Brown|Mr. Brown]].

The following results should be captured:

[[The Office (U.S. season 1)]|first season]] <-- note the "]" before "|": in this case "]" is a literal character, not part of the closing "]]"
[[television program|television series]]
[[The Office (U.S. TV series)|The Office]]
[[List_of_characters_from_The_Office_(US)#Mr._Brown|Mr. Brown]]

The pattern I was trying to use is:

\[\[([^|]+)\|([^|]+)\]\]

but I can't figure out how to exclude both "|" and "]]" from the groups. [^|(]])] won't work because it excludes only the single character "]", not the sequence "]]" (it needs to be treated as a whole word).

Please help, thanks!


1 Answer


You may rely on a tempered greedy token here:

\[\[((?:(?!]]).)*)\|((?:(?!]]).)*)]]

See the regex demo

Details:

  • \[\[ - 2 [ symbols
  • ((?:(?!]]).)*) - Group 1 (note that the * can be turned into a lazy *? here, which helps especially if the first parts are shorter than the second parts) capturing:
    • (?:(?!]]).)* - zero or more sequences of
      • . - any char (except a newline; use the pattern with RegexOptions.Singleline if your strings span multiple lines)...
      • (?!]]) - that does not start a ]] sequence (i.e. the . will not match a ] that is followed by another ])
  • \| - a literal |
  • ((?:(?!]]).)*) - Group 2 capturing the same subpattern as Group 1
  • ]] - 2 literal ] symbols at the end.
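
As a quick check of the breakdown above, here is a minimal sketch in Python rather than .NET (an assumption on my part; the pattern syntax is the same in both, and Python's counterpart of RegexOptions.Singleline is re.DOTALL). The sample text is a shortened copy of the string from the question.

```python
import re

# Tempered greedy token: each "." is consumed only if it does not start "]]".
TEMPERED = re.compile(r"\[\[((?:(?!]]).)*)\|((?:(?!]]).)*)]]")

# Shortened sample from the question: one link with a literal "]" before "|",
# one plain [[comedy]] link that must be skipped, and two ordinary piped links.
text = (
    "the [[The Office (U.S. season 1)]|first season]] of the American "
    "[[comedy]] [[television program|television series]] "
    "''[[The Office (U.S. TV series)|The Office]]''"
)

# findall returns one (group 1, group 2) tuple per match.
pairs = TEMPERED.findall(text)
for target, label in pairs:
    print(repr(target), "->", repr(label))
```

The plain [[comedy]] link is skipped because Group 1 cannot step over its closing ]] to reach a | further along in the text.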

A much more efficient "unrolled" version of this regex is:

\[\[([^]|]*(?:](?!])[^]|]*)*)\|([^]]*(?:](?!])[^]]*)*)]]

See the regex demo. This regex will treat the first | as the inner field separator. See my other answer about how to unroll tempered greedy tokens.
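
To illustrate the "first |" remark, the sketch below runs both patterns in Python (again an assumption; the answer targets .NET) on a made-up string that contains a link with two pipes. The sample string and variable names are hypothetical and only serve the comparison.

```python
import re

# Both patterns copied from the answer above.
TEMPERED = re.compile(r"\[\[((?:(?!]]).)*)\|((?:(?!]]).)*)]]")
UNROLLED = re.compile(r"\[\[([^]|]*(?:](?!])[^]|]*)*)\|([^]]*(?:](?!])[^]]*)*)]]")

# Hypothetical sample: a link with two pipes, a link with a literal "]"
# before the pipe, and a plain link that neither pattern should match.
sample = "[[a|b|c]] and [[x]|y]] but not [[z]]"

tempered_pairs = TEMPERED.findall(sample)
unrolled_pairs = UNROLLED.findall(sample)

print("tempered:", tempered_pairs)
print("unrolled:", unrolled_pairs)
```

With the greedy tempered token, Group 1 runs as far as it can and backtracks to the last |, so [[a|b|c]] is split as "a|b" and "c". The unrolled pattern excludes | from Group 1, so the same link is split at the first | as "a" and "b|c". Both patterns agree on [[x]|y]] and both skip [[z]].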


  • WOW! that's exactly what i needed. That was fast! thank you!! – lucas Oct 18 '16 at 20:20
  • I do not want to complicate the pattern further, since I think the strings you deal with are not that long. If they are, you may consider unrolling the tempered greedy tokens as described in [my other answer](http://stackoverflow.com/a/37343088/3832970). – Wiktor Stribiżew Oct 18 '16 at 20:23
  • they're just short strings, so it'll be ok.. thank you very much! – lucas Oct 18 '16 at 20:26
  • Even with a short string, this pattern may produce a catastrophic backtracking if the sequence `]]` is not found. The unrolled design is more appropriate here in particular to exclude the pipe in the first part. – Casimir et Hippolyte Oct 18 '16 at 20:35
  • @CasimiretHippolyte: Note that it is .NET that is much more efficient than PCRE when it comes to backtracking. When `(?s)a(.*?)z` times out in PHP, the same regex works (just slowly) in .NET. Anyway, an unrolled version is `\[\[([^]]*(?:](?!])[^]]*)*)\|([^]]*(?:](?!])[^]]*)*)]]`, or even `\[\[([^]|]*(?:](?!])[^]|]*)*)\|([^]]*(?:](?!])[^]]*)*)]]` (to just get to the first `|` quicker). – Wiktor Stribiżew Oct 18 '16 at 20:37