Regex any characters except some

Question

I'm trying to create a regex to catch [[xyz|asd]], but not [[xyz]], in the text:

'''Diversity Day'''" is the second episode of the [[The Office (U.S. season 1)]|first season]] of the American [[comedy]] [[television program|television series]] ''[[The Office (U.S. TV series)|The Office]]'', and the show's second episode overall. Written by [[B. J. Novak]] and directed by [[Ken Kwapis]], it first aired in the United States on March 29, 2005, on [[NBC]]. The episode guest stars ''Office'' consulting producer [[Larry Wilmore]] as [[List_of_characters_from_The_Office_(US)#Mr._Brown|Mr. Brown]].

The following results should be captured:

[[The Office (U.S. season 1)]|first season]] <-- note the "]" before "|": in that case "]" is a literal character, not part of a closing "]]"
[[television program|television series]]
[[The Office (U.S. TV series)|The Office]]
[[List_of_characters_from_The_Office_(US)#Mr._Brown|Mr. Brown]]

What I was trying to use is:

\[\[([^|]+)\|([^|]+)\]\]

but I can't figure out how to exclude both "|" and "]]" in the groups. [^|(]])] won't work because a character class excludes only the single character "]", not the two-character sequence "]]" (it needs to be the whole sequence).

Please help, thanks!

btw, in [[xyz|asd]], "xyz" and "asd" should be captured in two groups – lucas, Oct 18 '16 at 20:16
Please edit your post to format it, it's difficult to read and understand. — Casimir et Hippolyte, Oct 18 '16 at 20:17

score 6 · Accepted Answer · edited May 23 '17 at 12:00


You may rely on a tempered greedy token here:

\[\[((?:(?!]]).)*)\|((?:(?!]]).)*)]]

See the regex demo
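
As a quick check, here is a minimal Python sketch of the pattern above run against a shortened version of the sample text. Python is an assumption here (the question does not name a language), and the names `LINK` and `text` are only illustrative. The `]` characters are written as `\]`, which matches the same thing.

```python
import re

# Tempered greedy token: "." is consumed only where "]]" does not start.
LINK = re.compile(r"\[\[((?:(?!\]\]).)*)\|((?:(?!\]\]).)*)\]\]")

# Shortened version of the sample text from the question.
text = (
    "the second episode of the [[The Office (U.S. season 1)]|first season]] "
    "of the American [[comedy]] [[television program|television series]] "
    "''[[The Office (U.S. TV series)|The Office]]'', on [[NBC]]."
)

# findall returns one (group 1, group 2) tuple per match.
pairs = LINK.findall(text)
for target, label in pairs:
    print(f"{target!r} -> {label!r}")
# 'The Office (U.S. season 1)]' -> 'first season'
# 'television program' -> 'television series'
# 'The Office (U.S. TV series)' -> 'The Office'
```

Links without a pipe, such as [[comedy]] and [[NBC]], are skipped because the first group can never run past their closing ]].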

Details:

\[\[ - 2 [ symbols
((?:(?!]]).)*) - Group 1 (note the * can be turned into a lazy *? here especially if the first parts are shorter than the second parts) capturing:
- (?:(?!]]).)* - zero or more sequences of
  - . - any character except a newline (use the pattern with RegexOptions.Singleline if your strings span multiple lines)...
  - (?!]]) - that does not start a ]] sequence (i.e. the . does not match a ] that is followed by another ])
\| - a literal |
((?:(?!]]).)*) - Group 2 capturing the same subpattern as Group 1
]] - 2 literal ] on end.
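
Two points from the breakdown above can be seen in a small Python sketch (Python is an assumption; `re.DOTALL` plays the role of RegexOptions.Singleline there, and the names `TOKEN`, `greedy` and `lazy` are illustrative). Besides the speed aspect, making the first `*` lazy also changes which | acts as the separator when a link contains more than one.

```python
import re

# One step of the tempered greedy token: any char that does not start "]]".
TOKEN = r"(?:(?!\]\]).)"

greedy = re.compile(rf"\[\[({TOKEN}*)\|({TOKEN}*)\]\]")
lazy = re.compile(rf"\[\[({TOKEN}*?)\|({TOKEN}*)\]\]")

# With several pipes, greedy group 1 backtracks to the LAST "|",
# while lazy group 1 stops at the FIRST "|".
greedy_result = greedy.findall("[[a|b|c]]")  # [('a|b', 'c')]
lazy_result = lazy.findall("[[a|b|c]]")      # [('a', 'b|c')]

# "." does not match a newline unless DOTALL (Singleline in .NET) is set.
multiline = "[[first\nline|label]]"
without_dotall = greedy.findall(multiline)   # []
with_dotall = re.findall(greedy.pattern, multiline, re.DOTALL)
# [('first\nline', 'label')]

print(greedy_result, lazy_result, without_dotall, with_dotall)
```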

A much more efficient "unrolled" version of this regex is:

\[\[([^]|]*(?:](?!])[^]|]*)*)\|([^]]*(?:](?!])[^]]*)*)]]

See the regex demo. This regex will treat the first | as the inner field separator. See my other answer about how to unroll tempered greedy tokens.
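
A minimal Python sketch of the unrolled pattern, split over several lines for readability (Python and the name `UNROLLED` are assumptions; the `]` characters are escaped as `\]`, which is equivalent). It shows the literal ] before | still being accepted and the first | being used as the separator.

```python
import re

UNROLLED = re.compile(
    r"\[\["
    # Group 1: chars other than "]" and "|", plus any single "]"
    # that is not followed by another "]".
    r"([^\]|]*(?:\](?!\])[^\]|]*)*)"
    r"\|"
    # Group 2: same idea, but "|" is allowed here.
    r"([^\]]*(?:\](?!\])[^\]]*)*)"
    r"\]\]"
)

sample = "[[The Office (U.S. season 1)]|first season]] of the American [[comedy]]"
sample_result = UNROLLED.findall(sample)
# [('The Office (U.S. season 1)]', 'first season')]

# The first "|" is the field separator; later ones stay in group 2.
pipes_result = UNROLLED.findall("[[a|b|c]]")
# [('a', 'b|c')]

print(sample_result, pipes_result)
```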


answered Oct 18 '16 at 20:17

Wiktor Stribiżew


WOW! that's exactly what i needed. That was fast! thank you!! – lucas Oct 18 '16 at 20:20
I do not want to complicate the pattern further, since I think the strings you deal with are not that long. If they are, you may consider unrolling the tempered greedy tokens as described in [my other answer](http://stackoverflow.com/a/37343088/3832970). – Wiktor Stribiżew Oct 18 '16 at 20:23
they're just short strings, so it'll be ok.. thank you very much! – lucas Oct 18 '16 at 20:26
Even with a short string, this pattern may cause catastrophic backtracking if the sequence `]]` is not found. The unrolled design is more appropriate here, in particular to exclude the pipe in the first part. – Casimir et Hippolyte Oct 18 '16 at 20:35
@CasimiretHippolyte: Note that it is .NET that is much more efficient than PCRE when it comes to backtracking. When `(?s)a(.*?)z` times out in PHP, the same regex works (just slowly) in .NET. Anyway, an unrolled version is `\[\[([^]]*(?:](?!])[^]]*)*)\|([^]]*(?:](?!])[^]]*)*)]]`, or even `\[\[([^]|]*(?:](?!])[^]|]*)*)\|([^]]*(?:](?!])[^]]*)*)]]` (to just get to the first `|` quicker). – Wiktor Stribiżew Oct 18 '16 at 20:37
