1

Several sources linked below seem to indicate regex wasn't designed for inverse matching - why not?

Recently, while trying to put together an answer for a question about a regex to match everything that was left after a specific pattern, I encountered several issues that left me curious about the limitations of regex.

Suppose we have some string: a simple line of text. I have a regex [a-zA-Z]e that will match one letter, followed by an e. This matches 3 times, on le, ne, and te. What if I want to match everything except patterns that match the regex? Suppose I want to capture a simp, li, of, and xt., including spaces (line breaks optional.) I later learned this behavior is called inverse matching, and shortly after, that it's not something regex easily supports.

I've examined some resources, but couldn't find any concrete answer on why inverse matching isn't "good".

  • Negative lookaheads appear useful for determining if a matched string does not contain some specific string, and are in fact used in several answers as methods to achieve this behavior (or something similar) - but they seem designed to act as a way to disqualify matches, as opposed to capturing non-matching input.
  • Negative lookaheads apparently shouldn't try to do this and aren't good at it either, choosing to leave inverse matching to the language they're being used with.
  • My own attempt at inverse matching was pointed out to be situational and very fragile, and looks convoluted even to me. In the comments, Wiktor Stribizew mentioned that "[...] in Java, you can't write a regex that matches any text other than some multicharacter string. With capturing, something can be done, but it is inefficient[.]"
  • Capture groups (the other method I was considering) appear to have the potential to dramatically slow the regex in more than one language.

All of these seem to indicate regex wasn't designed for inverse pattern matching, but none of them are immediately obvious as to the reasoning behind that. Why wasn't regex designed with built-in ability to perform inverse pattern matching?

Nick Reed
  • 4,989
  • 4
  • 17
  • 37
  • 1
    A full answer would explain how regular expressions relate to automata theory. Voting to close as too broad. – tripleee Aug 23 '19 at 06:13

1 Answers1

2

While direct regex, as you pointed out, does not easily support the functionality you want, a regex split, does easily support this. Consider the following two scripts, first in Java and then in Python:

String input = "a simple line of text.";
String[] parts = input.split("[a-z]e");
System.out.println(Arrays.toString(parts));

This prints:

[a simp,  li,  of , xt.]

In Python, we can try something very similar:

inp = "a simple line of text."
parts = re.split(r'[a-z]e', inp)
print(parts)

This prints:

['a simp', ' li', ' of ', 'xt.']

The secret sauce which is missing in pure regex is that of parsing or iteration. A good programming language, such as the above, will expose an API which can iterate an input string, using a supplied pattern, and rollup the portions from the split pattern.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • That agrees with what I'd looked up, and I wasn't aware the part that was missing was parsing/iteration. Parsing looks fairly heavy to implement, as does iteration - is that why regex doesn't do it? I'm guessing it's a *big* step from seeing a string as a list of tokens to actually being able to walk through each one and perform actions on it. – Nick Reed Aug 23 '19 at 15:01
  • @NickReed Actually, IMO iterating a string is the easy part, and implementing a regex engine is the hard part, but I guess it's all a matter of opinion. – Tim Biegeleisen Aug 23 '19 at 16:31