0

I'm wondering how should I use regular expression atomic groups (non-capturing group) in sed. Atomic groups are very useful to avoid any denial of service attacks by gobbling up servers memory, what is called catastrophic backtracking, and also peformance wise atomic groups are very useful.

Found the following link , to turn off backtracking using re2 engine.

You can turn off backtracking altogether by using re2 engine (non-backtracking engine). My question is why we can't use the same approach in sed, if it is possible how can we define an atomic group or non capturing group in sed.

Thanks

hwnd
  • 69,796
  • 4
  • 95
  • 132
Tharanga Abeyseela
  • 3,255
  • 4
  • 33
  • 45

2 Answers2

2

You ask several questions. The answer to one has appeared before in SO. Non-matching parens are not provided by sed.

To answer another, you need only read re2's documentation closely. The mechanism used by branch-and-bound matchers like sed's (and also Perl, Python, Java, etc.) and the Deterministic Finite Automaton (DFA)-based matchers like re2 are intrinsically different. There is no cut operation in sed's recognizer that will do what you want.

Having said that, the re2 documentation omits its negatives. Compiling a DFA is much more work than converting a regex to the bytecode used internally by e.g. Perl's matcher. Thus Perl programs are not slowed down by regex compilation. In fact the re2 compiler can "blow up" on certain short regexes and produce a DFA of size exponential in the size of the regex. Thus the compiler takes exponential time to run, and the re2 technique moves the bad behavior from runtime to compilation.

I do agree with the re2 guys that in general it's much better to have the bad behavior depend on the regex than the input. It would be a better, safer world if all the embedded regex recognizers in programming languages used the re2 approach.

Finally, your questions seem to conflate size and run time. The DOS from a backtracking recognizer occurs because certain short inputs require exponential time in the input length to recognize (or reject). Since the regex can never capture more than the (short) input length, capturing or lack of capturing has no effect.

The other kind of DOS you may be thinking of is a user providing a monster input that the sed recognizer is forced to store internally because it has only noncapturing groups, even though the capture is never used. This is certainly a way to make a problem for at least some implementations of sed (hypothetically a smart implementation could determine that the capture is not needed and skip it; I don't think the GNU code does that), but it only happens if you allow huge inputs, which can normally be prevented by other means. Why are the no non-capturing groups in sed? It is historically a very old program going back to some of the very first Unix machines. People didn't worry about DOS in those days.

Community
  • 1
  • 1
Gene
  • 46,253
  • 4
  • 58
  • 96
0

You can avoid backtracking by using anchors and being more verbose in pretty much all regex engines. Also, the capturing and non-capturing groups have very small difference in their overheads, they all just keep the start and end offsets within the input. Capturing groups have the downside of polluting the back-reference namespace.

perreal
  • 94,503
  • 21
  • 155
  • 181