4

Demo here. The regex:

([^>]+)$

I want to match text at the end of a HTML snippet that is not contained in a tag (i.e., a trailing text node). The regex above seems like the simplest match, but the execution time seems to scale linearly with the length of the match-text (and has causes hangs in the wild when used in my browser extension). It's also equally slow for matching and non-matching text.

Why is this seemingly simple regex so bad?

(I also tried RegexBuddy but can't seem to get an explanation from it.)

Edit: Here's a snippet for testing the various regexes (click "Run" in the console area).
Edit 2: And a no-match test.

adam-p
  • 660
  • 6
  • 16
  • 1
    could you reverse the string and reverse the regex? – Daniel A. White Aug 11 '15 at 21:38
  • 1
    It might or might not help to compile the regex once, outside the function you call it: `var tagRX = new RegExp("([^>]+)$")`, instead of declaring it inline like `results = /([^>]+)$/.match(input)`. I've never had a slow regex so I am curios what your input is – Plato Aug 11 '15 at 21:39
  • 1
    `string.substr(this.lastIndexOf('>')+1)`? – Marc B Aug 11 '15 at 21:41
  • Since you’re writing a browser extension, is it possible to use the DOM? (e.g. if you’re matching this against some element’s `innerHTML`, check its `lastChild` for a `nodeType` of `Node.TEXT_NODE` (3) and get its `nodeValue`.) – Ry- Aug 11 '15 at 21:41
  • 1
    You may also want to read [this classic answer about matching html with regex](http://stackoverflow.com/a/1732454/1380669) – Plato Aug 11 '15 at 21:41
  • @Plato: probably repeated backtracking. since the regex is greedy, it's going to try and grab the entire string, then slowly back off as it finds/eliminates `>` in the string. – Marc B Aug 11 '15 at 21:42
  • @MarcB "it's going to try and grab the entire string" - That's nonsense, but a naive implementation would try each run of non-`>` characters in the string (from left to right) to see whether it ends at the end of the string. That's basically quadratic in the length of the non-`>` chunk. – melpomene Aug 11 '15 at 21:58
  • @Plato: The text is in the demo I linked to. Pre-compiling seems to make no difference. – adam-p Aug 12 '15 at 00:27
  • @DanielA.White: Reversing gives a 10x improvement. Which is really good, although I would hope for better (constant time would be more like 100x). After reading some [performance tips articles](https://www.loggly.com/blog/five-invaluable-techniques-to-improve-regex-performance/) I got the impression that using `$` would help enormously, but I guess not. I naively assumed that `$` would cause the regex to start looking from the end. – adam-p Aug 12 '15 at 01:41

2 Answers2

4

Consider an input like this

abc<def>xyz

With your original expression, ([^>]+)$, the engine starts from a, fails on >, backtracks, restarts from b, then from c etc. So yes, the time grows with size of the input. If, however, you force the engine to consume everything up to the rightmost > first, as in:

.+>([^>]+)$

the backtracking will be limited by the length of the last segment, no matter how much input is before it.

The second expression is not equivalent to the first one, but since you're using grouping, it doesn't matter much, just pick matches[1].

Hint: even when you target javascript, switch to the pcre mode, which gives you access to the step info and debugger:

enter image description here

(look at the green bars!)

georg
  • 211,518
  • 52
  • 313
  • 390
  • I appreciate the explanation, and this may ultimately the correct explanation/answer, but... even though it goes from 77 steps to 11 steps (as reported by regex101), it takes even longer on my test data than my regex did, and the [regex101 snippet](https://regex101.com/r/wU3yE6/1) still times out. – adam-p Aug 12 '15 at 01:26
  • Here's a faster variation: `>([^>]+)$`. 6 steps and 100x faster. – adam-p Aug 12 '15 at 01:31
  • You can see the speed differences here: https://jsbin.com/fekaqo/edit?js,console (click "Run" in the console area). – adam-p Aug 12 '15 at 01:53
  • @adam-p: of course, there's no need to match the "junk". `>(.*)$` is the winner. Please post this as an answer. – georg Aug 12 '15 at 06:02
  • On a second thought, `>([^>]+)$` involves more backtracking than mine, it's an interesting question why it's so much faster. – georg Aug 12 '15 at 06:32
  • 1
    @georg: It's due to the engine implementation. I think you should edit the answer to the actual benchmark (+browser info) to conclude about the performance of the regex. By the way, the original regex has quadratic complexity. – nhahtdh Aug 12 '15 at 07:52
  • @nhahtdh: in the general case, my re [wins](http://jsfiddle.net/a5s09q97/). It must be something in the OP's input that makes their re run that fast. – georg Aug 12 '15 at 08:53
  • @georg: I modified my [jsbin test](https://jsbin.com/fekaqo/edit?js,console) with your regex. I improved/fixed it to work with my (newline-containing) input by replacing `.` with `[\s\S]`. This results in it being the fastest contender. Thanks! – adam-p Aug 13 '15 at 01:42
  • @georg: And... I created a new [jsbin test](https://jsbin.com/boyuye/edit?js,console) that does *not* match. With that input, `[\s\S]+>([^>]+)$` is catastrophic, but `^[\s\S]+>([^>]+)$` (same, but with beginning-of-line specified) is still fast. So, that's the ultimate winner. (I *think* that it still backtracks, but linearly, rather than higher-order. I could be wrong, though.) – adam-p Aug 13 '15 at 02:24
1

You could use the actual DOM instead of Regex, which is time consuming:

var html = "<div><span>blabla</span></div><div>bla</div>Here I am !";

var temp = document.createElement('div');
temp.innerHTML = html;
var lastNode = temp.lastChild || false;

if(lastNode.nodeType == 3){
    alert(lastNode.nodeValue);
}
blex
  • 24,941
  • 5
  • 39
  • 72
  • 1
    Careful about running it on untrusted input, though. `var html = "";` – Ry- Aug 11 '15 at 21:47
  • At this point in the code where this is running, the DOM is no longer available (although it could be refactored to be otherwise). I was also really just asking academically about the regex, not the specific usage. – adam-p Aug 12 '15 at 01:18
  • @minitech: Is there any safe way around this? – nhahtdh Aug 12 '15 at 07:58