Regex to find C-style multiline comments in O(n) time

Question

I'm trying to find a JS RegExp that finds all C-style multiline comment of the form /\*[\s\S]*?\*/ in O(n).

Other comment forms are intentionally omitted because it's easy to make regexes that find all C-style comments of the forms //.* and /\*[\s\S]*?(?:\*/|$) in O(n).

This question is focused on the O(n) part of /\*[\s\S]*?\*/-comments.

The problem

It's easy to find a JS RegExp that can find all of those comments in O(n^2) time (see the first regex), but I can't seem to find one that does it in O(n). They all take O(n^2) time for input strings "/* ".repeat(n). Example:

const measure = fn => { const start = performance.now(); fn(); return performance.now() - start; }

// takes about 0.7ms on my machine
measure(() => /\/\*[\s\S]*?\*\//.test("/* ".repeat(1_000)))
// takes about 70ms on my machine
measure(() => /\/\*[\s\S]*?\*\//.test("/* ".repeat(10_000)))

The reason why the above regex in O(n^2) is that it takes O(n) steps to reject any suffix of the input string. The regex engine will move the pattern across the string from left to right checking all O(n) many suffixes (spec). Since it takes O(n) many steps to reject each of the O(n) many suffixes, the total runtime is O(n^2).

Non-solutions

It's doable if we only wanted to find the first comment

// takes about 3ms on my machine
measure(() => /^(?:[^/]|\/(?!\*))*\/\*[\s\S]*?\*\//.test("/* ".repeat(1_000_000)))

and it's obviously doable to make a little function that repeatedly uses this regex to find all matches, but this is not what I want.

Unambiguous regexes like in this answer also run in O(n^2) because the problem isn't backtracking.

// takes about 260ms on my machine
measure(() => /\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\//.test("/* ".repeat(10_000)))

I'm looking for a regex that I can plug into functions like string.replace(/<regex>/g, ...) and that does its job in O(n). Is it possible for such a regex to exist, and if not, why?

JavaScript regex engine is based on ECMAScript, not on PCRE. — Wiktor Stribiżew, Dec 09 '20 at 12:30
The problem would not be with the pattern, which is fine, but with the algorithm that the regex engine uses. Nothing requires it uses the one you laid out. — Bergi, Dec 09 '20 at 13:26
@Bergi The [ECMAScript spec](https://tc39.es/ecma262/#sec-regexpbuiltinexec) describes the algorithm I outlined and all JS regex engines are required to implement it or something that behaves the same way. — Michael Schmidt, Dec 09 '20 at 16:48
@MichaelSchmidt Yes, "behaves the same way" as in "has the same result", but not requiring an inefficient `O(n²)` algorithm. — Bergi, Dec 09 '20 at 16:50
@Bergi Sounds like you know an algorithm that can do better than O(n^2) and supports capturing groups. I haven't looked too deeply into the subject but I'm not aware of any techniques that can do it. Could you please share? — Michael Schmidt, Dec 09 '20 at 16:56
@MichaelSchmidt Tbh i don't remember whether there is such an algorithm (or, given that your regex doesn't use capturing groups, a sufficient one that doesn't support them). My main point is that the runtime complexity doesn't (only) depend on the pattern, but (mainly) on the algorithm that is used for the matching, and you don't have a lot of control over that. There is, after all, only one regular language containing exactly the words you want, and not that many regular expressions are available to recognise it. — Bergi, Dec 09 '20 at 18:32
That's sort of the point of the question. Given that the regex engine uses this algorithm, for a specific regular language, can we find a regex with O(n) runtime, and if not, why? Since each regular language is accepted by infinitely many regexes, can we find one that does it in O(n)? — Michael Schmidt, Dec 09 '20 at 20:32

score 0 · Answer 1 · answered Dec 15 '20 at 15:34

I figured it out. Using the pattern to find the first comment in O(n) and the y flag, we can repeatedly search for the "next first" comment.

someString.replace(/((?:[^/]|\/(?!\*))*)(\/\*[\s\S]*?\*\/)/yg, "$1{replacement}")

// takes about 3ms on my machine
measure(() => /((?:[^/]|\/(?!\*))*)(\/\*[\s\S]*?\*\/)/y.test("/* ".repeat(1_000_000)))

// takes about 2ms on my machine
measure(() => " /* bar */ ".repeat(10_000).replace(/((?:[^/]|\/(?!\*))*)(\/\*[\s\S]*?\*\/)/yg, "$1foo"))
// takes about 20ms on my machine
measure(() => " /* bar */ ".repeat(100_000).replace(/((?:[^/]|\/(?!\*))*)(\/\*[\s\S]*?\*\/)/yg, "$1foo"))

The trick is that the y flag disables the left-to-right movement of the pattern matcher, so the regex engine doesn't have to try out all possible suffixes in case no match was found. Assuming that the regex engine efficiently backtracks such that the above pattern rejects or accepts each input in O(n), the total runtime of the above replacement is O(n).

Unfortunately, this solution isn't a drop-in replacement since the replacement has re-insert the prefix before the comment.

I also think that an O(n) comment regex without the use of the y flag is impossible because it would have to reject each rejecting suffix in O(1) and accept each accepting suffix in O(m) (where m is the number of characters of the comment), which seems impossible to do. However, I don't have proof of this.

Regex to find C-style multiline comments in O(n) time

The problem

Non-solutions

1 Answers1