I'm trying to find a JS RegExp that finds all C-style multiline comment of the form /\*[\s\S]*?\*/
in O(n).
Other comment forms are intentionally omitted because it's easy to make regexes that find all C-style comments of the forms //.*
and /\*[\s\S]*?(?:\*/|$)
in O(n).
This question is focused on the O(n) part of /\*[\s\S]*?\*/
-comments.
The problem
It's easy to find a JS RegExp that can find all of those comments in O(n^2) time (see the first regex), but I can't seem to find one that does it in O(n). They all take O(n^2) time for input strings "/* ".repeat(n)
. Example:
const measure = fn => { const start = performance.now(); fn(); return performance.now() - start; }
// takes about 0.7ms on my machine
measure(() => /\/\*[\s\S]*?\*\//.test("/* ".repeat(1_000)))
// takes about 70ms on my machine
measure(() => /\/\*[\s\S]*?\*\//.test("/* ".repeat(10_000)))
The reason why the above regex in O(n^2) is that it takes O(n) steps to reject any suffix of the input string. The regex engine will move the pattern across the string from left to right checking all O(n) many suffixes (spec). Since it takes O(n) many steps to reject each of the O(n) many suffixes, the total runtime is O(n^2).
Non-solutions
It's doable if we only wanted to find the first comment
// takes about 3ms on my machine
measure(() => /^(?:[^/]|\/(?!\*))*\/\*[\s\S]*?\*\//.test("/* ".repeat(1_000_000)))
and it's obviously doable to make a little function that repeatedly uses this regex to find all matches, but this is not what I want.
Unambiguous regexes like in this answer also run in O(n^2) because the problem isn't backtracking.
// takes about 260ms on my machine
measure(() => /\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\//.test("/* ".repeat(10_000)))
I'm looking for a regex that I can plug into functions like string.replace(/<regex>/g, ...)
and that does its job in O(n). Is it possible for such a regex to exist, and if not, why?