Why does my regex run very slowly when it contains '\r'?

Question

I'm just trying to replace the header tag inside some HTML with another string. My HTML looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"><head><title>aboutus</title> 

    <header id="headerfasdfasdfasdf">
       <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pulvinar commodo lorem, sit amet malesuada.</p>
    </header>

<!-- #include virtual="/html/US/global_header.html" --><script type="text/javascript">

   var header = document.getElementsByTagName("header");

    var len = header.length

    if(len > 1)

    {

    header[0].style.display = "none";

    }
</script>

    <!--ls:begin[component-1400226725207]-->

    <!-- OTHER PART IS CUT FOR BREVITY -->

</html>

I tried to parse it with the regex <header(.|\n|\r)*<\/header>, but it runs really slowly until I remove the |\r part from it.

I have also noticed that the original regex works fine with HTML that doesn't contain comments like the ones in the sample above.

Note that I'm using the .NET regex engine with C#, and my replace code looks like this:

var regex = @"<header(.|\n|\r)*<\/header>";
var result = Regex.Replace(input, regex, to, RegexOptions.IgnoreCase);

Please help me understand why I have this issue.

Why? Browsers do not care about CRLF and you should not use regex to parse HTML. Why not just grab it from a renderer? – mplungjan, May 20 '14 at 13:49
@mplungjan, I know that I shouldn't, but I don't have any choice: for this part I only have the resulting HTML and the part to insert. I am more interested in why this regex runs slowly on this input. – Oleksii Aza, May 20 '14 at 13:58
That regex does *not* look safe (is greediness wanted?), but regarding the speed, remember that `.` with `RegexOptions.Singleline` will match any character, including newlines. – Robin, May 20 '14 at 14:03
@Robin Thanks for the advice about Singleline. Why doesn't it look safe, and why could '\r' slow it down? – Oleksii Aza, May 20 '14 at 14:11

score 1 · Answer 1 · edited May 23 '17 at 12:20

If your input is pretty well sanitized (i.e. if you feel you can use regex to parse HTML), this would probably improve your speed significantly:

var regex = @"<header.*?</header>";
var result = Regex.Replace(input, regex, to, RegexOptions.IgnoreCase|RegexOptions.Singleline);

Avoid using .|\n|\r altogether; there's a flag (RegexOptions.Singleline) for what you want to do.
Make your quantifier lazy (*?), as the header tag probably doesn't take up two thirds of your HTML.

When backtracking from the end of the file to </header>, the greediness of (.|\n|\r)* made the regex engine check every element of the alternation before trying </header>. Any element you add to the alternation potentially makes for a lot more work.
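
One likely contributor that the explanation above does not spell out: in .NET, `.` without RegexOptions.Singleline excludes only \n, so it already matches \r. In (.|\n|\r)* every carriage return can therefore be consumed by two different branches, which multiplies the paths the engine may retry while backtracking. The sketch below checks that overlap and applies the pattern from this answer to a small CRLF sample. The class name, the ReplaceHeader helper and the sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderReplaceSketch
{
    // Hypothetical helper; the pattern and options are the ones from the answer above.
    public static string ReplaceHeader(string input, string replacement)
    {
        return Regex.Replace(
            input,
            @"<header.*?</header>",
            replacement,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Without Singleline, '.' matches '\r' but not '\n',
        // so the '.' and '\r' branches of (.|\n|\r) overlap.
        Console.WriteLine(Regex.IsMatch("\r", @"^.$")); // True
        Console.WriteLine(Regex.IsMatch("\n", @"^.$")); // False

        // Made-up sample with CRLF line endings and a trailing comment.
        string html = "<head>\r\n<header id=\"h\">\r\n<p>text</p>\r\n</header>\r\n<!-- comment -->\r\n</head>";
        Console.WriteLine(ReplaceHeader(html, "[NEW]"));
    }
}
```

If the input cannot be trusted, the Regex.Replace overload that takes a TimeSpan match timeout may be worth considering as an extra safeguard.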

score -1 · Answer 2 · answered May 20 '14 at 14:13

Personally, I would use a simpler expression and tell it that . (dot) matches newlines too:

(?s)(?U)<header.*\/header>

(?s) means match newlines as well as other characters with . (dot)
(?U) means match as few characters as possible

Graphic Equaliser

When I try your regex in Regex.Replace, it throws System.ArgumentException: parsing "(?s)(?U)" - Unrecognized grouping construct. – Oleksii Aza May 20 '14 at 14:16
That's PHP regex syntax, not C#/POSIX – Ryan Emerle May 20 '14 at 14:57
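
For comparison, .NET does accept the inline (?s) option but has no (?U) modifier; laziness is written on the quantifier itself as *?. A minimal sketch of what a .NET-compatible version of the pattern above might look like follows. The class name and the sample input are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineOptionSketch
{
    public static void Main()
    {
        // (?is) turns on IgnoreCase and Singleline inside the pattern itself;
        // *? is the lazy quantifier that stands in for PCRE's (?U).
        string pattern = @"(?is)<header.*?</header>";

        // Made-up sample input, not taken from the question.
        string input = "<HEADER id=\"a\">\nfirst\n</HEADER><p>kept</p>";

        Console.WriteLine(Regex.Replace(input, pattern, "[NEW]"));
        // prints: [NEW]<p>kept</p>
    }
}
```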
