I'm trying to replace the header tag inside some HTML with another string. My HTML looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"><head><title>aboutus</title> 

    <header id="headerfasdfasdfasdf">
       <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pulvinar commodo lorem, sit amet malesuada.</p>
    </header>

<!-- #include virtual="/html/US/global_header.html" --><script type="text/javascript">

   var header = document.getElementsByTagName("header");

    var len = header.length

    if(len > 1)

    {

    header[0].style.display = "none";

    }
</script>

    <!--ls:begin[component-1400226725207]-->

    <!-- OTHER PART IS CUT FOR BREVITY -->

</html>

I tried to parse it with the regex <header(.|\n|\r)*<\/header>, but it runs really slowly until I remove the |\r part from it.

I have also noticed that the original regex works fine with HTML that doesn't contain comments like <!--ls:begin[component-1400226725207]-->.

Note that I'm using the .NET regex engine with C#, and my replace code looks like this:

var regex = @"<header(.|\n|\r)*<\/header>";
var result = Regex.Replace(input, regex, to, RegexOptions.IgnoreCase);

Please help me understand why I have this issue.

Oleksii Aza
  • Why? Browsers do not care about CRLF, and you should not use regex to parse HTML. Why not just grab it from a renderer? – mplungjan May 20 '14 at 13:49
  • Relevant: http://stackoverflow.com/a/1732454/2424975 – Cereal May 20 '14 at 13:50
  • @mplungjan, I know that I shouldn't, but I don't have any choice here: I only have the resulting HTML and the part to insert. I am more interested in why this regex is slow on this input. – Oleksii Aza May 20 '14 at 13:58
  • That regex does *not* look safe (is greediness wanted?), but regarding the speed, remember that `.` with `RegexOptions.Singleline` will match any character, including newlines. – Robin May 20 '14 at 14:03
  • @Robin Thanks for the advice about Singleline. Why doesn't it look safe, and why could '\r' slow it down? – Oleksii Aza May 20 '14 at 14:11

2 Answers

If your input is pretty well sanitized (i.e. if you feel you can use regex to parse HTML), this would probably improve your speed significantly:

var regex = @"<header.*?</header>";
var result = Regex.Replace(input, regex, to, RegexOptions.IgnoreCase|RegexOptions.Singleline);
  • Avoid using .|\n|\r altogether; there's a flag for what you want to do (RegexOptions.Singleline).
  • Make your quantifier lazy (*?), as the header tag probably doesn't take up two thirds of your HTML.

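When backtracking from the end of the file to </header>, the greediness of (.|\n|\r)* made the regex engine check every element of the alternation before trying </header>. Every element you add to the alternation potentially adds a lot more work. It is worse here because . already matches \r (in .NET, . excludes only \n), so each carriage return can be matched by two different branches, and the engine may retry both while backtracking.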
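
To make the suggestion above concrete, here is a minimal sketch of the lazy, Singleline replacement wrapped in a helper. The class name HeaderReplacer, the method name ReplaceHeader, and the one-second match timeout are illustrative assumptions, not part of the original code. The timeout is only a safety net so that a pathological input throws a RegexMatchTimeoutException instead of hanging.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names and the timeout value are assumptions.
public static class HeaderReplacer
{
    // Singleline makes '.' match every character, including \n and \r,
    // so no alternation is needed. The lazy quantifier stops at the
    // first </header> instead of running to the end of the input.
    private static readonly Regex HeaderPattern = new Regex(
        @"<header.*?</header>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline,
        TimeSpan.FromSeconds(1)); // assumed safety limit

    public static string ReplaceHeader(string input, string replacement)
    {
        // A MatchEvaluator inserts the replacement literally, so a '$'
        // in it is not interpreted as a substitution pattern.
        return HeaderPattern.Replace(input, m => replacement);
    }
}
```

Calling HeaderReplacer.ReplaceHeader(input, to) would stand in for the Regex.Replace call from the question. Like any regex-based approach, it assumes header elements are not nested.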

Robin


Personally, I would use a simpler expression and tell it that . (dot) matches newlines too:

(?s)(?U)<header.*\/header>

(?s) means match newlines as well as other characters with . (dot)
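(?U) means match as few characters as possible (ungreedy). Note that this is PCRE syntax; the .NET engine does not support (?U), so write the quantifier as .*? there instead.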
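
Since the question is about .NET, where the inline (?s) option is available but (?U) is not, a rough .NET counterpart of this expression might look like the sketch below. It also shows how the greedy and lazy forms differ when the input contains more than one header element. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; names are assumptions for illustration.
public static class InlineOptionsDemo
{
    // (?s) is the inline form of RegexOptions.Singleline.
    // Lazy: each <header>...</header> pair is matched separately.
    public static string ReplaceLazy(string input, string replacement) =>
        Regex.Replace(input, @"(?s)<header.*?</header>", replacement);

    // Greedy: one match spans from the first <header to the last </header>.
    public static string ReplaceGreedy(string input, string replacement) =>
        Regex.Replace(input, @"(?s)<header.*</header>", replacement);
}
```

For an input such as "<header>a</header>\nkeep\n<header>b</header>", the lazy version leaves the text between the two elements in place, while the greedy version replaces everything from the first opening tag to the last closing tag.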