I have a program that extracts information from many large HTML pages. I found that the last line (myRegex.Match(detailPage)) takes most of the execution time. Is the regex pattern optimized?

const string strRegex = @"prepend-top.*?<h1[^>]*?>(?<name>.+?)\s*<a.*?
    Create\ Date.*?<label>(?<create>.*?)</label>.*?
    <a.*?id\s*=\s*""period_report"".*?href\s*=\s*""(?<url>.*?)""";
const RegexOptions myRegexOptions =
            RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled |
            RegexOptions.IgnorePatternWhitespace;
var myRegex = new Regex(strRegex, myRegexOptions);
var m = myRegex.Match(detailPage);

The HTML looks like this (the file is about 30 KB, but most of it is JavaScript code):

<div class="span-24 prepend-top">
<h1>XXX XXX XXXX 
    <a href="https://....">Back to Search Results</a></h1>
</div>

<div class="span-18">
<div class="top-content">

<script type="text/javascript">
 .....
</script>

    <div class="detailHeaderContainer">
        <div class="leftBlock">

            <div class="left staticlabel leftStaticlabelWidth inlineColumn">
                <label>
                    Product Type:
                </label>
            </div>
            <div class="left leftDynamiclabelWidth dynamiclabel">
                <label>Type 2</label>
            </div>
            <div class="clear"></div>

            <div class="left staticlabel leftStaticlabelWidth inlineColumn">
                <label>
                    Create Date:
                </label>
            </div>
            <div class="left leftDynamiclabelWidth dynamiclabel">
                <label></label>
            </div>
ca9163d9
  • welcome to the world of regex. how large is the string that the regex is checking? – Brian Aug 17 '12 at 19:50
  • @Brian The size of the html file is about 30KB. – ca9163d9 Aug 17 '12 at 19:52
  • You're using too many `.`; it's an eager operator. Try changing it to something more specific, maybe a character sequence. Also, wrap any group that you don't want to capture in `(?:your_pattern)`. – Andre Calil Aug 17 '12 at 19:57
  • Agreed. A regex this complex on a 30 KB file is going to be slow. Is regex the only option, or could you do it with jQuery or use XPath as an alternative? – Brian Aug 17 '12 at 20:01
  • Consider that a learning experience and switch to more suitable tools: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Alexei Levenkov Aug 17 '12 at 20:05
  • A better question would be "why am I using a regular expression to parse HTML?" – Ed S. Aug 17 '12 at 20:11
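
One way to act on the "something more specific" advice above, short of switching to an HTML parser, is to split the single pattern into three small ones and search forward from each previous hit. The sketch below is only an illustration, not the asker's code: the class name DetailPageParser and the method TryParse are hypothetical, and it assumes that the id and href attributes sit inside the same <a> tag, in that order, which the HTML excerpt in the question does not show. Negated character classes such as [^<] and [^>] replace most of the `.*?` runs, so a failed match cannot wander across the whole 30 KB page.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; names and structure are assumptions for illustration.
public static class DetailPageParser
{
    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled;

    // Heading text up to the first nested <a>; [^<] cannot run past the heading text.
    private static readonly Regex NameRegex =
        new Regex(@"<h1[^>]*>\s*(?<name>[^<]*?)\s*<a\b", Options);

    // Text of a single <label> element.
    private static readonly Regex LabelRegex =
        new Regex(@"<label>\s*(?<create>[^<]*?)\s*</label>", Options);

    // Assumes id comes before href inside one <a ...> tag.
    private static readonly Regex UrlRegex =
        new Regex(@"<a\b[^>]*\bid\s*=\s*""period_report""[^>]*\bhref\s*=\s*""(?<url>[^""]*)""", Options);

    public static bool TryParse(string html, out string name, out string create, out string url)
    {
        name = create = url = null;

        // Plain substring search for the literal anchor, then match from there.
        int top = html.IndexOf("prepend-top", StringComparison.OrdinalIgnoreCase);
        if (top < 0) return false;
        Match m = NameRegex.Match(html, top);
        if (!m.Success) return false;
        name = m.Groups["name"].Value;

        // The first <label> that opens after "Create Date" holds the value.
        int date = html.IndexOf("Create Date", m.Index + m.Length, StringComparison.OrdinalIgnoreCase);
        if (date < 0) return false;
        m = LabelRegex.Match(html, date);
        if (!m.Success) return false;
        create = m.Groups["create"].Value;

        // Continue after the label for the report link.
        m = UrlRegex.Match(html, m.Index + m.Length);
        if (!m.Success) return false;
        url = m.Groups["url"].Value;
        return true;
    }
}
```

A call would look like DetailPageParser.TryParse(detailPage, out var name, out var create, out var url). If the real pages order the attributes differently or nest other tags inside the labels, the patterns need adjusting, and a proper HTML parser, as the comments suggest, is the more robust choice.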

2 Answers


I would suggest taking a quick look at the regex best practices guide on MSDN and this blog entry on the BCL team's blog. They go into the behavior of Regex and can provide guidance on why regexes can be slow.

Mgetz

Creating the Regex once, in a static class, instead of constructing it for every page can save a lot of time.
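
A minimal sketch of that idea, reading it as "build the Regex once and reuse it". The class name DetailPageRegex and the method MatchPage are hypothetical; the pattern and options are copied unchanged from the question. With RegexOptions.Compiled the constructor is the expensive part, so a static readonly field pays that cost once per process rather than once per page.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper; the pattern and options are the ones from the question.
public static class DetailPageRegex
{
    private const string Pattern = @"prepend-top.*?<h1[^>]*?>(?<name>.+?)\s*<a.*?
        Create\ Date.*?<label>(?<create>.*?)</label>.*?
        <a.*?id\s*=\s*""period_report"".*?href\s*=\s*""(?<url>.*?)""";

    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled |
        RegexOptions.IgnorePatternWhitespace;

    // Constructed (and compiled) once, on first use of the class.
    private static readonly Regex Instance = new Regex(Pattern, Options);

    // Regex instances are immutable, so this can be shared across threads.
    public static Match MatchPage(string detailPage) => Instance.Match(detailPage);
}
```

Each page is then processed with var m = DetailPageRegex.MatchPage(detailPage);. This removes the repeated construction cost only; it does not change how much backtracking the pattern itself does on a page where a match fails.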

ca9163d9