MatchCollection timeout

Question

I am using MatchCollection to parse HTML, but this approach takes a long time and sometimes fails. I think setting a timeout on the MatchCollection would solve the problem. How can I set a timeout for MatchCollection? (.NET Framework 4.0)

    anchorPattern[0] = "<div.*?class=\"news\">.*?<div.*?class=\".*?date.*?\">(?<date>.*?)?</div>.*?<a.*?href=\"(?<link>.*?)\".*?>(?<title>.*?)?</a>.*?<(span.*?class=\".*?desc.*?\">(?<spot>.*?)?</span>)?";
    MatchCollection mIcerik = Regex.Matches(html, anchorPattern[i], RegexOptions.Compiled);
    if (mIcerik.Count > 0)
        ListDegree.Add(i, mIcerik.Count);

Do you know that the most upvoted answer on Stack Overflow recommends against using regex as a parsing tool for HTML? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Steve, Nov 16 '12 at 14:26
I've heard the [HTML Agility Pack](http://htmlagilitypack.codeplex.com/) is the go-to HTML/DOM parser for .NET. — Martin Ender, Nov 16 '12 at 14:30
Yes, I know. But the HTML source is not always valid, so an HTML parser is not necessarily better. For example, closing tags are sometimes missing from the HTML text, so I prefer to use regex. — RockOnGom, Nov 16 '12 at 14:39
@alikoyuncu a good HTML parser is more likely to deal with invalid HTML properly than any regex you could come up with. — Martin Ender, Nov 16 '12 at 14:47
When you say "it fails sometimes", what do you mean? Is there an exception message? If so, can you give details, please? — Barry Kaye, Nov 16 '12 at 15:18
@Barry Kaye there isn't an exception. When I inspect mIcerik in the debugger, I see "Cannot evaluate expression because a thread is stopped at a point where garbage collection is impossible, possibly because the code is optimized." — RockOnGom, Nov 23 '12 at 10:12

score 0 · Answer 1 · answered Feb 08 '13 at 19:42

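Your regular expression has far too many ".*?" tokens, and for some of your inputs the number of possible backtracking combinations probably becomes enormous. Try using atomic groups "(?>...)", which automatically throw away all backtracking positions remembered by any tokens inside the group. Note that wrapping a lazy quantifier directly, as in "(?>.*?)", makes the group match only the empty string, because the lazy quantifier tries zero characters first and the atomic group then forbids expanding it. Put a bounded subpattern inside the group instead, for example "(?>[^>]*)" for the rest of a tag. That will at least keep the matching time for each input within a reasonable bound.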
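As a rough illustration of the advice above, the following sketch rewrites only the link/title part of the pattern with negated character classes and one atomic group. The class name, method names, and the shortened pattern are hypothetical and not taken from the question; the real pattern would need the same treatment for its other pieces.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundedPatternSketch
{
    // Hypothetical, shortened pattern covering only the <a ...>title</a> part.
    // [^>]* cannot run past the end of the current tag, and [^"]* cannot run
    // past the closing quote. The atomic group (?>[^>]*) keeps the engine from
    // retrying shorter matches for the rest of the tag after a later failure.
    public const string LinkPattern =
        "<a\\b[^>]*?href=\"(?<link>[^\"]*)\"(?>[^>]*)>(?<title>.*?)</a>";

    // Returns the value of the "link" group for every match in the input.
    public static string[] ExtractLinks(string html)
    {
        MatchCollection matches = Regex.Matches(html, LinkPattern, RegexOptions.Singleline);
        string[] links = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            links[i] = matches[i].Groups["link"].Value;
        }
        return links;
    }

    // Shows why a lazy quantifier should not be wrapped directly: inside the
    // atomic group, .*? settles on the empty string and cannot grow afterwards,
    // so this only matches when nothing stands between <a> and </a>.
    public static bool AtomicLazyMatches(string html)
    {
        return Regex.IsMatch(html, "<a>(?>.*?)</a>");
    }
}
```

Calling `BoundedPatternSketch.ExtractLinks(html)` on a page returns the href values of its anchors. The remaining `.*?` for the title is still lazy and unbounded, so very large inputs with unclosed anchors can still be slow.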

score 0 · Answer 2 · edited Feb 26 '23 at 09:02

// Give up on a match attempt after one minute.
TimeSpan timeout = new TimeSpan(0, 1, 0);

anchorPattern[0] = "<div.*?class=\"news\">.*?<div.*?class=\".*?date.*?\">(?<date>.*?)?</div>.*?<a.*?href=\"(?<link>.*?)\".*?>(?<title>.*?)?</a>.*?<(span.*?class=\".*?desc.*?\">(?<spot>.*?)?</span>)?";

MatchCollection mIcerik = Regex.Matches(html, anchorPattern[i], RegexOptions.Compiled, timeout);

// MatchCollection is evaluated lazily: reading Count runs the matching
// and throws RegexMatchTimeoutException if the timeout elapses.
if (mIcerik.Count > 0)
    ListDegree.Add(i, mIcerik.Count);

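The TimeSpan parameter sets a time-out interval for the matching operation. If the interval elapses, a RegexMatchTimeoutException is thrown. Alternatively, pass Regex.InfiniteMatchTimeout to indicate that the method should not time out. Note that this overload was added in .NET Framework 4.5, so it is not available when targeting 4.0. See the MSDN documentation for Regex.Matches().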
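A minimal sketch of how the time-out might be handled, assuming .NET Framework 4.5 or later (the version that introduced the time-out overloads). The class and method names are hypothetical, and returning -1 on a time-out is just one possible convention.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TimeoutSketch
{
    // Returns the number of matches, or -1 if matching exceeded the time-out.
    public static int CountMatches(string input, string pattern, TimeSpan timeout)
    {
        try
        {
            MatchCollection matches = Regex.Matches(input, pattern, RegexOptions.None, timeout);

            // MatchCollection is populated lazily, so the time-out surfaces
            // when Count is read (or the collection is enumerated), not when
            // Regex.Matches returns. Keep that access inside the try block.
            return matches.Count;
        }
        catch (RegexMatchTimeoutException)
        {
            // The pattern took too long on this input; report it to the caller.
            return -1;
        }
    }
}
```

In the question's loop this could be used as `int count = TimeoutSketch.CountMatches(html, anchorPattern[i], timeout);`, adding to `ListDegree` only when `count > 0`.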
