I am using MatchCollection to parse HTML, but this approach takes a long time and sometimes fails. I think setting a timeout on the MatchCollection would solve the problem. How can I set a timeout for MatchCollection? (.NET Framework 4.0)

    anchorPattern[0] = "<div.*?class=\"news\">.*?<div.*?class=\".*?date.*?\">(?<date>.*?)?</div>.*?<a.*?href=\"(?<link>.*?)\".*?>(?<title>.*?)?</a>.*?<(span.*?class=\".*?desc.*?\">(?<spot>.*?)?</span>)?";
    MatchCollection mIcerik = Regex.Matches(html, anchorPattern[i], RegexOptions.Compiled);
    if (mIcerik.Count > 0)
        ListDegree.Add(i, mIcerik.Count);
Steven Doggart
RockOnGom
  • Do you know that the most upvoted answer on Stack Overflow recommends against using regex as a parsing tool for HTML? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Steve Nov 16 '12 at 14:26
  • I've heard the [HTML Agility Pack](http://htmlagilitypack.codeplex.com/) is the go-to HTML/DOM parser for .NET. – Martin Ender Nov 16 '12 at 14:30
  • Yes, I know. But the HTML source is not always well-formed, and in that case an HTML parser is no better. For example, sometimes a closing tag is missing from the HTML text. So I prefer to use regex. – RockOnGom Nov 16 '12 at 14:39
  • @alikoyuncu a good HTML parser is more likely to deal with invalid HTML properly than any regex you could come up with. – Martin Ender Nov 16 '12 at 14:47
  • When you say "it fails sometimes", what do you mean? Is there an exception message? If so, can you give details, please? – Barry Kaye Nov 16 '12 at 15:18
  • @m.buettner Nothing else to add. – Steve Nov 16 '12 at 15:53
  • @Barry Kaye there isn't an exception. When I inspect mIcerik in the debugger, I see "Cannot evaluate expression because a thread is stopped at a point where garbage collection is impossible, possibly because the code is optimized." – RockOnGom Nov 23 '12 at 10:12
  • Do you have some test data you could provide? – porges Feb 08 '13 at 19:45

2 Answers

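Your regular expression has far too many ".*?" tokens, and for some inputs the number of ways the engine can try to match probably gets close to "infinite" (catastrophic backtracking). Try using atomic groups, "(?>...)", which automatically throw away all backtracking positions remembered by any tokens inside the group. Keep in mind that a lazy ".*?" on its own inside an atomic group, "(?>.*?)", is locked at zero characters, so the group should wrap a larger sub-pattern or a greedy negated class such as "(?>[^>]*)". That should at least keep each match attempt from running for an impractically long time.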
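As a rough illustration of the atomic-group idea, here is a minimal sketch. It does not use the asker's full pattern: the simplified link pattern, the sample HTML, and the names AtomicGroupSketch, LinkPattern, and ExtractLinks are all made up for this example. Each atomic group wraps either a negated character class or a lazy token together with the literal that ends it, so once the group has matched, the engine cannot re-enter it to try other splits.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AtomicGroupSketch
{
    // Hypothetical, simplified pattern (not the asker's original).
    // (?>[^>]*?href=") : skip attributes up to href=", then lock that choice in.
    // (?>[^"]*)        : the link value, with no backtracking into it.
    // (?>[^<]*)        : the title, assuming it contains no nested tags.
    public const string LinkPattern =
        @"<a\b(?>[^>]*?href="")(?<link>(?>[^""]*))""(?>[^>]*)>(?<title>(?>[^<]*))</a>";

    // Returns one "link|title" string per anchor found in the input.
    public static List<string> ExtractLinks(string html)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, LinkPattern, RegexOptions.IgnoreCase))
        {
            results.Add(m.Groups["link"].Value + "|" + m.Groups["title"].Value);
        }
        return results;
    }

    public static void Main()
    {
        string html = "<a class=\"x\" href=\"/news/1\">First story</a>" +
                      "<a href=\"/news/2\" id=\"b\">Second story</a>";
        foreach (string item in ExtractLinks(html))
            Console.WriteLine(item);

        // A lazy ".*?" alone inside an atomic group stays at zero characters,
        // so "a(?>.*?)b" only matches when "b" directly follows "a".
        Console.WriteLine(Regex.IsMatch("axb", "a(?>.*?)b")); // False
        Console.WriteLine(Regex.IsMatch("ab", "a(?>.*?)b"));  // True
    }
}
```

The negated classes do most of the work here: "[^>]*" can never run past the end of the current tag, which is what an unbounded ".*?" is free to do when a later part of the pattern fails.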

Fran Casadome
    // One-minute limit for the whole matching operation.
    TimeSpan timeout = new TimeSpan(0, 1, 0);

    anchorPattern[0] = "<div.*?class=\"news\">.*?<div.*?class=\".*?date.*?\">(?<date>.*?)?</div>.*?<a.*?href=\"(?<link>.*?)\".*?>(?<title>.*?)?</a>.*?<(span.*?class=\".*?desc.*?\">(?<spot>.*?)?</span>)?";

    // The last argument is the match timeout.
    MatchCollection mIcerik = Regex.Matches(html, anchorPattern[i], RegexOptions.Compiled, timeout);

    if (mIcerik.Count > 0)
        ListDegree.Add(i, mIcerik.Count);

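The TimeSpan parameter sets a timeout interval for the matching operation; if the interval is exceeded, a RegexMatchTimeoutException is thrown. Alternatively, you can pass Regex.InfiniteMatchTimeout to indicate that the method should not time out. Note that this overload was introduced in .NET Framework 4.5, so it is not available on 4.0. See MSDN: Regex.Matches()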
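A minimal sketch of how the timeout might be handled, assuming .NET Framework 4.5 or later. The class MatchTimeoutSketch, the helper CountMatches, and the deliberately pathological pattern and input are invented for this example. Because MatchCollection is evaluated lazily, the exception surfaces when Count (or enumeration) forces the matching to run, not at the Regex.Matches call itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchTimeoutSketch
{
    // Hypothetical helper: returns the number of matches,
    // or -1 if the regex engine exceeded the given timeout.
    public static int CountMatches(string input, string pattern, TimeSpan timeout)
    {
        try
        {
            MatchCollection matches =
                Regex.Matches(input, pattern, RegexOptions.None, timeout);
            // Reading Count forces the lazy collection to run the matching,
            // so this is where a timeout would be raised.
            return matches.Count;
        }
        catch (RegexMatchTimeoutException)
        {
            return -1;
        }
    }

    public static void Main()
    {
        // Well-behaved case: two occurrences of "abc".
        Console.WriteLine(CountMatches("abc abc", "abc", TimeSpan.FromSeconds(1)));

        // Made-up pathological case: nested quantifiers on input that cannot match.
        // On a backtracking engine this is expected to hit the timeout and print -1.
        string evilInput = new string('a', 40) + "!";
        Console.WriteLine(CountMatches(evilInput, "^(a+)+$", TimeSpan.FromMilliseconds(200)));
    }
}
```

Returning -1 is just one way to signal the timeout; depending on the application, logging the offending input or falling back to a simpler pattern may be more useful.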

Glorfindel
waloar