I'm trying to parse an HTML page, and I use the following regular expression:

var regex = new Regex(@"<tag1 id=.id1.>.*<tag2>", RegexOptions.Singleline);

"<tag1 id=.id1.>" occurs in the document only once. "<tag2>" occurs nearly 50 times after the occurrence of tag1. But when I try to match the page source against my regular expression, it returns only 1 match. Moreover, when I change RegexOptions to "None" or "Multiline", no matches are returned. I'm very confused about this and would appreciate any help.

yoozer8
Yury Pogrebnyak
  • [Parsing Html The Cthulhu Way](http://stackoverflow.com/a/1732454/540352) :) – Laoujin Sep 20 '12 at 15:15
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Austin Salonen Sep 20 '12 at 15:21

2 Answers

Parsing HTML with regex is a very bad idea, and it's unreliable because a lot of "broken HTML" still exists in the world. To parse HTML, I would suggest using the HTML Agility Pack. It is an excellent library for parsing HTML, and I have never had an issue with any HTML I've fed into it.

Icemanind

Leaving aside the obvious exhortations about not using regex to parse HTML, I can explain to you why you're seeing what you're seeing.

If tag1 occurs in your text only once, then the regex can only match it once, so there can never be more than one match. Regular expression matches "consume" the text they have matched, so the next match attempt starts at the end of the last successful match.

This leads to the next problem: .* is greedy, so it matches (with RegexOptions.Singleline) until the end of the string and then backtracks until the last <tag2> it finds in order to allow a successful match. Which is another reason why you only get one match.
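
To make the greedy behaviour concrete, here is a minimal sketch. The sample input and the names `GreedyDemo`, `Sample`, `GreedyMatches` and `CountTag2After` are made up for illustration; only the pattern comes from the question. It shows that the pattern yields a single match ending at the last <tag2>, and one possible way (an assumption about what the asker wants) to count the tag2 occurrences after tag1 instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical input: one tag1 followed by three tag2 elements on separate lines.
    public const string Sample =
        "<tag1 id='id1'>\n<tag2>a\n<tag2>b\n<tag2>c\n";

    // The pattern from the question, unchanged.
    public static MatchCollection GreedyMatches(string input)
    {
        var regex = new Regex(@"<tag1 id=.id1.>.*<tag2>", RegexOptions.Singleline);
        return regex.Matches(input);
    }

    // Alternative sketch: locate tag1 once, then look for each tag2 after it.
    public static int CountTag2After(string input)
    {
        Match anchor = Regex.Match(input, @"<tag1 id=.id1.>");
        if (!anchor.Success)
        {
            return 0;
        }
        // Matches(input, startat) begins scanning right after the end of tag1.
        var tag2 = new Regex(@"<tag2>");
        return tag2.Matches(input, anchor.Index + anchor.Length).Count;
    }

    public static void Main()
    {
        MatchCollection matches = GreedyMatches(Sample);
        Console.WriteLine(matches.Count);                        // 1
        // The single match stops right after the LAST <tag2>.
        int lastTag2End = Sample.LastIndexOf("<tag2>") + "<tag2>".Length;
        Console.WriteLine(matches[0].Index + matches[0].Length == lastTag2End); // True
        Console.WriteLine(CountTag2After(Sample));               // 3
    }
}
```

Changing `.*` to the lazy `.*?` would make the single match end at the first <tag2> instead of the last, but it would still be only one match, because tag1 itself occurs only once.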

As for your second question: Why do the matches go away if you don't use RegexOptions.Singleline? Simple: Without that option, the dot . cannot match newlines, and there appears to be at least one newline between tag1 and the first tag2.
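
The effect of the option can be checked with a small sketch. The two input strings and the names `SinglelineDemo`, `WithNewline`, `OnOneLine` and `IsMatchWith` are hypothetical; the pattern is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    public const string Pattern = @"<tag1 id=.id1.>.*<tag2>";

    // Hypothetical inputs: the first has a newline between tag1 and tag2, the second does not.
    public const string WithNewline = "<tag1 id='id1'>\n<tag2>";
    public const string OnOneLine = "<tag1 id='id1'> <tag2>";

    public static bool IsMatchWith(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, Pattern, options);
    }

    public static void Main()
    {
        // Singleline lets the dot match '\n', so the pattern can cross the line break.
        Console.WriteLine(IsMatchWith(WithNewline, RegexOptions.Singleline)); // True
        // Without Singleline the dot stops at '\n', so no match is possible.
        Console.WriteLine(IsMatchWith(WithNewline, RegexOptions.None));       // False
        // Multiline only changes the meaning of ^ and $, not of the dot.
        Console.WriteLine(IsMatchWith(WithNewline, RegexOptions.Multiline));  // False
        // With no newline in between, the option makes no difference.
        Console.WriteLine(IsMatchWith(OnOneLine, RegexOptions.None));         // True
    }
}
```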

Tim Pietzcker