0

Can somebody explain to me why the following text:

<p>some text some text...</p>
<p>another text another <b>text</b>again</p>

can't be parsed with the following regular expression:

<p>.*?</p>

(to retrieve every paragraph). The regular expression that should match the text between the first opening <p> tag and the last closing </p> tag doesn't work either:

<p>.*</p>
John Saunders
Niccolo
  • Wow look someone is parsing HTML with Regex! (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – kennytm Mar 03 '10 at 13:40
  • "Doesn't work" is almost *never* an appropriate level of diagnostics. What does it do, compared with what you *want* it to do? – Jon Skeet Mar 03 '10 at 13:40
  • @Jon Skeet: well, "no matches are found" – Niccolo Mar 03 '10 at 13:43
  • Could you post some code sample of how you retrieve the match? – Aurril Mar 03 '10 at 13:44
  • have you tested your regex on other platforms, like an online regex tester? – dan Mar 03 '10 at 13:47
  • I feel that if I had to pick one take away from SO, it's that you shouldn't parse HTML with Regex, as it is not a regular language. – Ken Mar 03 '10 at 13:48
  • It could be XML or SGML... admittedly it's probably HTML and might be better served with a state machine, but if it's just extracting paragraph info then the HTML aspect shouldn't be of concern. @Serge, how is it failing? Can you show the code you are using this Regex within? – Lazarus Mar 03 '10 at 13:49
  • @Lazarus: The code is simple: var matches = Regex.Matches(inputText, regexPattern, options); The arguments are provided from the UI. Some other simple regexes are matched just fine. – Niccolo Mar 03 '10 at 14:06

3 Answers

1

My first guess is that you are attempting a multi-line match without telling the regex engine to do so. Take a look at the MSDN documentation for how to pass in the appropriate flag.

rerun
1

You can't parse HTML with RegEx.

Community
Jeff Yates
0

Besides the fact that it's dangerous to parse (X)HTML with regex, try the RegexOptions.Singleline option, which makes the dot match line breaks as well.
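
A minimal sketch of what that option changes, assuming the real input contains a paragraph that wraps across lines (the sample in the question does not show this, so it is only a guess at the cause). The helper FindAll and the variable wrapped are hypothetical names used for illustration; only Regex.Matches and RegexOptions come from .NET.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the text of every match of pattern in input.
static string[] FindAll(string input, string pattern, RegexOptions options) =>
    Regex.Matches(input, pattern, options)
         .Cast<Match>()
         .Select(m => m.Value)
         .ToArray();

// Assumed input: the first paragraph contains a line break.
string wrapped =
    "<p>some text\nsome text...</p>\n" +
    "<p>another text another <b>text</b>again</p>";

// Without Singleline, '.' stops at '\n', so the wrapped paragraph is skipped.
string[] withoutFlag = FindAll(wrapped, "<p>.*?</p>", RegexOptions.None);
Console.WriteLine(withoutFlag.Length);   // 1

// With Singleline, '.' also matches '\n', so both paragraphs are found.
string[] withFlag = FindAll(wrapped, "<p>.*?</p>", RegexOptions.Singleline);
Console.WriteLine(withFlag.Length);      // 2

// The greedy pattern then spans from the first <p> to the last </p>.
string[] greedy = FindAll(wrapped, "<p>.*</p>", RegexOptions.Singleline);
Console.WriteLine(greedy.Length);        // 1
```

Note that RegexOptions.Multiline would not help here: it only changes where ^ and $ match and leaves the dot unchanged. For anything beyond simple, well-formed fragments like this one, an HTML parser remains the safer choice.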

Bart Kiers