0

How do I parse HTML using regular expressions in C#?

For example, given the HTML

<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>

I am trying to obtain

1. <s2>
2. t1
3. </s2>
4. <img src='1.gif' />
5. <span>
6. span1
7. <span/>

How do I do this using regular expressions in C#?

In my case, the HTML input is not well-formed XML (such as XHTML), so I cannot use an XML parser to do this.

jason
Mike108

5 Answers

6

Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them out however you like.
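For the well-formed case only, a minimal sketch of the XmlReader approach might look like the following. The class and method names (XhtmlTokens, Read) are hypothetical, the attribute quoting is simplified, and the fragment conformance setting is an assumption made so that several top-level elements are accepted. It will throw on input like the question's sample, where the span element is never closed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper: lists tags and text of a WELL-FORMED fragment in document order.
public static class XhtmlTokens
{
    public static List<string> Read(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment, // allow several top-level elements
            IgnoreWhitespace = true                        // skip whitespace-only nodes
        };
        var tokens = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        // IsEmptyElement must be read before moving onto the attributes.
                        bool empty = reader.IsEmptyElement;
                        var tag = new StringBuilder("<" + reader.Name);
                        while (reader.MoveToNextAttribute())
                        {
                            // Simplified quoting: assumes values contain no single quotes.
                            tag.Append(" " + reader.Name + "='" + reader.Value + "'");
                        }
                        tag.Append(empty ? " />" : ">");
                        tokens.Add(tag.ToString());
                        break;
                    }
                    case XmlNodeType.EndElement:
                        tokens.Add("</" + reader.Name + ">");
                        break;
                    case XmlNodeType.Text:
                    {
                        string text = reader.Value.Trim();
                        if (text.Length > 0) tokens.Add(text);
                        break;
                    }
                }
            }
        }
        return tokens;
    }
}
```

The tags are rebuilt from the reader's node information rather than copied from the input, so the original spacing and quote style inside a tag are not preserved.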

bobbymcr
  • In my case, the input is NOT well-formed xml. – Mike108 Oct 15 '09 at 02:07
  • 3
    Then you're in for a very complex problem, in general... HTML parsing with all of its implied elements, optional end tags, etc. is no fun. However, you might be able to leverage an existing library, such as... http://www.codeplex.com/htmlagilitypack – bobbymcr Oct 15 '09 at 02:10
  • 2
    No, regular expressions are *not* "a poor way to parse HTML", because that would imply that regular expressions can parse HTML *at all*, which is not the case. It is mathematically proven that regular expressions *cannot* parse HTML. In fact, pretty much every college student has to prove this at some point during a homework assignment or exam or something. – Jörg W Mittag Oct 15 '09 at 02:39
4

This has already been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages; that is why they are called regular expressions. HTML is not a regular language (as probably every college student in the last decade has proved at least once), and therefore cannot be parsed by regular expressions.

Jörg W Mittag
3

You might want to try the Html Agility Pack, http://www.codeplex.com/htmlagilitypack. It even handles malformed HTML.

nickytonline
0

I used this regex in C#, and it works. Thanks for all your answers.

<([^<]*)>|([^<]*)
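
As a rough sketch of how this pattern could be applied, the snippet below runs it with Regex.Matches and collects the pieces. The class and method names (RegexTagSplitter, Split) are hypothetical. Because the second alternative can match an empty string (for example at the end of the input), and text runs carry surrounding whitespace, each match is trimmed and empty results are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: splits markup into tags and text runs with the regex above.
public static class RegexTagSplitter
{
    // Either a <...> tag, or a run of characters containing no '<'.
    private static readonly Regex TokenPattern = new Regex(@"<([^<]*)>|([^<]*)");

    public static List<string> Split(string html)
    {
        var tokens = new List<string>();
        foreach (Match match in TokenPattern.Matches(html))
        {
            // Drop the empty matches and whitespace-only text between tags.
            string token = match.Value.Trim();
            if (token.Length > 0) tokens.Add(token);
        }
        return tokens;
    }
}
```

Calling RegexTagSplitter.Split on the sample line from the question yields the seven pieces listed there. This is tokenizing rather than parsing: it does not track nesting, and a '<' inside an attribute value, a '>' in text that follows a tag, a comment, or a script block will split in the wrong place.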
Mike108
-3

You might want to simply use string functions, treating < and > as the delimiters for parsing.
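
One way this could look, as a minimal sketch using only IndexOf and Substring: the class and method names (ScanTagSplitter, Split) are hypothetical, and treating an unterminated tag as running to the end of the input is an assumption made here, not something the answer specifies.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits markup into tags and text runs by scanning for '<' and '>'.
public static class ScanTagSplitter
{
    public static List<string> Split(string html)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < html.Length)
        {
            int end;
            if (html[pos] == '<')
            {
                // Tag: runs up to and including the next '>'.
                int close = html.IndexOf('>', pos);
                end = close < 0 ? html.Length : close + 1; // assumption: unterminated tag takes the rest
            }
            else
            {
                // Text: runs up to (not including) the next '<'.
                int open = html.IndexOf('<', pos);
                end = open < 0 ? html.Length : open;
            }
            string token = html.Substring(pos, end - pos).Trim();
            if (token.Length > 0) tokens.Add(token);
            pos = end;
        }
        return tokens;
    }
}
```

On the question's sample line this produces the same seven pieces as listed there. It shares the limits of any delimiter-based split: a '>' inside an attribute value, comments, and script content are not handled.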

junmats