3

Good morning! I am using C# (.NET Framework 3.5 SP1) and want to parse the following piece of HTML via regex:

<h1>My caption</h1>
<p>Here will be some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

I need the following output:

  • group 1: content of h1
  • group 2: content of h1-following text
  • group 3-n: content of subcaptions + text

What I have at the moment:

<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>

This gives me only every odd subcaption + content (e.g. 1, 3, ...), because the trailing <hr/> is consumed by each match and is then no longer available as the start of the next one. For parsing the h1 caption I have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content; I'm fine with that at the moment.

Does anybody have a hint/solution for me, or an alternative approach (e.g. parsing the HTML via a reader and assigning the parts that way)?

edit:
As some brought up the HtmlAgilityPack, I was curious about this nice tool. I managed to get the content of the <h1> tag.
But ... my problem is parsing the rest. The cause: the tags for the content may vary, from <p> to <div> and <ul>... At the moment this seems to come down to more or less iterating over the whole document and parsing it tag by tag ...? Any hints?

  • 5
    See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 ;-) – Asher Dunn Jan 19 '10 at 06:52
  • 1
    I remember having it beaten into people's heads not to use regex to parse HTML 9 years ago on Perl websites, but the answer here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 was so much better than anything I ever saw. – Erik Jan 19 '10 at 06:54
  • 1
    I wonder if it would be possible to just automatically show that post whenever somebody types both "regexp" and "html" into the title box. Would save a lot of effort :) – Atli Jan 19 '10 at 07:07
  • :)) thank you ... this was exactly the kind of answer/reason not to use regex! thank you! and as you may have noticed from my call for "any alternatives?", I had no clue of any better way .. –  Jan 19 '10 at 07:16
  • possible duplicate of [What is the best way to parse html in C#?](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) – outis Mar 30 '12 at 19:00

4 Answers

9

You will really need an HTML parser for this.

Community
YOU
6

Don't use regex to parse HTML. Consider using the HTML Agility Pack.

Mark Byers
  • It's a bit hard for me to parse this piece with HTMLAgilityPack, as I do not know which patterns the content areas have (the enclosing tags vary from one section to the next). Can you give me some hooks? :)
    –  Jan 19 '10 at 08:11
2

There are some possibilities:

Regex - Fast but not reliable; it can't deal with malformed HTML.

HtmlAgilityPack - Good, but it has many memory leaks. If you only want to deal with a few files, that is no problem.

SGMLReader - Really good, but there is a problem: sometimes it can't find the default namespace needed to get other nodes, and then it is impossible to parse the HTML.

http://developer.mindtouch.com/SgmlReader

Majestic-12 - Good, but not as fast as SGMLReader.

http://www.majestic12.co.uk/projects/html_parser.php

Example for SGMLreader (VB.net)

' vSource is assumed to be a String holding the HTML source.
Dim sgmlReader As New Sgml.SgmlReader()
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
Dim htmldoc As System.Xml.Linq.XDocument = System.Xml.Linq.XDocument.Load(sgmlReader)
Dim XNS As System.Xml.Linq.XNamespace

' This part can fail: sometimes the default namespace cannot be determined,
' so fall back to the XHTML namespace.
Try
    XNS = htmldoc.Root.GetDefaultNamespace()
Catch
    XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim() = "" Then
    XNS = "http://www.w3.org/1999/xhtml"
End If

' Use it with the LINQ commands, e.g. collect the text of all <script> elements
Dim Scripts As String = ""
For Each link In htmldoc.Descendants(XNS + "script")
    Scripts &= link.Value
Next

Majestic-12 works differently: you have to walk to every tag with a "Next" command. You can find example code shipped with the DLL.

lexmooze
  • hey there ... thanks for the late answer :) nice to know, that there are some new libraries out there, which cover this topic ... thanks! –  Dec 19 '11 at 21:22
1

As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found Fizzler, a selector engine that works on top of the HtmlAgilityPack: http://code.google.com/p/fizzler/ Using this you could find all <p> tags using:

var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();

Or find a specific div like <div id="myDiv"></div>:

var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");

It can't get any easier than that!

Jim Brown
  • 1
    And to answer your question, you can use the HtmlAgilityPack to manually iterate the document, or you can pass in an XPath argument to automatically get a specific element(s). Or you can use Fizzler to use selectors instead of XPath... – Jim Brown Jan 19 '12 at 17:28
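
The "manually iterate the document" option mentioned above could look roughly like the following. This is only a minimal sketch, not a definitive implementation: it assumes the HtmlAgilityPack assembly is referenced, that the headings and content blocks are top-level siblings as in the question's snippet (for a full page you would start from the <body> node instead of DocumentNode), and the names Section and SectionParser are made up for illustration. Every <h1>/<h2> opens a new section, and whatever element follows (<p>, <div>, <ul>, ...) is collected into it, so the varying content tags do not matter.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party library, assumed to be referenced

// Hypothetical container for one caption plus the markup that follows it.
class Section
{
    public string Caption;
    public List<string> Content = new List<string>();
}

static class SectionParser
{
    // Walks the top-level nodes in document order. Every <h1>/<h2> opens a
    // new section; every other element is appended to the section that is
    // currently open. <hr> separators are skipped.
    public static List<Section> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sections = new List<Section>();
        Section current = null;

        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            // Skip whitespace text nodes and comments between the elements.
            if (node.NodeType != HtmlNodeType.Element) continue;
            if (node.Name == "hr") continue;

            if (node.Name == "h1" || node.Name == "h2")
            {
                current = new Section { Caption = node.InnerText.Trim() };
                sections.Add(current);
            }
            else if (current != null)
            {
                // Keep the full markup; use node.InnerText for plain text.
                current.Content.Add(node.OuterHtml);
            }
        }
        return sections;
    }
}
```

With the snippet from the question, the first entry of the returned list would hold the <h1> caption and its following text, and the remaining entries the subcaptions with their content, which matches the "group 1, group 2, group 3-n" layout that was asked for.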