How to parse this piece of HTML?

Question

Good morning! I am using C# (.NET Framework 3.5 SP1) and want to parse the following piece of HTML via regex:

<h1>My caption</h1>
<p>Here will be some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

I need the following output:

group 1: content of the h1
group 2: the text following the h1
group 3-n: content of each subcaption + its text

What I have at the moment:

<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>

This gives me only every odd subcaption plus its content (1, 3, ...), because the trailing <hr/> is consumed by the match and is no longer available as the start of the next one. For parsing the h1 caption I have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content. I'm fine with that for now.
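
A minimal sketch of one way around the "every odd section" effect, assuming the input always looks like the well-formed snippet above: the trailing <hr/> is moved into a lookahead, so it is checked but not consumed. The class and method names (SectionRegexSketch, ExtractSections) are made up for illustration, and the pattern will still break on malformed or nested markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, only meant for the fixed, well-formed snippet in the question.
public static class SectionRegexSketch
{
    // The trailing <hr/> sits inside a lookahead, so it is tested but not consumed
    // and stays available as the start of the next match. The \z alternative lets
    // the last section, which has no <hr/> after it, match as well.
    private static readonly Regex SectionPattern = new Regex(
        @"<hr[^>]*/>\s*<h2[^>]*>(.*?)</h2>(.*?)(?=<hr[^>]*/>|\z)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns one { caption, body } pair per <hr/><h2>...</h2> section.
    public static List<string[]> ExtractSections(string html)
    {
        var sections = new List<string[]>();
        foreach (Match m in SectionPattern.Matches(html))
        {
            sections.Add(new[] { m.Groups[1].Value.Trim(), m.Groups[2].Value.Trim() });
        }
        return sections;
    }
}
```

Calling SectionRegexSketch.ExtractSections(html) on the snippet above should return one entry per <h2>, not only every other one.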

Does anybody have a hint or solution for me, or an alternative approach (e.g. parsing the HTML via a reader and assigning the parts that way)?

Edit:
As some of you brought up the HtmlAgilityPack, I was curious about this nice tool and managed to get the content of the <h1> tag.
But my problem is parsing the rest, because the tags for the content may vary, from <p> to <div> and <ul>. At the moment this looks more or less like iterating over the whole document and parsing it tag by tag. Any hints?

See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 ;-) – Asher Dunn, Jan 19 '10 at 06:52
I remember having it beaten into people's heads not to use regex to parse HTML 9 years ago on Perl websites, but the answer here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 was so much better than anything I ever saw. – Erik, Jan 19 '10 at 06:54
I wonder if it would be possible to just automatically show that post whenever somebody types both "regexp" and "html" into the title box. Would save a lot of effort :) – Atli, Jan 19 '10 at 07:07
:)) Thank you, this was exactly the kind of answer/reason not to use regex! And as you may have noticed from my call for "any alternatives?", I had no clue of any better way. – Jan 19 '10 at 07:16
Possible duplicate of [What is the best way to parse html in C#?](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) – outis, Mar 30 '12 at 19:00

score 9 · Accepted Answer · edited May 23 '17 at 11:54


You will really need an HTML parser for this.


answered Jan 19 '10 at 06:51

YOU


score 6 · Answer 2 · answered Jan 19 '10 at 06:51


Don't use regex to parse HTML. Consider using the HTML Agility Pack.


Mark Byers


It's a bit hard for me to parse this piece with HtmlAgilityPack, as I do not know which patterns the content areas have (once they are `
– Jan 19 '10 at 08:11

score 2 · Answer 3 · answered Dec 19 '11 at 13:29

There are some possibilities:

Regex - Fast but not reliable; it can't deal with malformed HTML.

HtmlAgilityPack - Good, but it has many memory leaks. If you only need to process a few files, that is not a problem.

SgmlReader - Really good, but there is one problem: sometimes it can't find the default namespace needed to get the other nodes, and then it is impossible to parse the HTML.

http://developer.mindtouch.com/SgmlReader

Majestic-12 - Good, but not as fast as SgmlReader.

http://www.majestic12.co.uk/projects/html_parser.php

Example for SgmlReader (VB.NET):

' Needs a reference to the SgmlReader library (namespace Sgml)
Imports Sgml
Imports System.Xml.Linq

' vSource holds the HTML source as a String
Dim sgmlReader As New SgmlReader()
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
Dim htmldoc As XDocument = XDocument.Load(sgmlReader)
Dim XNS As XNamespace

' This part can fail: sometimes the default namespace cannot be determined
Try
    XNS = htmldoc.Root.GetDefaultNamespace()
Catch
    XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim() = "" Then
    XNS = "http://www.w3.org/1999/xhtml"
End If

' Use it with the LINQ commands, e.g. collect the content of all <script> elements
Dim Scripts As String = ""
For Each link In htmldoc.Descendants(XNS + "script")
    Scripts &= link.Value
Next

Majestic-12 works differently: you have to walk through every tag with a "Next" command. Example code ships with the DLL.
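
Once the document is in an XDocument (as in the SgmlReader example above), the grouping from the question could be done by walking siblings instead of matching tag patterns. The following is only a rough sketch in C#, assuming that each <h2> and its content are sibling elements under one parent and that a section ends at the next <hr> or <h2>. SectionGrouper and GroupSections are hypothetical names. Element names are compared by local name so the default-namespace issue mentioned above does not matter here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: groups each <h2> with the sibling elements that follow it.
public static class SectionGrouper
{
    // Returns one entry per <h2>: the caption as key, and as value the text of every
    // following sibling element up to the next <hr> or <h2>, whatever tag it uses
    // (<p>, <div>, <ul>, ...).
    public static List<KeyValuePair<string, string>> GroupSections(XElement container)
    {
        var sections = new List<KeyValuePair<string, string>>();
        foreach (XElement h2 in container.Elements().Where(e => e.Name.LocalName == "h2"))
        {
            // ToArray() keeps this compatible with the string.Join overload in .NET 3.5.
            string[] body = h2.ElementsAfterSelf()
                .TakeWhile(e => e.Name.LocalName != "hr" && e.Name.LocalName != "h2")
                .Select(e => e.Value.Trim())
                .ToArray();
            sections.Add(new KeyValuePair<string, string>(
                h2.Value.Trim(), string.Join("\n", body)));
        }
        return sections;
    }
}
```

For a quick try without SgmlReader, a well-formed snippet can be wrapped in a root element and loaded with XElement.Parse("<body>...</body>"). Real-world HTML would still have to go through one of the parsers listed above first.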

Hey there, thanks for the late answer :) Nice to know that there are some new libraries out there which cover this topic. Thanks! – Dec 19 '11 at 21:22

score 1 · Answer 4 · answered Jan 19 '12 at 16:58


As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found an extension for the HtmlAgilityPack called Fizzler: http://code.google.com/p/fizzler/ Using this you could find all <p> tags using:

var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();

Or find a specific div like <div id="myDiv"></div>:

var myDiv = doc.DocumentNode.QuerySelector("#myDiv");

It can't get any easier than that!


Jim Brown


And to answer your question, you can use the HtmlAgilityPack to manually iterate the document, or you can pass in an XPath argument to automatically get a specific element(s). Or you can use Fizzler to use selectors instead of XPath... – Jim Brown Jan 19 '12 at 17:28
