19

I would like to know if there is a simple way to parse HTML in VB.NET. I know that HTML is not a strict subset of XML, but it would be nice if it could be treated that way. Is there anything out there that would let me parse HTML in an XML-like way in VB.NET?

Charles Stewart
tooleb
  • It might help us to answer if you state what you want to do with it once it's parsed. – Rob Feb 05 '09 at 17:02
  • For now, I'm interested in being able to select all the links, text and images in specific divs, where the div class or id may change from page to page. – tooleb Feb 05 '09 at 17:14
  • It sounds like the HTML Agility Pack will probably work for me. Are there any other options? – tooleb Feb 05 '09 at 17:25

5 Answers

13

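' Add a project reference to Microsoft.mshtml,

' then in the code file:

Imports mshtml

' Parses the HTML with the MSHTML (Internet Explorer) engine, copies each
' link's text into its title attribute, and returns the modified body markup.
Function parseMyHtml(ByVal htmlToParse As String) As String
    Dim htmlDocument As IHTMLDocument2 = New HTMLDocumentClass()
    htmlDocument.write(htmlToParse)
    htmlDocument.close()

    ' body.all and tags() are typed as Object, so cast them explicitly
    ' (required when Option Strict is On).
    Dim allElements As IHTMLElementCollection = DirectCast(htmlDocument.body.all, IHTMLElementCollection)
    Dim anchors As IHTMLElementCollection = DirectCast(allElements.tags("a"), IHTMLElementCollection)

    For Each element As IHTMLElement In anchors
        element.title = element.innerText
    Next

    Return htmlDocument.body.innerHTML
End Function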

As found here:

Siddharth Rout
  • Doesn't this essentially use the same libraries that IE uses to load its DOM? I've tried this before, but it always feels so dirty. – tooleb Apr 09 '10 at 12:56
9

I like the HTML Agility Pack: it is very developer-friendly, free, and its source code is available.
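For the use case described in the question's comments (selecting the links inside a specific div), a minimal sketch might look like the following. It assumes the HtmlAgilityPack assembly is referenced; the module name, the `GetLinksInDiv` helper, the `divId` parameter, and the sample markup are all hypothetical names made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module LinkExtractor
    ' Hypothetical helper: returns the href of every link inside the div
    ' with the given id. Assumes divId contains no single quotes, because
    ' it is pasted directly into the XPath expression.
    Function GetLinksInDiv(ByVal html As String, ByVal divId As String) As List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim xpath As String = "//div[@id='" & divId & "']//a[@href]"
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes(xpath)

        Dim links As New List(Of String)
        ' SelectNodes returns Nothing, not an empty collection, when nothing matches.
        If nodes IsNot Nothing Then
            For Each node As HtmlNode In nodes
                links.Add(node.GetAttributeValue("href", String.Empty))
            Next
        End If
        Return links
    End Function

    Sub Main()
        ' Sample markup with an unclosed <p> and a bare <br>, i.e. not valid XML.
        Dim html As String = "<div id='content'><p>Intro<a href='/one'>One</a><br><a href='/two'>Two</a></div>" &
                             "<a href='/outside'>Outside</a>"

        For Each href As String In GetLinksInDiv(html, "content")
            Console.WriteLine(href)
        Next
    End Sub
End Module
```

Since the div class or id may change from page to page, passing it in as a parameter keeps the XPath query reusable; matching on `@class` instead of `@id` would work the same way.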

derloopkat
TcKs
  • But self-documenting code is developer friendly. I understand that the term "developer friendly" can be very subjective; however, I have tried several ways to parse and modify HTML, and this one is simply the best you can get for .NET free of charge. The absence of documentation doesn't change that. It's the cruel reality. – TcKs Oct 11 '10 at 11:15
  • I wasn't able to get the HTML Agility pack to do anything useful for me. All I was getting was the straight HTML output to the textbox, instead of the parsed formatted HTML. – Joel R. Dec 17 '12 at 20:06
  • @JoelR. You did something horribly wrong. Did you read some tutorials about that? – TcKs Dec 18 '12 at 15:19
  • @Tcks I didn't seem to find any tutorials on CodePlex. But I also didn't look too hard either. If you have any links for some good tutorials it would be helpful. – Joel R. Dec 18 '12 at 16:26
  • Search Google for "html agility pack tutorial". You can't miss it. – TcKs Dec 18 '12 at 18:32
6

Don't use the Agility Pack; just use the mshtml library to access the DOM. This is what IE uses, and it is great for going through HTML elements.

The Agility Pack is nasty and unnecessarily hacky if you ask me; mshtml is the way to go. Look it up on MSDN.

Erx_VB.NExT.Coder
4

If your HTML follows XHTML standards, you can do a lot of the parsing and processing using the System.XML namespace classes.
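As a rough sketch of the well-formed case, `XmlDocument` from `System.Xml` can load the markup and query it with XPath. The sample fragment and the module name are made up for illustration, and the fragment deliberately has no DOCTYPE and no `xmlns` declaration; a real XHTML page declares the XHTML namespace, in which case the XPath query would also need an `XmlNamespaceManager`.

```vbnet
Imports System
Imports System.Xml

Module XhtmlExample
    Sub Main()
        ' Hypothetical well-formed fragment: every tag is closed and every
        ' attribute value is quoted.
        Dim xhtml As String = "<div id=""content""><a href=""/one"">One</a><img src=""logo.png"" alt=""Logo"" /></div>"

        Dim doc As New XmlDocument()
        Try
            doc.LoadXml(xhtml)
        Catch ex As XmlException
            ' Markup that is not well-formed fails here instead of being repaired.
            Console.WriteLine("Not well-formed XML: " & ex.Message)
            Return
        End Try

        ' Print the target and text of every link inside the "content" div.
        For Each link As XmlNode In doc.SelectNodes("//div[@id='content']//a[@href]")
            Console.WriteLine(link.Attributes("href").Value & " -> " & link.InnerText)
        Next
    End Sub
End Module
```

The `Try`/`Catch` matters here: unlike a browser, an XML parser rejects a document at the first unclosed tag or unquoted attribute.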

If, on the other hand, what you're parsing is what web developers refer to as "tag soup," you'll need a third-party parser like the HTML Agility Pack.

This may be only a partial solution to your problem if you're trying to figure out how a browser will interpret your HTML, as each browser parses tag soup slightly differently.

Yes - that Jake.
1

Is it well formed? If the HTML is in fact well formed, then it can be parsed as XML. If it is tag soup, with unclosed elements and the like, I would think you would have to hunt around for a third-party solution.

Andrew Hare