15

A recent blog entry by Jeff Atwood says that you should never parse HTML using regular expressions, yet it doesn't give an alternative.

I want to scrape search results, extracting values:

<div class="used_result_container"> 
   ...
      ...
         <div class="vehicleInfo"> 
            ...
               ...
                  <div class="makemodeltrim">
                     ...
                     <a class="carlink" href="[Url]">[MakeAndModel]</a>
                     ...
                  </div> 
                  <div class="kilometers">[Kilometers]</div> 
                  <div class="price">[Price]</div> 
                  <div class="location">
                     <span class='locationText'>Location:</span>[Location]
                  </div> 
               ...          
            ...
         </div> 
      ...
   ...
</div> 

...and it repeats

You can see the values I want to extract, [enclosed in brackets]:

  • Url
  • MakeAndModel
  • Kilometers
  • Price
  • Location

Assuming we accept the premise that parsing HTML with regular expressions is a bad idea:

What's the way to do it?

Assumptions:

  • native Win32
  • loose html

Assumption clarifications:

Native Win32

  • .NET/CLR is not native Win32
  • Java is not native Win32
  • perl, python, ruby are not native Win32
  • assume C++, in Visual Studio 2000, compiled into a native Win32 application

Native Win32 applications can call library code:

  • copied source code
  • DLLs containing function entry points
  • DLLs containing COM objects
  • DLLs containing COM objects that are COM-callable wrappers (CCW) around managed .NET objects

Loose HTML

  • xml is not loose HTML
  • xhtml is not loose HTML
  • strict HTML is not loose HTML

Loose HTML implies that the HTML is not well-formed XML (strict HTML is not well-formed XML anyway), and so an XML parser cannot be used. In reality I was presenting the assumption that any HTML parser must be generous in the HTML it accepts.


Clarification#2

Assuming you like the idea of turning the HTML into a Document Object Model (DOM), how then do you access repeating structures of data? How would you walk a DOM tree? I need a DIV node with a class of used_result_container that contains a DIV node with a class of vehicleInfo. But the nodes don't necessarily have to be direct children of one another.

It sounds like I'm trading one set of regular expression problems for another. If they change the structure of the HTML, I will have to re-write my code to match - as I would with regular expressions. And assuming we want to avoid those problems, because those are the problems with regular expressions, what do I do instead?

And would I not be writing a regular expression parser for DOM nodes? I'd be writing an engine to parse a string of objects, using an internal state machine and forward and back capture. No, there must be a better way - the way that Jeff alluded to.

I intentionally kept the original question vague, so as not to lead people down the wrong path. I didn't want to imply that the solution, necessarily, had anything to do with:

  • walking a DOM tree
  • xpath queries

Clarification#3

The sample HTML I provided was trimmed down to the important elements and attributes. The mechanism I used to trim it down was based on my internal bias toward regular expressions. I naturally think that I need various "sign-posts" in the HTML to look for.

So don't confuse the presented HTML for the entire HTML. Perhaps some other solution depends on the presence of all the original HTML.

Update 4

The only proposed solutions seem to involve using a library to convert the HTML into a Document Object Model (DOM). The question then would have to become: then what?

Now that I have the DOM, what do I do with it? It seems that I still have to walk the tree with some sort of regular DOM expression parser, capable of forward matching and capture.

In this particular case I need all the used_result_container DIV nodes which contain vehicleInfo DIV nodes as children. Any used_result_container DIV nodes that do not contain vehicleInfo as a child are not relevant.

Is there a DOM regular expression parser with capture and forward matching? I don't think XPath can select higher level nodes based on criteria of lower level nodes:

\\div[@class="used_result_container" && .\div[@class="vehicleInfo"]]\*

Note: I use XPath so infrequently that I cannot make up hypothetical xpath syntax very goodly.

Cœur
Ian Boyd
  • +1 You've already specified that you need to accept badly formed HTML. Other possible assumptions you could specify. Solution should be as resistant as possible to changes in the structure of the page being scraped. Also specify what languages are acceptable and are .NET/COM components acceptable? – MarkJ Nov 24 '09 at 15:04
  • Parsing HTML is not generally a bad idea; it is a bad idea to try it with regular expressions. – Svante Nov 24 '09 at 15:46
  • COM components are acceptable from a Win32 application, preferably if they are already registered on a supported Microsoft Windows operating system. .NET components can only be called from native Win32 if they have a COM Callable Wrapper (CCW), which depends on the library. – Ian Boyd Nov 24 '09 at 16:00
  • You should consider that the time for retrieving the web page will almost always be longer than the parsing in a more high-level language than C++. – weismat Nov 24 '09 at 16:11
  • The use of C++ is not for performance; it is there to enforce the assumption that I'm doing this from a non-.NET, non-Java, non-Python, non-Ruby language. – Ian Boyd Nov 24 '09 at 16:16
  • What do you mean by 'repeating structures of data'? Do you mean that you have a list of `vehicleInfo` divs on your page, and you want to extract the `carlink`s of each div? – int3 Nov 24 '09 at 17:26
  • Personally, I think you should retag the question to C++ or C ... in many (if not most) high-level languages, what you want to do is a matter of a few lines and the right library for HTML parsing ... and writing a general DOM traversal library is a matter of about 100 ... the difference between regular expressions and things like XPath queries, E4X expressions and jQuery selectors is that the latter take advantage of HTML's semantics, whereas the former only operate on strings. – back2dos Nov 24 '09 at 17:26
  • @int3: Yes. (padding to make it 15 characters) – Ian Boyd Nov 26 '09 at 15:03
  • Since you found the answer (IHTMLDOMDocument2) I think your new venture deserves another question which is simply an XPath/DOM tree walk problem. This way you can avoid the confusion which new visitors experience when they see a huge question with many followup edits. – Sedat Kapanoglu Nov 26 '09 at 15:34
  • @ssg: That assumes that the final solution involves IHTMLDOMDocument2, or a DOM tree altogether. No, DOM is a means, but not the ends. And the question still stands. – Ian Boyd Nov 26 '09 at 16:02
  • Just embed Perl in your program. ... DONE. – Brad Gilbert Nov 27 '09 at 15:52
  • You're extremely lucky that the author of the page was so good about naming divs in a way that reflects content, not presentation. Even with your constraints, that makes the problem orders of magnitude easier. – Stephen Harmon Nov 28 '09 at 14:52

12 Answers

8

Python:

lxml - faster, perhaps better at parsing bad HTML

BeautifulSoup - if lxml fails on your input try this.

Ruby: (heard of the following libraries, but never tried them)

Nokogiri

hpricot

Though if your parsers choke, and you can roughly pinpoint what is causing the choking, I frankly think it's okay to use a regex hack to remove that portion before passing it to the parser.
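To make the "regex hack" concrete, here is a minimal sketch. It assumes the choking portion is an inline script block, and it uses only the Python standard library, so the strict `xml.etree.ElementTree` parser stands in for whichever parser is failing. `PAGE` and `strip_scripts` are hypothetical names made up for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: otherwise well-formed markup, except for a script block
# whose bare "<" and "&&" make a strict parser raise.
PAGE = """<div class="used_result_container">
<script type="text/javascript">if (a < b && c) { track(); }</script>
<div class="price">$4,500</div>
</div>"""

# Non-greedy match from an opening <script ...> tag to the nearest closing tag.
SCRIPT_BLOCK = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(markup):
    """Remove every <script>...</script> block before handing the text to a parser."""
    return SCRIPT_BLOCK.sub("", markup)


# The regex only removes the offending region; the parser does the real work.
root = ET.fromstring(strip_scripts(PAGE))
price = root.find(".//div[@class='price']").text
print(price)
```

The point is that the regex is never asked to understand the document structure. It only cuts out a region you have already identified as the problem.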

If you do decide on using lxml, here are some XPath tutorials that you may find useful. The lxml tutorials kind of assume that you know what XPath is (which I didn't when I first read them.)

Edit: Your post has really grown since it first came out... I'll try to answer what I can.

i don't think XPath can select higher level nodes based on criteria of lower level nodes:

It can. Try //div[@class='vehicleInfo']/parent::div[@class='used_result_container']. Use ancestor if you need to go up more levels. lxml also provides a getparent() method on its search results, and you could use that too. Really, you should look at the XPath sites I linked; you can probably solve your problems from there.
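In full XPath 1.0 you can also put the condition in a predicate, e.g. //div[@class='used_result_container'][.//div[@class='vehicleInfo']], which keeps matching when the two divs are not direct parent and child. As a rough sketch of the same idea using only Python's standard library: `xml.etree.ElementTree` implements just a subset of XPath and needs well-formed input, so the descendant test is written in Python here. The sample markup, the `id` attributes and the function name are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a parsed page. Container "a" holds a
# vehicleInfo div two levels down; container "b" holds none.
DOC = ET.fromstring("""
<body>
  <div class="used_result_container" id="a">
    <div class="wrapper"><div class="vehicleInfo">car</div></div>
  </div>
  <div class="used_result_container" id="b">
    <div class="advert">no vehicle</div>
  </div>
</body>
""")


def containers_with_vehicle(root):
    """Return the container divs that have a vehicleInfo div at any depth."""
    return [
        div
        for div in root.iter("div")
        if div.get("class") == "used_result_container"
        and div.find(".//div[@class='vehicleInfo']") is not None
    ]


matches = containers_with_vehicle(DOC)
print([div.get("id") for div in matches])  # prints ['a']
```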

how then do you access repeating structures of data?

It would seem that DOM queries are exactly suited to your needs. XPath queries return you a list of the elements found -- what more could you want? And despite its name, lxml does accept 'loose HTML'. Moreover, the parser recognizes the 'sign-posts' in the HTML and structures the whole document accordingly, so you don't have to do it yourself.
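Here is a minimal sketch of what "a list of the elements found" looks like for your five values. It assumes a tolerant parser has already turned the page into a tree; to keep the example dependency-free, a well-formed sample is parsed with the standard library's `xml.etree.ElementTree` instead of lxml. The sample values and the names `PAGE`, `clean` and `extract_vehicles` are made up, and every record is assumed to contain all five fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like the markup in the question.
PAGE = """<body>
<div class="used_result_container"><div class="vehicleInfo">
  <div class="makemodeltrim"><a class="carlink" href="/cars/1">Honda Civic</a></div>
  <div class="kilometers">120,000 km</div>
  <div class="price">$4,500</div>
  <div class="location"><span class="locationText">Location:</span> Toronto</div>
</div></div>
<div class="used_result_container"><div class="vehicleInfo">
  <div class="makemodeltrim"><a class="carlink" href="/cars/2">Ford Focus</a></div>
  <div class="kilometers">80,000 km</div>
  <div class="price">$6,200</div>
  <div class="location"><span class="locationText">Location:</span> Ottawa</div>
</div></div>
</body>"""


def clean(text):
    """Strip whitespace and treat a missing text node as an empty string."""
    return (text or "").strip()


def extract_vehicles(root):
    """Build one dict per vehicleInfo div, whatever its depth in the tree."""
    vehicles = []
    for info in root.iterfind(".//div[@class='vehicleInfo']"):
        link = info.find(".//a[@class='carlink']")
        label = info.find(".//span[@class='locationText']")
        vehicles.append({
            "url": link.get("href"),
            "make_and_model": clean(link.text),
            "kilometers": clean(info.findtext(".//div[@class='kilometers']")),
            "price": clean(info.findtext(".//div[@class='price']")),
            # The location is the text that follows the label span (its "tail").
            "location": clean(label.tail),
        })
    return vehicles


vehicles = extract_vehicles(ET.fromstring(PAGE))
for vehicle in vehicles:
    print(vehicle)
```

Each query is relative to one `vehicleInfo` node, so the repeating structure is handled by an ordinary loop rather than by forward matching and capture.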

Yes, you still have to do a search on the structure, but at a higher level of abstraction. If the site designers decide to do a page overhaul and completely change the names and structure of their divs, then that's too bad: you have to rewrite your queries, but it should take less time than rewriting your regex. Nothing will do it automatically for you, unless you want to write some AI capabilities into your page-scraper...

I apologize for not providing 'native Win32' libraries, I'd assumed at first that you simply meant 'runs on Windows'. But the others have answered that part.

int3
5

Use Html Agility Pack for .NET

Update

Since you need something native/antique, and the markup is likely bad, I would recommend running the markup through Tidy and then parsing it with Xerces

Josh Stodola
  • I was writing the same - only I prefixed it with "You don't specify your development tool(s) of choice... however you have specified windows so if you use .NET then:" – Murph Nov 24 '09 at 15:07
  • i did not specify the compiler, but i did specify **native** Win32. Let's pretend it's C++. – Ian Boyd Nov 24 '09 at 16:03
5

Native Win32

You can always use IHtmlDocument2. This is built-in to Windows at this point. With this COM interface, you get native access to a powerful DOM parser (IE's DOM parser!).

Frank Krueger
  • i've used the IHtmlDocument2 in the past. i also have source code for an object that can parse invalid HTML and turn it into a DOM. Any ideas then on how to walk a DOM tree, and repeating structures? – Ian Boyd Nov 24 '09 at 16:18
  • @Ian, the point is that IHtmlDocument2 will be able to handle HTML from the wild - mangled as it is. Walking the DOM is as easy as calling `all` and working with elements (the DOM is hierarchical). It's not fun, but if you want to stay native, this is an "easy" solution. http://msdn.microsoft.com/en-us/library/aa752582(VS.85).aspx – Frank Krueger Nov 25 '09 at 00:43
3

Use Beautiful Soup.

Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. There's also a Ruby port called Rubyful Soup.

Dominic Rodger
2

If you are really under Win32 you can use a tiny and fast COM object to do it

example code with vbs:

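' Create an MSHTML document through its COM ProgID
Set dom = CreateObject("htmlfile")
' The parser accepts loose markup, including the stray </a> below
dom.write("<div>Click for <img src='http://www.google.com/images/srpr/logo1w.png'>Google</a></div>")
' Read the src attribute of the first image in the parsed document
WScript.Echo(dom.Images.item(0).src)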

You can also do this in JScript, or VB/Delphi/C++/C#/Python etc. on Windows. It uses the DOM, layout and parser in mshtml.dll directly.

est
0

Use a DOM parser

e.g. for java check this list

Open Source HTML Parsers in Java (I like to use cobra)

Or, if you are sure that you only want to parse a certain subset of your HTML, which ideally is also valid XML, you could use an XML parser to parse only the fragment you pass in, and then even use XPath to request the values you are interested in.

Open Source XML Parsers in Java (e.g. dom4j is easy to use)

jitter
0

The alternative is to use an html dom parser. Unfortunately, it seems like most of them have problems with poorly formed html, so in addition you need to run it through html tidy or something similar first.

Rob
  • I think we're looking for specific recommendations of specific parsers and tidiers. – MarkJ Nov 24 '09 at 15:01
  • Thanks, he hadn't provided any specifics at the time, but it seems he's added information. – Rob Nov 24 '09 at 17:03
0

If a DOM parser is out of the question - for whatever reason, I'd go for some variant of PHP's explode() or whatever is available in the programming language that you use.

You could for example start out by splitting by <div class="vehicleInfo">, which would give you each result (remember to ignore the first piece, which is everything before the first result). After that you could loop over the results and split each one by <div class="makemodeltrim"> etc.

This is by no means an optimal solution, and it will be quite fragile (almost any change in the layout of the document would break the code).
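As a rough sketch of the splitting idea, here it is in Python, where `str.split` plays the role of `explode()`. The sample markup and the helper names `between` and `split_results` are invented for the illustration, and only two of the fields are pulled out.

```python
# Hypothetical page: a header followed by two results.
PAGE = (
    '<html><body><h1>Results</h1>'
    '<div class="vehicleInfo"><div class="kilometers">120,000 km</div>'
    '<div class="price">$4,500</div></div>'
    '<div class="vehicleInfo"><div class="kilometers">80,000 km</div>'
    '<div class="price">$6,200</div></div>'
    '</body></html>'
)


def between(chunk, opening, closing="</div>"):
    """Return the text after `opening` up to the next `closing`, or None if absent."""
    parts = chunk.split(opening, 1)
    if len(parts) < 2:
        return None
    return parts[1].split(closing, 1)[0].strip()


def split_results(markup):
    """Split on the result marker, then pick fields out of each piece."""
    # Piece 0 is everything before the first result, so it is skipped.
    pieces = markup.split('<div class="vehicleInfo">')[1:]
    return [
        {
            "kilometers": between(piece, '<div class="kilometers">'),
            "price": between(piece, '<div class="price">'),
        }
        for piece in pieces
    ]


results = split_results(PAGE)
print(results)
```

Note how the markers are exact strings: an extra attribute, different quoting, or a nested div inside a field is enough to break it, which is the fragility mentioned above.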

Another option would be to go after some CSS selector library like phpQuery or similar for your programming language.

phidah
0

I think libxml2, despite its name, also does its best to parse tag soup HTML. It is a C library, so it should satisfy your requirements. You can find it here.

BTW, another answer recommended lxml, which is a Python library, but is actually built on libxml2. If lxml worked well for him, chances are libxml2 is going to work well for you.

LaC
0

How about using Internet Explorer as an ActiveX control? It will give you a fully rendered structure as it viewed the page.

Epsilon Prime
0

The HTML::Parser and HTML::Tree modules in Perl are pretty good at parsing most typical so-called HTML on the web. From there, you can locate elements using XPath-like queries.

Randal Schwartz
0

What do you think about ihtmldocument2? I think it should help.

user160820
  • If you could post some code it would go a long way to help me understanding how a stylesheet transform could help. – Ian Boyd Nov 26 '09 at 15:07