
Hello, I am building a database of factual data about my book collection: titles, number of pages, width, length, author, author birthdate, publisher name, publisher address, and so on. For that purpose, I input ISBNs and the application fetches that info from the web, from a few sites I selected myself and that I know will, between them, have all the info I require. At the moment there are 3 sites, and there will most probably never be more than five. On each of these sites, I cURL a search page with the ISBN as a query parameter, extract the links the search page presents, then cURL these links and extract the above info (birth, title, publisher, etc.) from them. The extent of my scraping, therefore, is 3 x (search page + info page) = 6 HTML pages.

These pages all present the relevant information in ludicrous ways. For example, the publisher info has address, phone, email, and website in one HTML tag, with <br> tags as separators. Some publishers lack one of these fields, so the number of <br> tags is not even constant. Another of these sites uses <li> elements for most of the info, but an <a> for one field, a <p> for another, and a <div> for yet another, and so on.
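
To make the variable-field publisher block concrete, here is a minimal Python sketch of splitting such a block on its <br> separators and classifying each piece by its shape. The function name `split_publisher_block` and the classification rules are assumptions made for illustration. They are not taken from any of the real sites.

```python
import re

def split_publisher_block(inner_html):
    """Split a <br>-separated block and classify each piece by its shape."""
    # Accept <br>, <br/> and <br /> in any letter case as separators.
    pieces = re.split(r"<br\s*/?>", inner_html, flags=re.IGNORECASE)
    info = {"address": []}
    for piece in (p.strip() for p in pieces):
        if not piece:
            continue  # skip empty pieces left by consecutive separators
        if "@" in piece:
            info["email"] = piece
        elif re.match(r"(https?://|www\.)", piece, flags=re.IGNORECASE):
            info["website"] = piece
        elif re.fullmatch(r"[\d\s.+()-]{6,}", piece):
            # Assumption: a phone number is at least 6 digits/punctuation chars.
            info["phone"] = piece
        else:
            # Anything unrecognized is treated as an address line.
            info["address"].append(piece)
    return info
```

Because each piece is classified by content rather than by position, a missing field simply leaves its key out of the result instead of shifting the other fields.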

I have successfully extracted what I wanted with regex, then with a DOM parser. In the end, the code is far less readable with the DOM parser, as more operations are needed to extract each field. As an example:

<li>Né le : 23/12/1990 (ANGLETERRE)</li>

for a male author's birthdate could also show up for a female author as

<li>Née le : 11/07/1832</li>

With the DOM parser, I need to get a list of <li> elements, which is not enough, as some important info is in a <p>, a <div>, and an <a>. Then for each <li>, I need to check whether it contains "Né le" or "Née le", which takes either two ifs or a regex, then check whether there is a parenthesized birthplace and extract it, which is at least two more operations. With a regex, I can get it in one line of code.
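
A minimal Python sketch of that kind of one-line regex, applied to the two sample lines above. The pattern and the name `extract_birth` are illustrative assumptions. The sketch also assumes the date is always written as dd/mm/yyyy, as in the samples.

```python
import re

# "Née?" matches both "Né" and "Née"; the birthplace group is optional.
BIRTH_RE = re.compile(
    r"<li>\s*Née? le\s*:\s*(\d{2}/\d{2}/\d{4})(?:\s*\(([^)]*)\))?\s*</li>"
)

def extract_birth(html):
    """Return (date, birthplace) from the first matching <li>, or None."""
    match = BIRTH_RE.search(html)
    if match is None:
        return None
    # group(2) is None when no parenthesized birthplace is present.
    return match.group(1), match.group(2)

print(extract_birth("<li>Né le : 23/12/1990 (ANGLETERRE)</li>"))
# → ('23/12/1990', 'ANGLETERRE')
print(extract_birth("<li>Née le : 11/07/1832</li>"))
# → ('11/07/1832', None)
```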

Moreover, how exactly is a parser built? Does the underlying code use regexes, or something else? If it does use regexes, I figure a parsing engine carries a high performance cost compared with quick and dirty regexes?

So here are my two questions. First, how is a DOM parser built: is it with underlying regexes? Second, for my very limited scope of parsing six to ten pages, mostly for my personal use, shouldn't I go for code readability (and performance, depending on the answer to the first question)?

Best regards, Sebastian

pouzzler
  • You probably haven't seen [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) yet. – BoltClock Sep 15 '12 at 17:44
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – singpolyma Sep 15 '12 at 17:45
  • @BoltClock - For limited use like the OP is describing (well known HTML structure), regex is fine. – Oded Sep 15 '12 at 17:52
  • @Oded: I know that. I was referring to what the OP mentioned about using regex to implement parsers. – BoltClock Sep 15 '12 at 17:55

1 Answer


how is a DOM parser built, is it with underlying regexes?

It is a parser and normally would not be implemented with regex. Internally, one would go through the HTML one character at a time and use a state machine to "figure out" what each character means and how it fits into the DOM (this includes fixing broken HTML, closing elements that should be closed, and more).

If you can read C# (or Java), I suggest reading the source code for the HTML Agility Pack - in particular the Parse methods. It will show quite clearly how this is done.

The definitive source for how to parse HTML correctly is section 12.2 of the WHATWG HTML specification (note that the link is to the first page only; there is more). This is not for the faint of heart ;)
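
The character-by-character state machine described above can be sketched in a few lines of Python. This is a toy illustration with only two states and a hypothetical `tokenize` function. It is not how the HTML Agility Pack or the WHATWG algorithm is actually written. Real parsers add many more states (attributes, comments, script data) plus error recovery and tree building.

```python
def tokenize(html):
    """Split HTML into ("text", s), ("open", name) and ("close", name) tokens."""
    tokens = []
    state = "text"   # current state: "text" (outside a tag) or "tag" (inside <...>)
    buffer = []      # characters collected for the token being read

    for char in html:
        if state == "text":
            if char == "<":
                # A tag starts: emit any pending text and switch state.
                if buffer:
                    tokens.append(("text", "".join(buffer)))
                buffer = []
                state = "tag"
            else:
                buffer.append(char)
        else:  # state == "tag"
            if char == ">":
                content = "".join(buffer).strip()
                if content.startswith("/"):
                    tokens.append(("close", content[1:].strip().lower()))
                else:
                    # Keep only the tag name; attributes are ignored in this sketch.
                    parts = content.split()
                    name = parts[0].rstrip("/").lower() if parts else ""
                    tokens.append(("open", name))
                buffer = []
                state = "text"
            else:
                buffer.append(char)

    # Emit trailing text; an unterminated tag is silently dropped here.
    if state == "text" and buffer:
        tokens.append(("text", "".join(buffer)))
    return tokens
```

No regex is involved: each character either extends the current token or triggers a state transition, which is why a parser's cost grows roughly linearly with the input size.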

for my very limited scope of parsing six to ten pages, mostly for my personal use, shouldn't I go for code readability (and performance depending on the first question)?

Regex for parsing well known HTML formats is fine. People rage against trying to parse HTML from many different sources with regex, as this is not really possible (HTML not being a regular language, you end up with many exceptions and contradictions).
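
A small Python illustration of where regex breaks down on arbitrary HTML. The helper names `lazy_div` and `greedy_div` and the sample strings are made up for this sketch. Neither the lazy nor the greedy pattern can pair up opening and closing tags in general, because neither keeps track of nesting depth.

```python
import re

def lazy_div(html):
    """Content of the first <div>, matched lazily (stops at the FIRST </div>)."""
    return re.search(r"<div>(.*?)</div>", html).group(1)

def greedy_div(html):
    """Content of the first <div>, matched greedily (runs to the LAST </div>)."""
    return re.search(r"<div>(.*)</div>", html).group(1)

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

print(lazy_div(nested))      # → outer <div>inner   (outer div cut short)
print(greedy_div(siblings))  # → a</div><div>b      (two divs merged)
```

On a page whose structure is known and flat, such as a single <li> per field, neither failure mode arises.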

If this is for a limited use and limited HTML formats, go ahead and use regex. Do whatever is more readable for you.

carla
Oded