Hello, I am building a database of factual data about my book collection: titles, number of pages, width, length, author, author birthdate, publisher name, publisher address, and so on. To do that, I enter ISBNs and the application fetches the information from the web, from a few sites I selected myself and that I know will, between them, have all the info I require. At the moment there are 3 sites, and there will most probably never be more than five. On each of these sites, I fetch a search page with cURL, passing the ISBN as a query parameter, extract the links the search page presents, then fetch those links and extract the above info (birthdate, title, publisher, etc.) from them. The extent of my scraping is therefore 3 x (search page + info page) = 6 HTML pages.
These pages all present the relevant information in ludicrous ways. For example, the publisher info has address, phone, email and website all in one HTML tag, with <br> tags as separators. Some publishers lack one of these fields, so the number of <br> tags is not even constant. Another of these sites uses <li> elements for most of the info, but an <a> for one field, a <p> for another, and a <div> for another, and so on.
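To illustrate the variable number of <br> separators, here is a minimal Python sketch that splits such a block and classifies each piece by its shape instead of its position. The function name, the field patterns and the sample data are all hypothetical; the real sites' formats may differ.

```python
import re

def parse_publisher_block(inner_html):
    """Split a <br>-separated block and guess each field from its shape.

    Hypothetical helper: the patterns below are assumptions, not the
    actual format used by any particular site.
    """
    # Split on <br>, <br/> or <br />, in any letter case.
    parts = [p.strip() for p in re.split(r"<br\s*/?>", inner_html, flags=re.IGNORECASE)]
    info = {}
    for part in parts:
        if not part:
            continue  # skip empty pieces left by consecutive separators
        if "@" in part:
            info["email"] = part
        elif re.match(r"(https?://|www\.)", part, flags=re.IGNORECASE):
            info["website"] = part
        elif re.fullmatch(r"[\d\s.+()-]+", part):
            info["phone"] = part
        else:
            # Anything unrecognized is appended to the address.
            info["address"] = (info.get("address", "") + " " + part).strip()
    return info

# Made-up sample with no website field.
sample = "12 rue Exemple<br>75000 Paris<br/>01 23 45 67 89<br />contact@example.org"
print(parse_publisher_block(sample))
```

Because each field is recognized by its content, a missing field simply produces no key. One caveat: an address line made only of digits would be mistaken for a phone number.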
I have successfully extracted what I wanted with regexes, then with a DOM parser. In the end, the code is far less readable with the DOM parser, as more operations are needed to extract each field. As an example:
<li>Né le : 23/12/1990 (ANGLETERRE)</li>
for a male author's birthdate could also show up, for a female author, as
<li>Née le : 11/07/1832</li>
With the DOM parser, I need to get a list of <li> elements, which is not enough, as some important info is in a <p>, a <div>, and an <a>. Then, for each <li>, I need to check whether it contains "Né le" or "Née le", which takes either two ifs or a regex, and then check whether there is a parenthesized birthplace and extract it, which is at least two more operations. With a regex, I can get it all in one line of code.
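As an illustration of the one-line regex approach, here is a minimal Python sketch run against the two sample lines above. The names BIRTH_RE and extract_birth are hypothetical, and the pattern assumes the date is always dd/mm/yyyy and the birthplace, when present, is the only parenthesized text.

```python
import re

# Matches "Né le" or "Née le", a dd/mm/yyyy date, and an optional "(PLACE)".
BIRTH_RE = re.compile(
    r"<li>\s*Née? le\s*:\s*(\d{2}/\d{2}/\d{4})(?:\s*\(([^)]*)\))?\s*</li>"
)

def extract_birth(html):
    """Return (birthdate, birthplace or None), or None if nothing matches."""
    m = BIRTH_RE.search(html)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(extract_birth("<li>Né le : 23/12/1990 (ANGLETERRE)</li>"))
print(extract_birth("<li>Née le : 11/07/1832</li>"))
```

The optional non-capturing group handles the missing birthplace, so both variants go through the same pattern. This only holds as long as the site keeps this exact markup.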
Moreover, how exactly is a parser built? Does the underlying code use regexes, or is it something else? If it does use regexes, I figure there is a high performance cost to using a parsing engine compared with quick and dirty regexes?
So here are my two questions. First, how is a DOM parser built: is it with underlying regexes? Second, for my very limited scope of parsing six to ten pages, mostly for my personal use, shouldn't I go for code readability (and performance, depending on the answer to the first question)?
Best regards, Sebastian