
I'm trying to parse search results from WorldCat.org in order to fetch basic information about books and articles.

A typical search result (and the one I'm using for testing) can be found here: http://www.worldcat.org/search?q=ti%3Aorganizations&fq=dt%3Abks&qt=advanced&dblist=638

The HTML for that page is here: http://pastebin.com/w2U91F1i

Here is the regular expression I'm using with PHP preg_match_all to capture basic details about each entry:

$data = file_get_contents($url);
preg_match_all('/<div class="oclc_number">(.*?)<\/div>\n.*?<div class="name">\n.*?<a href="(.*?)"><strong>(.*?)<\/strong><\/a>\n.*?\n\n<div class="author">by\s(.*?)<\/div><div class="type">.*?<span class=\'itemType\'>(.*?)<\/span>.*?\n.*?<span class="itemLanguage">(.*?)<\/span>.*?<div class="type">Publication:\s*?(.*?)<\/div>/', $data, $topics, PREG_SET_ORDER);

When I test this expression in the RegExr tool (http://gskinner.com/RegExr/) it works just fine, except that there I have to use \r instead of \n (usually \r doesn't work for me). With preg_match_all, however, I get an empty array every time.
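
One way to check whether line endings explain the difference is to count them in the fetched data and compare a literal \n against PCRE's \R, which matches \r\n, \n, or \r. This is only a diagnostic sketch: the $data fragment is a made-up stand-in for the real response, and it assumes the page uses \r\n line endings, which has not been confirmed.

```php
<?php
// Hypothetical fragment with Windows-style (\r\n) line endings,
// standing in for the real file_get_contents($url) result.
$data = "<div class=\"oclc_number\">123</div>\r\n<div class=\"name\">Example</div>";

// Count each line-ending style present in the input.
$crlf = substr_count($data, "\r\n");
$lf   = substr_count($data, "\n") - $crlf;   // bare \n not preceded by \r

// A literal \n right after </div> cannot match when a \r sits in between...
$strict = preg_match('/<\/div>\n<div class="name">/', $data);
// ...whereas \R accepts \r\n, \n, or \r.
$loose  = preg_match('/<\/div>\R<div class="name">/', $data);

var_dump($crlf, $lf, $strict, $loose);
```

If the counts show \r\n in the real data, replacing each \n in the pattern with \R would be the smallest change to try.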

Any clues as to what I'm doing wrong?

tchaymore
  • You're using regular expressions to parse HTML. – Ignacio Vazquez-Abrams Nov 23 '10 at 00:25
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Phil Nov 23 '10 at 00:26
  • @Ignacio short and sweet, gotta love it, but not too helpful. – tchaymore Nov 23 '10 at 00:31
  • @Phil Brown thanks for the link, very helpful. – tchaymore Nov 23 '10 at 00:32
  • @Ignacio: That’s not nuanced enough. The problem is that he’s trying to parse generic and complex HTML not of his devising, and his regex skills just aren’t up to understanding whatever fixed structure may occur there. Given those problem constraints, the simplest approach is to use code somebody has already written and which is known to work. [So even though you can](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491), you shouldn't. Anybody who still writes regexes that look like noise shouldn't be using them at all, really. – tchrist Nov 23 '10 at 00:51

2 Answers


Whenever I need to scrape HTML, I tend to use the Simple HTML DOM Parser library, which parses an HTML document into a traversable PHP object that you can query much like jQuery.
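
As a rough illustration of the DOM-based approach, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM Parser, so that it runs without any extra library. The markup is a hypothetical fragment modelled on the class names in the question's regex, not the actual WorldCat page, and the "result" wrapper class is made up.

```php
<?php
// Hypothetical fragment modelled on the class names in the question's regex.
$html = <<<HTML
<div class="result">
  <div class="oclc_number">12345</div>
  <div class="name"><a href="/title/example/oclc/12345"><strong>Example Title</strong></a></div>
  <div class="author">by Jane Doe</div>
</div>
HTML;

// Real-world HTML is rarely valid, so collect libxml warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Query the tree instead of matching the raw text.
$xpath = new DOMXPath($doc);
$records = [];
foreach ($xpath->query('//div[@class="name"]/a') as $link) {
    $records[] = [
        'href'  => $link->getAttribute('href'),
        'title' => trim($link->textContent),
    ];
}

print_r($records);
```

Because the query works on the parsed tree, whitespace and line endings between the tags no longer matter.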

enobrev

HTML is not a regular language; don't try to parse it with regular expressions!

Read the first answer here:

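RegEx match open tags except XHTML self-contained tags (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)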

Dan Grossman
  • Wrong. Regular expressions aren't. The reason not to try has nothing to do with whether you can. It has to do with how much trouble it is. You can parse anything with modern regular expressions because they are not SCHOOLBOY REGULAR. But just because you can, does not mean you should. Use somebody else's work instead. Regexes are perfectly fine for known HTML; in fact, they’re often optimal. It’s only general random HTML that there can be a problem with. Stop parroting. – tchrist Nov 23 '10 at 00:41
  • You responded to something I never said. You're a troll. – Dan Grossman Nov 23 '10 at 01:39
  • You said "don't use regular expressions because it's not regular". That's a really dumb thing to say; in fact it sounds rather troglodytal to me. – tchrist Nov 23 '10 at 01:56