-2

I've seen this question, which is very nice and informative. However, it doesn't deal with a rather common scenario.

Say I need to scrape a number of websites (or even several pages on the same domain), but the author of a site didn't take much care with the markup, and the pages contain seriously malformed HTML "that kinda works". I need to extract information from those pages.

How do I do it in this case? Ideally without going í͞ń̡͢͡s̶̢̛á̢̕͘ń̵͢҉e̶̸̢̛.

Is it possible? Do I have to resort to RegExp?

bouteillebleu
Madara's Ghost
  • Is the problem that you would like to use PHP DOM manipulation tools on the parsed HTML, but with the HTML being malformed, you are having difficulty doing this? – Mike Brant Jul 18 '12 at 17:23
  • Personally, I have always wondered how the browser makers do it and live to tell the tale. – BoltClock Jul 18 '12 at 17:23
  • Perhaps try `strip_tags` then use regex to find the remaining tags? – Kermit Jul 18 '12 at 17:23

3 Answers

4

You need a DOM parser. PHP has one, and there are some alternatives (and more; just google for them). You can even run the "garbled HTML" through HTML Purifier if you want.
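As a rough illustration of the point above, here is a minimal sketch that feeds deliberately broken markup to PHP's bundled DOM extension and still pulls out a link. The sample HTML, the XPath expression, and the `$links` array are assumptions made up for this example, not anything taken from a real site.

```php
<?php
// Deliberately malformed input: unclosed <p> and <b>, unquoted attributes,
// no <html> or <body> wrapper.
$html = '<div class=item><p>First <b>bold<p>Second <a href=/page?id=1>link</div>';

// Collect libxml's complaints internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html); // the parser repairs what it can and builds a tree

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query the repaired tree as if the markup had been valid.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

print_r($links);
```

The resulting tree may not match what the page author intended, so it is worth dumping `$doc->saveHTML()` once to see how the parser interpreted the page before settling on an XPath query.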

RobIII
  • You don't say! But it's broken HTML we're talking about. This isn't valid HTML. – Madara's Ghost Jul 18 '12 at 17:24
  • @Truth So? "Broken" (malformed, invalid, whatever...) HTML is just as parseable... Maybe the DOM tree isn't as the author intended but you can perfectly fine access all the required nodes/attributes you need. – RobIII Jul 18 '12 at 17:24
  • RobIII is right. Have a look at the `loadHTML` method http://www.php.net/manual/en/domdocument.loadhtml.php – Dio F Jul 18 '12 at 17:29
0

I don't know how you are scraping the site, but working with RegExp lets you add many conditions to the scraping code. This may take time, depending on the number of footprints and your RegExp skills.
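To make the "footprint" idea concrete, here is a small sketch that extracts data with a pattern anchored on surrounding text rather than on the tag structure. The sample markup, the `Price:` footprint, and the pattern itself are hypothetical; a real page would need its own conditions, and a pattern like this breaks as soon as the wording or layout changes.

```php
<?php
// Hypothetical malformed fragment: unclosed <li> and <b>, inconsistent quoting.
$html = '<ul><li class=price>Price: <b>12.50 EUR<li class="price">Price: <b>7 EUR</ul>';

// Footprint: the literal "Price:", any number of tags, then an amount
// and a three-letter currency code.
$pattern = '/Price:\s*(?:<[^>]*>\s*)*(\d+(?:\.\d+)?)\s*([A-Z]{3})/';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$prices = [];
foreach ($matches as $m) {
    $prices[] = ['amount' => (float) $m[1], 'currency' => $m[2]];
}

print_r($prices);
```

This extracts values without parsing the document, which is why it tolerates the missing closing tags; it also means nothing checks that the matches come from the elements you expect.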

You could also run the site's HTML through Tidy, but IMO that can produce strange results as well.

Baptiste Placé
  • Have a look at [my all-time favourite StackOverflow answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) on using Regexes to "parse" HTML. And then read [Parsing Html The Cthulhu Way](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) for some nuance. For me, the biggest reason **NOT** to use a regex is because a regex in itself [quickly becomes "unreadable"](http://ex-parrot.com/~pdw/Mail-RFC822-Address.html). – RobIII Jul 18 '12 at 17:31
  • Well, I never said you could parse HTML with RegExp; that would be nonsense. Obviously, scraping (i.e. extracting data) can be done very well with RegExp. Thanks for the reading though! – Baptiste Placé Jul 19 '12 at 08:08
  • To extract data (and do it *correctly*) you would have to parse the document and use the DOM instead of relying on a big mess of a string. Having said that; yes, indeed, you *could* use a regex. That's why I added Jeff Atwood's article ;-) – RobIII Jul 19 '12 at 09:07
  • I'll get into DOM parsing someday maybe; true, it's useful and easy, and you don't have to mess with strings (until you have to parse the text nodes). But it's not like I'm parsing every morning, and I have never failed to extract data with RegExp. Moreover, I really enjoy training myself with RegExp; it's like a Swiss Army knife for many tasks. – Baptiste Placé Jul 19 '12 at 17:55
0

Does it have to be PHP? Python has a wonderful library called Beautiful Soup ("You didn't write that awful page. You're just trying to get some data out of it"). From my experience I'd recommend it so much that I'd say if you have the option, write a quick Python script to parse your nodes into a clean file that your PHP can pick up.

(I know PHP is in the title and this doesn't directly answer your question. Apologies if you don't have the option of (or dislike) Python; I just wanted to present a good alternative.)

anotherdave