HTML Scraping in PHP

Question

I've been doing some HTML scraping in PHP using regular expressions. This works, but the result is finicky and fragile. Has anyone used any packages that provide a more robust solution? A config-driven solution would be ideal, but I'm not picky.

Have a look at [this](http://stackoverflow.com/questions/26947/how-to-implement-a-web-scraper-in-php#27109) thread; the question goes in a similar direction. – crono, Aug 29 '08 at 08:16

score 28 · Accepted Answer · answered Aug 29 '08 at 07:55

I would recommend PHP Simple HTML DOM Parser once you have fetched the HTML from the page. It supports invalid HTML and provides a very easy way to handle HTML elements.

Espo

Suggested third-party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of string parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Oct 10 '11 at 15:08
Can you give me an example of how to click on any link on a given page? – sagar junnarkar Nov 12 '13 at 07:51

score 5 · Answer 2 · answered Jul 31 '09 at 19:43

I would also recommend 'Simple HTML DOM Parser.' It is a good option, particularly if you're familiar with jQuery or JavaScript selectors; you will find yourself right at home.

I have even blogged about it in the past.

score 5 · Answer 3 · answered Aug 29 '08 at 08:01

If the page you're scraping is valid X(HT)ML, then any of PHP's built-in XML parsers will do.
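As a rough illustration of the built-in route, here is a minimal sketch using SimpleXML and XPath. It assumes the input really is well-formed; the sample markup and the `extractLinks()` helper name are made up for this example and are not from the original answer.

```php
<?php
// Sample well-formed XHTML; in practice this would be the fetched page body.
$xhtml = <<<XML
<html>
  <body>
    <h1>Example page</h1>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
  </body>
</html>
XML;

/**
 * Collect href => link text pairs from well-formed X(HT)ML.
 * extractLinks() is a hypothetical helper name, not part of PHP.
 */
function extractLinks(string $xhtml): array
{
    // Collect parser errors quietly instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xhtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // simplexml_load_string() returns false when the markup is not well-formed.
    if ($doc === false) {
        throw new RuntimeException('Input is not well-formed XML');
    }

    $links = [];
    foreach ($doc->xpath('//a[@href]') as $a) {
        $links[(string) $a['href']] = trim((string) $a);
    }
    return $links;
}

print_r(extractLinks($xhtml));
```

Note that real XHTML pages usually declare a default namespace (`xmlns="http://www.w3.org/1999/xhtml"`). In that case the plain `//a` query will not match, and the namespace would need to be registered first with `registerXPathNamespace()`.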

I haven't had much success with PHP libraries for scraping. If you're adventurous though, you can try simplehtmldom. I'd recommend Hpricot for Ruby or Beautiful Soup for Python, which are both excellent parsers for HTML.

John Douthat

If you're going to be parsing particularly sloppy HTML, make sure you don't use BeautifulSoup 3.1.x (use 3.0.x). 3.1.x uses htmllib as its parser, which is much less forgiving than 3.0.x's use of sgmllib. – Tom Mar 18 '09 at 01:33

score 5 · Answer 4 · answered Aug 29 '08 at 09:40

I had some fun working with htmlSQL, which is not so much a high-end solution, but really simple to work with.

edited Jan 24 '14 at 12:45

BlaM

late comment but I just found your answer via google.. i like it! :) – Ben Aug 17 '10 at 06:53
Does it work for you even now? It does not seem to work for me... – Dinesh Jan 23 '14 at 16:31

score 3 · Answer 5 · answered Dec 27 '08 at 09:11

For HTML scraping in PHP, I'd recommend cURL + regexp or cURL + a DOM parser, though I personally use cURL + regexp. If you're comfortable with regular expressions, that approach can actually be more accurate at times.
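To make the two combinations concrete, here is a minimal sketch that fetches a page with cURL and then pulls out `<h2>` headings once with PHP's DOM extension and once with a regexp. The helper names (`fetchHtml()`, `extractHeadings()`, `extractHeadingsRegex()`), the sample markup, and the choice of `<h2>` as the target are all assumptions made for illustration.

```php
<?php
/**
 * Fetch a page body with cURL. fetchHtml() is a hypothetical helper name.
 */
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

/**
 * DOM variant: extract the text of every <h2> element.
 */
function extractHeadings(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $headings = [];
    foreach ((new DOMXPath($doc))->query('//h2') as $node) {
        $headings[] = trim($node->textContent);
    }
    return $headings;
}

/**
 * Regexp variant: same goal, but tied to how the tags happen to be written.
 */
function extractHeadingsRegex(string $html): array
{
    preg_match_all('~<h2[^>]*>(.*?)</h2>~is', $html, $matches);
    return array_map(fn($inner) => trim(strip_tags($inner)), $matches[1]);
}

// Inline sample so the sketch runs without network access.
// For a live page: $html = fetchHtml('https://example.com/');
$sample = '<html><body><h2>First</h2><p>text<h2 class="x">Second <em>item</em></h2></body></html>';

print_r(extractHeadings($sample));
print_r(extractHeadingsRegex($sample));
```

Both variants agree on this small sample. The regexp version depends on the exact shape of the markup, which is where the fragility described in the question tends to come from.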

datasn.io

score 2 · Answer 6 · answered Aug 29 '08 at 08:08

I've had very good results with the Simple HTML DOM Parser mentioned above as well. There's also the tidy extension for PHP, which works really well too.

Jan Gorman

score 2 · Answer 7 · answered Dec 02 '10 at 06:51

I had to use curl on my host 1and1.

http://www.quickscrape.com/ is what I came up with using the Simple DOM class!

Steve
