9

I'm looking for some robust, well-documented PHP web crawler scripts - perhaps a PHP port of the Java project Nutch: http://wiki.apache.org/nutch/NutchTutorial

I'm looking for both free and non-free versions.

Jason
    No crawler is going to do the data scraping, that's something you're going to have to write yourself. And also make sure what you're lifting isn't copyrighted. – Richard H Jan 30 '11 at 10:30
  • Possible duplicate of [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Jan 30 '11 at 11:36
  • Additional possible duplicates in http://stackoverflow.com/search?q=web+crawler+php – Gordon Jan 30 '11 at 11:40
  • @Gordon - sorry i dont need help for parsing html. – Jason Jan 30 '11 at 11:59
  • @Jason If you dont need help parsing HTML, then maybe you should clarify what you are after. The crawled HTML will not magically transform itself into the chunks you deem important. It will have to be parsed. Please update your question to point out what you are looking for or at least what you are not looking for. In addition, please go through the linked search results and see if they contain helpful hints. If you still got questions then, point them out in your question as well. In other words: http://stackoverflow.com/questions/ask-advice – Gordon Jan 30 '11 at 12:12
  • possible duplicate of [Scraping and Web crawling framework](http://stackoverflow.com/questions/3885760/scraping-and-web-crawling-framework-php/3886030) – Gordon Jan 30 '11 at 12:55

8 Answers

4

https://github.com/fabpot/Goutte is also a good library, compatible with the PSR-0 standard.
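For reference, here is a minimal sketch of fetching a page and filtering it with Goutte (the URL and the h2 selector are just placeholders):

use Goutte\Client;

// create the browser-like client
$client = new Client();

// request a page; the return value is a Symfony DomCrawler instance
$crawler = $client->request('GET', 'https://example.com/');

// print the text of every <h2> element on the page
$crawler->filter('h2')->each(function ($node) {
    echo $node->text() . "\n";
});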

Ajay Patel
  • Since this merge (https://github.com/FriendsOfPHP/Goutte/pull/397), Goutte does not add anything except `class Client extends HttpBrowser` from Symfony. You can directly use the Symfony HttpBrowser then. – Grzegorz Feb 07 '21 at 09:35
4

Just give Snoopy a try.

Excerpt: "Snoopy is a PHP class that simulates a web browser. It automates the task of retrieving web page content and posting forms, for example."
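A minimal usage sketch, assuming Snoopy.class.php is available locally (the URL is a placeholder):

include 'Snoopy.class.php';

$snoopy = new Snoopy();

// fetch a page; the raw HTML ends up in $snoopy->results
if ($snoopy->fetch('https://example.com/')) {
    echo $snoopy->results;
} else {
    echo 'Fetch failed: ' . $snoopy->error;
}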

Mimikry
    Sorry man, I know it is an old post but people still read this answer, and I downvoted because Snoopy uses Regex to parse HTML and [it's not cool](http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la)... – fernandosavio Mar 17 '16 at 17:42
2

There is a great tutorial here which combines guzzlehttp and symfony/dom-crawler.

In case the link is lost, here is the code you can make use of.

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
use RuntimeException;

// create an HTTP client instance with a base URI (note the trailing slash)
$client = new Client(['base_uri' => 'http://download.cloud.com/releases/']);

// request the page, relative to the base URI
$response = $client->request('GET', '3.0.6/api_3.0.6/TOC_Domain_Admin.html');

// get the status code
$status = $response->getStatusCode();

// this is the response body from the requested page (usually html)
//$result = $response->getBody();

// create a crawler instance from the body HTML code
$crawler = new Crawler((string) $response->getBody());

// apply a css selector filter
$filter = $crawler->filter('div.apismallbullet_box');
$result = array();

if (iterator_count($filter) > 0) {

    // iterate over filter results
    foreach ($filter as $i => $content) {

        // create a crawler instance for this result node
        $crawler = new Crawler($content);

        // extract the values needed
        $topic = $crawler->filter('h5')->text();
        $result[$i] = array(
            'topic' => $topic,
            'className' => trim(str_replace(' ', '', $topic)) . 'Client'
        );
    }
} else {
    throw new RuntimeException('Got empty result processing the dataset!');
}
2

You can use PHP Simple HTML DOM Parser. It's really simple and useful.
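A minimal sketch of typical usage, assuming simple_html_dom.php from the project is included (the URL is a placeholder):

include 'simple_html_dom.php';

// load a page and list every link target on it
$html = file_get_html('https://example.com/');

foreach ($html->find('a') as $link) {
    echo $link->href . "\n";
}

// free the memory used by the parser
$html->clear();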

Eray
    Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Jan 30 '11 at 11:37
2

I had been using Simple HTML DOM for about 3 years before I discovered phpQuery. phpQuery is a lot faster, does not work recursively (so you can actually dump it), and has full support for jQuery selectors and methods.
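A minimal sketch of loading a page with phpQuery and querying it with jQuery-style selectors, assuming the phpQuery.php release file is available (the URL is a placeholder):

require 'phpQuery.php';

// load a remote document into phpQuery
$doc = phpQuery::newDocumentFileHTML('https://example.com/');

// pq() accepts the same selectors jQuery does
foreach (pq('a') as $link) {
    echo pq($link)->attr('href') . "\n";
}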

Kemo
    @Gordon Nope, they are jQuery selectors. From jQuery.com: "Borrowing from CSS 1–3, and then adding its own, jQuery offers a powerful set of tools for matching a set of elements in a document." – Kemo Jan 30 '11 at 11:49
  • Hmm, okay. They extend on CSS selectors. I guess that's a valid distinction then. Sorry. I rarely see people use anything that's not in the set of CSS selectors when they talk about *jQuery* selectors. They make it sound like jQuery invented them. – Gordon Jan 30 '11 at 12:05
    @Gordon yeah, i h8 the "like we invented them" part too :) More info at sizzlejs.com – Kemo Jan 31 '11 at 20:21
1

If you are thinking about a strong base component, then give http://symfony.com/doc/2.0/components/dom_crawler.html a try.

It is amazing and has features like CSS selectors.
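A minimal sketch of the component on its own; it needs symfony/css-selector installed for filter() to accept CSS selectors, and the URL and selector here are placeholders:

use Symfony\Component\DomCrawler\Crawler;

// fetch the HTML with whatever HTTP client you prefer
$html = file_get_contents('https://example.com/');

$crawler = new Crawler($html);

// filter() uses CSS selectors via the CssSelector component
foreach ($crawler->filter('p') as $node) {
    echo $node->textContent . "\n";
}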

Ajay Patel
1

I know this is a bit of an old question. A lot of useful libraries have come out since then.

Give Crawlzone a shot. It is a fast, well-documented, asynchronous internet crawling framework with a lot of great features:

  • Asynchronous crawling with customizable concurrency.
  • Automatically throttles crawling speed based on the load of the website you are crawling.
  • If configured, automatically filters out requests forbidden by the robots.txt exclusion standard.
  • A straightforward middleware system lets you append headers, extract data, filter, or plug in any custom functionality to process the request and response.
  • Rich filtering capabilities.
  • Ability to set crawling depth.
  • Easy to extend the core by hooking into the crawling process using events.
  • Shut down the crawler at any time and start over without losing progress.

Also check out the article I wrote about it:

https://www.codementor.io/zstate/this-is-how-i-crawl-n98s6myxm

zstate
-3

Nobody mentioned wget as a good starting point?

wget -r --level=10 -nd http://www.mydomain.com/

More @ http://www.erichynds.com/ubuntulinux/automatically-crawl-a-website-looking-for-errors/
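If you want to drive that from PHP rather than the shell, a minimal sketch (the /tmp/crawl target directory and the domain are placeholders, and wget is assumed to be on the PATH):

// mirror the site into a local directory, then iterate over the downloaded files
$cmd = 'wget -r --level=10 -nd -P /tmp/crawl http://www.mydomain.com/ 2>&1';
exec($cmd, $output, $exitCode);

if ($exitCode === 0) {
    foreach (glob('/tmp/crawl/*') as $file) {
        // each $file is one downloaded page, ready for parsing
        echo $file . "\n";
    }
}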

dsomnus