64

I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file.

Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.

KJ Saxena

15 Answers

89

Meh. Don't parse HTML with regexes.

Here's a DOM version inspired by Tatu's:

<?php
function crawl_page($url, $depth = 5)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= dirname($parts['path'], 1).$path;
            }
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:",$url,PHP_EOL,"CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL;
}
crawl_page("http://hobodave.com", 2);

Edit: I fixed some bugs from Tatu's version (works with relative URLs now).

Edit: I added a new bit of functionality that prevents it from following the same URL twice.

Edit: Now echoing output to STDOUT, so you can redirect it to whatever file you want.

Edit: Fixed a bug pointed out by George in his answer. Relative URLs no longer append to the end of the URL path; they overwrite it. Thanks to George for this. Note that George's answer doesn't account for any of: https, user, pass, or port. If you have the http PECL extension loaded, this is quite simply done using http_build_url. Otherwise, I have to glue the URL together manually using parse_url. Thanks again, George.
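
For reference, the resolution step described above can be pulled out into a standalone helper. This is only a sketch mirroring the logic in the code block; resolve_href is an illustrative name and not part of the original answer, and the http_build_url branch assumes the pecl_http extension that provides it is installed:

<?php
// Sketch of the relative-to-absolute step used above (illustrative helper name).
function resolve_href($base, $href)
{
    if (0 === strpos($href, 'http')) {
        return $href; // already absolute
    }
    $path = '/' . ltrim($href, '/');
    if (extension_loaded('http') && function_exists('http_build_url')) {
        // pecl_http can merge the new path into the base URL for us
        return http_build_url($base, array('path' => $path));
    }
    // Fallback: glue scheme, credentials, host, port and path back together
    $parts = parse_url($base);
    $url = $parts['scheme'] . '://';
    if (isset($parts['user'], $parts['pass'])) {
        $url .= $parts['user'] . ':' . $parts['pass'] . '@';
    }
    $url .= $parts['host'];
    if (isset($parts['port'])) {
        $url .= ':' . $parts['port'];
    }
    $dir = isset($parts['path']) ? dirname($parts['path']) : '';
    return $url . rtrim($dir, '/') . $path;
}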

hobodave
  • Can I recommend using curl to fetch the page, then manipulate/traverse using the DOM library. If you're doing this frequently, curl is a much better option imo. – Ben Shelock Mar 18 '10 at 16:46
  • I get the SSL error: DOMDocument::loadHTMLFile(): SSL operation failed with code 1. DOMDocument::loadHTMLFile(): Failed to enable crypto in /var/www/7Cups.com/parser.php on line 10. failed to open stream: operation failed. DOMDocument::loadHTMLFile(): I/O warning : failed to load external entity – Zoka May 25 '18 at 15:17
15

Here is my implementation, based on the above example/answer.

  1. It is class based
  2. Uses cURL
  3. Supports HTTP Auth
  4. Skips URLs that do not belong to the base domain
  5. Returns the HTTP response code for each page
  6. Returns the elapsed time for each page

CRAWL CLASS:

class crawler
{
    protected $_url;
    protected $_depth;
    protected $_host;
    protected $_useHttpAuth = false;
    protected $_user;
    protected $_pass;
    protected $_seen = array();
    protected $_filter = array();

    public function __construct($url, $depth = 5)
    {
        $this->_url = $url;
        $this->_depth = $depth;
        $parse = parse_url($url);
        $this->_host = $parse['host'];
    }

    protected function _processAnchors($content, $url, $depth)
    {
        $dom = new DOMDocument('1.0');
        @$dom->loadHTML($content);
        $anchors = $dom->getElementsByTagName('a');

        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            // Crawl only links that belong to the start domain
            $this->crawl_page($href, $depth - 1);
        }
    }

    protected function _getContent($url)
    {
        $handle = curl_init($url);
        if ($this->_useHttpAuth) {
            curl_setopt($handle, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
            curl_setopt($handle, CURLOPT_USERPWD, $this->_user . ":" . $this->_pass);
        }
        // follows 302 redirects; creates problems with authentication
//        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);
        // return the content
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

        /* Get the HTML or whatever is linked in $url. */
        $response = curl_exec($handle);
        // response total time
        $time = curl_getinfo($handle, CURLINFO_TOTAL_TIME);
        /* Check for 404 (file not found). */
        $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);

        curl_close($handle);
        return array($response, $httpCode, $time);
    }

    protected function _printResult($url, $depth, $httpcode, $time)
    {
        ob_end_flush();
        $currentDepth = $this->_depth - $depth;
        $count = count($this->_seen);
        echo "N::$count,CODE::$httpcode,TIME::$time,DEPTH::$currentDepth URL::$url <br>";
        ob_start();
        flush();
    }

    protected function isValid($url, $depth)
    {
        if (strpos($url, $this->_host) === false
            || $depth === 0
            || isset($this->_seen[$url])
        ) {
            return false;
        }
        foreach ($this->_filter as $excludePath) {
            if (strpos($url, $excludePath) !== false) {
                return false;
            }
        }
        return true;
    }

    public function crawl_page($url, $depth)
    {
        if (!$this->isValid($url, $depth)) {
            return;
        }
        // add to the seen URL
        $this->_seen[$url] = true;
        // get Content and Return Code
        list($content, $httpcode, $time) = $this->_getContent($url);
        // print Result for current Page
        $this->_printResult($url, $depth, $httpcode, $time);
        // process subPages
        $this->_processAnchors($content, $url, $depth);
    }

    public function setHttpAuth($user, $pass)
    {
        $this->_useHttpAuth = true;
        $this->_user = $user;
        $this->_pass = $pass;
    }

    public function addFilterPath($path)
    {
        $this->_filter[] = $path;
    }

    public function run()
    {
        $this->crawl_page($this->_url, $this->_depth);
    }
}

USAGE:

// USAGE
$startURL = 'http://YOUR_URL/';
$depth = 6;
$username = 'YOURUSER';
$password = 'YOURPASS';
$crawler = new crawler($startURL, $depth);
$crawler->setHttpAuth($username, $password);
// Exclude paths containing the following string from being processed
$crawler->addFilterPath('customer/account/login/referer');
$crawler->run();
WonderLand
11

Check out PHP Crawler

http://sourceforge.net/projects/php-crawler/

See if it helps.

GeekTantra
9

In its simplest form:

function crawl_page($url, $depth = 5) {
    if($depth > 0) {
        $html = file_get_contents($url);

        preg_match_all('~<a.*?href="(.*?)".*?>~', $html, $matches);

        foreach($matches[1] as $newurl) {
            crawl_page($newurl, $depth - 1);
        }

        file_put_contents('results.txt', $newurl."\n\n".$html."\n\n", FILE_APPEND);
    }
}

crawl_page('http://www.domain.com/index.php', 5);

That function will get the contents of a page, then crawl all found links and save the contents to 'results.txt'. The function accepts a second parameter, depth, which defines how deep links should be followed. Pass 1 there if you only want to parse links from the given page.

Tatu Ulmanen
  • -1: Meh to using regexes. Doesn't work with relative urls. Also uses the wrong URL in the file_put_contents(). – hobodave Feb 22 '10 at 19:01
  • What is this supposed to do? I crawled my website and it gave me a bunch of crap. It looks like it gets content from somewhere else but not from my site. – erdomester May 09 '15 at 13:11
5

With a few small changes to hobodave's code, here is a code snippet you can use to crawl pages. It needs the curl extension to be enabled on your server.

<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5){
    static $seen = array();
    if (($depth == 0) or isset($seen[$url])) {
        return;
    }
    $seen[$url] = true;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);

    if ($result) {
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER);
        foreach ($matches as $match) {
            $href = $match[1];
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }
    echo "Crawled {$url}";
}
crawl_page("http://www.sitename.com/", 3);
?>

I have explained this script in more detail in this crawler script tutorial.

Team Webgalli
5

Why use PHP for this, when you can use wget, e.g.

wget -r -l 1 http://www.example.com

For how to parse the contents, see Best Methods to parse HTML and use the search function for examples. How to parse HTML has been answered multiple times before.
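
As a rough sketch of the parsing step (not part of Gordon's answer): once wget has mirrored the pages, each saved file can be loaded into DOMDocument and queried with XPath for the fields you need. The file name and query below are placeholders:

<?php
// Sketch: parse one locally saved page and pull out specific fields.
// 'www.example.com/index.html' and the XPath query are placeholders.
$dom = new DOMDocument();
libxml_use_internal_errors(true);            // tolerate real-world, sloppy HTML
$dom->loadHTMLFile('www.example.com/index.html');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//h1') as $node) {   // replace with the fields you need
    echo trim($node->textContent), PHP_EOL;
}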

Gordon
  • Some specific fields have to be parsed and taken out. I will need to write code. – KJ Saxena Feb 22 '10 at 18:35
  • @Crimson that's a requirement you should note in the question then ;) – Gordon Feb 22 '10 at 18:38
  • @Gordon: "How do I make a simple crawler in PHP?" :-P – hobodave Feb 22 '10 at 18:53
  • @hobodave I meant the part about *having to parse and take out specific fields* :P If it wasn't for this, using wget is the simplest thing I could imagine for this purpose. – Gordon Feb 22 '10 at 19:15
  • This may be the best way to achieve crawling [emphasis on "may"], but in no way does it answer the stated question. – Lightness Races in Orbit May 04 '11 at 00:20
  • @Tomalak see my comment above. The question is vague at best. Also, how to parse HTML has been answered dozens of times before and it makes no sense to reiterate the obvious. I've updated my answer with a link to additional information. This has to suffice. – Gordon May 04 '11 at 06:50
  • "This has been answered before" is not an answer, either. – Lightness Races in Orbit May 04 '11 at 08:44
  • @Tomalak if you think my answer is not an answer, flag it as such. If the OP asks for how to parse HTML, then it's a duplicate and shouldnt be answered at all but closevoted. I've answered how to **crawl** a page, which is what is asked in the title. I wont answer how to parse again. – Gordon May 04 '11 at 09:11
  • @Gordon: You did not answer how to crawl a page in PHP. – Lightness Races in Orbit May 04 '11 at 09:43
  • @Tomalak seriously, first you comment (quoting) "This *may* be the best way to achieve crawling" then you tell me "you did not answer how to crawl a page in php". I'd appreciate if you could make up your mind **before** getting on other people's nerves. Is that possible? kthxbai. – Gordon May 04 '11 at 09:50
  • @Gordon: Can you not read? This may indeed be the best way to achieve crawling. It is not a way to achieve crawling _in PHP_. Sorry to be rude, but I feel like I must be missing something obvious here...?? Read: "this may be the best way to achieve crawling" in the wider, general case, "but in no way does it answer the stated question" which is about PHP. – Lightness Races in Orbit May 04 '11 at 10:03
  • @Tomalak You might indeed be missing something obvious here. Yes, I did not answer how to crawl a page *with PHP*. If you look at my answer, you'll see I actually state that as the first thing. I gave an alternative which I deem more practical, which is something I'd expect someone that claims to *strike a balance between "answering the actual question" and "giving the OP the solution that he actually needs"* to understand. I also gave two links to information on how to parse HTML for data. If that's not good enough for you, keep your dv and/or flag it. I dont care. – Gordon May 04 '11 at 10:15
3

Hobodave, you were very close. The only thing I have changed is within the if statement that checks whether the href attribute of the found anchor tag begins with 'http'. Instead of simply prepending the $url variable, which contains the page that was passed in, you must first strip it down to the host, which can be done using PHP's parse_url function.

<?php
function crawl_page($url, $depth = 5)
{
  static $seen = array();
  if (isset($seen[$url]) || $depth === 0) {
    return;
  }

  $seen[$url] = true;

  $dom = new DOMDocument('1.0');
  @$dom->loadHTMLFile($url);

  $anchors = $dom->getElementsByTagName('a');
  foreach ($anchors as $element) {
    $href = $element->getAttribute('href');
    if (0 !== strpos($href, 'http')) {
       /* this is where I changed hobodave's code */
        $host = "http://".parse_url($url,PHP_URL_HOST);
        $href = $host. '/' . ltrim($href, '/');
    }
    crawl_page($href, $depth - 1);
  }

  echo "New Page:<br /> ";
  echo "URL:",$url,PHP_EOL,"<br />","CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL,"  <br /><br />";
}

crawl_page("http://hobodave.com/", 5);
?>
George
  • Thanks for pointing out my bug George! Your solution neglects to handle https, user, pass, and port. I've updated my answer to address the bug you found, as well as the bugs introduced by yours. Thanks again! – hobodave May 04 '11 at 03:16
2

As mentioned, there are crawler frameworks out there ready for customizing, but if what you're doing is as simple as you mentioned, you can make it from scratch pretty easily.

Scraping the links: http://www.phpro.org/examples/Get-Links-With-DOM.html

Dumping results to a file: http://www.tizag.com/phpT/filewrite.php
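
A minimal sketch combining those two pieces (DOM link extraction plus file_put_contents); the URL and output file name below are placeholders:

<?php
// Sketch: collect every href on one page and append the list to a local file.
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.example.com/');   // placeholder URL

$links = array();
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}

// placeholder output file
file_put_contents('links.txt', implode(PHP_EOL, $links) . PHP_EOL, FILE_APPEND);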

Jens Roland
1

I used @hobodave's code, with this little tweak to prevent re-crawling all fragment variants of the same URL:

<?php
function crawl_page($url, $depth = 5)
{
  $parts = parse_url($url);
  if(array_key_exists('fragment', $parts)){
    unset($parts['fragment']);
    $url = http_build_url($parts);
  }

  static $seen = array();
  ...

Then you can also omit the $parts = parse_url($url); line within the foreach loop.

pasqal
1

You can try this; it may help you.

$search_string = 'american golf News: Fowler beats stellar field in Abu Dhabi';
$html = file_get_contents('http://www.example.com/'); // URL of the site to search
$dom = new DOMDocument;
$titalDom = new DOMDocument;
$tmpTitalDom = new DOMDocument;
libxml_use_internal_errors(true);
@$dom->loadHTML($html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$videos = $xpath->query('//div[@class="primary-content"]');
foreach ($videos as $key => $video) {
    $newdomaindom = new DOMDocument;
    $newnode = $newdomaindom->importNode($video, true);
    $newdomaindom->appendChild($newnode);
    @$titalDom->loadHTML($newdomaindom->saveHTML());
    $xpath1 = new DOMXPath($titalDom);
    $titles = $xpath1->query('//div[@class="listingcontainer"]/div[@class="list"]');
    // keep the node whose title matches the search string
    if (0 === strcmp(preg_replace('!\s+!', ' ', $titles->item(0)->nodeValue), $search_string)) {
        $tmpNode = $tmpTitalDom->importNode($video, true);
        $tmpTitalDom->appendChild($tmpNode);
        break;
    }
}
echo $tmpTitalDom->saveHTML();
Niraj patel
1

Thank you @hobodave.

However, I found two weaknesses in your code. Your parsing of the original URL to get the "host" segment stops at the first single slash. This presumes that all relative links start in the root directory, which is only true sometimes.

original url   :  http://example.com/game/index.html
href in <a> tag:  highscore.html
author's intent:  http://example.com/game/highscore.html  <-200->
crawler result :  http://example.com/highscore.html       <-404->

Fix this by breaking at the last single slash, not the first (a sketch of this follows below).

A second, unrelated bug is that $depth does not really track recursion depth; it tracks the breadth of the first level of recursion.

If I believed this page were in active use I might debug this second issue, but I suspect the text I am writing now will never be read by anyone, human or robot, since this issue is six years old and I do not even have enough reputation to notify +hobodave directly about these defects by commenting on his code. Thanks anyway hobodave.
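
A minimal sketch of the first fix described above, resolving a relative href against the directory of the current page rather than the site root. The function name is illustrative and not from any of the answers:

<?php
// Sketch: resolve a relative href against the directory of the current URL,
// so 'highscore.html' under /game/index.html becomes /game/highscore.html.
function resolve_relative($currentUrl, $href)
{
    if (0 === strpos($href, 'http')) {
        return $href;                              // already absolute
    }
    $parts = parse_url($currentUrl);
    $base  = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $base .= ':' . $parts['port'];
    }
    if (0 === strpos($href, '/')) {
        return $base . $href;                      // root-relative link
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1); // keep everything up to the last slash
    return $base . $dir . $href;
}

// resolve_relative('http://example.com/game/index.html', 'highscore.html')
// => 'http://example.com/game/highscore.html'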

0

I came up with the following spider code. I adapted it a bit from PHP - Is there a safe way to perform deep recursion? It seems fairly fast...

<?php
function spider($base_url, $search_urls = array()) {
    $queue[]    = $base_url;
    $done       = array();
    $found_urls = array();
    while ($queue) {
        $link = array_shift($queue);
        if (!is_array($link)) {
            $done[] = $link;
            foreach ($search_urls as $s) {
                if (strstr($link, $s)) { $found_urls[] = $link; }
            }
            if (empty($search_urls)) { $found_urls[] = $link; }
            if (!empty($link)) {
                echo 'LINK:::'.$link;
                $content = file_get_contents($link);
                //echo 'P:::'.$content;
                preg_match_all('~<a.*?href="(.*?)".*?>~', $content, $sublink);
                if (!in_array($sublink, $done) && !in_array($sublink, $queue)) {
                    $queue[] = $sublink;
                }
            }
        } else {
            $result = array();
            $return = array();
            // flatten multi dimensional array of URLs to one dimensional.
            while (count($link)) {
                $value = array_shift($link);
                if (is_array($value)) {
                    foreach ($value as $sub) {
                        $link[] = $sub;
                    }
                } else {
                    $return[] = $value;
                }
            }
            // now loop over one dimensional array.
            foreach ($return as $link) {
                // echo 'L::'.$link;
                // url may be in form <a href.. so extract what's in the href bit.
                preg_match_all('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $link, $result);
                if (isset($result['href'][0])) { $link = $result['href'][0]; }
                // add the new URL to the queue.
                if ((!strstr($link, "http")) && (!in_array($base_url.$link, $done)) && (!in_array($base_url.$link, $queue))) {
                    $queue[] = $base_url.$link;
                } else {
                    if ((strstr($link, $base_url)) && (!in_array($base_url.$link, $done)) && (!in_array($base_url.$link, $queue))) {
                        $queue[] = $link;
                    }
                }
            }
        }
    }

    return $found_urls;
}


$base_url    = 'https://www.houseofcheese.co.uk/';
$search_urls = array($base_url.'acatalog/');
$done = spider($base_url, $search_urls);

//
// RESULT
//
echo '<br /><br />';
echo 'RESULT:::';
foreach ($done as $r) {
    echo 'URL:::'.$r.'<br />';
}
Ian
0

It's worth remembering that when crawling external links (I do appreciate the OP relates to a user's own page) you should be aware of robots.txt. I have found the following, which will hopefully help: http://www.the-art-of-web.com/php/parse-robots/.
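
As a rough, simplified sketch of that idea (a real parser, like the one in the linked article, should also handle User-agent sections, Allow rules and wildcards):

<?php
// Very simplified robots.txt check: fetch the file and see whether any
// "Disallow:" prefix matches the path we want to crawl.
function robots_allows($baseUrl, $path)
{
    $robots = @file_get_contents(rtrim($baseUrl, '/') . '/robots.txt');
    if ($robots === false) {
        return true;                      // no robots.txt -> assume allowed
    }
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', $line, $m)
            && 0 === strpos($path, $m[1])) {
            return false;
        }
    }
    return true;
}

var_dump(robots_allows('http://www.example.com', '/private/page.html'));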

Antony
0

I created a small class to grab data from the provided url, then extract html elements of your choice. The class makes use of CURL and DOMDocument.

php class:

class crawler {


   public static $timeout = 2;
   public static $agent   = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';


   public static function http_request($url) {
      $ch = curl_init();
      curl_setopt($ch, CURLOPT_URL,            $url);
      curl_setopt($ch, CURLOPT_USERAGENT,      self::$agent);
      curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, self::$timeout);
      curl_setopt($ch, CURLOPT_TIMEOUT,        self::$timeout);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      $response = curl_exec($ch);
      curl_close($ch);
      return $response;
   }


   public static function strip_whitespace($data) {
      $data = preg_replace('/\s+/', ' ', $data);
      return trim($data);
   }


   public static function extract_elements($tag, $data) {
      $response = array();
      $dom      = new DOMDocument;
      @$dom->loadHTML($data);
      foreach ( $dom->getElementsByTagName($tag) as $index => $element ) {
         $response[$index]['text'] = self::strip_whitespace($element->nodeValue);
         foreach ( $element->attributes as $attribute ) {
            $response[$index]['attributes'][strtolower($attribute->nodeName)] = self::strip_whitespace($attribute->nodeValue);
         }
      }
      return $response;
   }


}

example usage:

$data  = crawler::http_request('https://stackoverflow.com/questions/2313107/how-do-i-make-a-simple-crawler-in-php');
$links = crawler::extract_elements('a', $data);
if ( count($links) > 0 ) {
   file_put_contents('links.json', json_encode($links, JSON_PRETTY_PRINT));
}

example response:

[
    {
        "text": "Stack Overflow",
        "attributes": {
            "href": "https:\/\/stackoverflow.com",
            "class": "-logo js-gps-track",
            "data-gps-track": "top_nav.click({is_current:false, location:2, destination:8})"
        }
    },
    {
        "text": "Questions",
        "attributes": {
            "id": "nav-questions",
            "href": "\/questions",
            "class": "-link js-gps-track",
            "data-gps-track": "top_nav.click({is_current:true, location:2, destination:1})"
        }
    },
    {
        "text": "Developer Jobs",
        "attributes": {
            "id": "nav-jobs",
            "href": "\/jobs?med=site-ui&ref=jobs-tab",
            "class": "-link js-gps-track",
            "data-gps-track": "top_nav.click({is_current:false, location:2, destination:6})"
        }
    }
]
TURTLE
0

It's an old question. A lot of good things have happened since then. Here are my two cents on this topic:

  1. To accurately track the visited pages, you have to normalize the URI first. The normalization algorithm includes multiple steps (a partial sketch follows after this list):

    • Sort query parameters. For example, the following URIs are equivalent after normalization: GET http://www.example.com/query?id=111&cat=222 and GET http://www.example.com/query?cat=222&id=111
    • Convert the empty path. Example: http://example.org → http://example.org/

    • Capitalize percent encoding. All letters within a percent-encoding triplet (e.g., "%3A") are case-insensitive. Example: http://example.org/a%c2%B1b → http://example.org/a%C2%B1b

    • Remove unnecessary dot-segments. Example: http://example.org/../a/b/../c/./d.html → http://example.org/a/c/d.html

    • Possibly some other normalization rules

  2. The <a> tag is not the only one with an href attribute; the <area> tag has it too: https://html.com/tags/area/. If you don't want to miss anything, you have to scrape the <area> tag as well.

  3. Track crawling progress. If the website is small, it is not a problem. Otherwise, it can be very frustrating if you crawl half of the site and it fails. Consider using a database or a filesystem to store the progress.

  4. Be kind to the site owners. If you are ever going to use your crawler outside of your website, you have to use delays. Without delays, the script is too fast and might significantly slow down some small sites. From a sysadmin's perspective, it looks like a DoS attack. A static delay between the requests will do the trick.
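
A minimal sketch of the normalization step from point 1, covering query-parameter sorting, percent-encoding case, and the empty path; dot-segment removal and the remaining rules are left out, and the function name is illustrative:

<?php
// Sketch: partial URI normalization - lowercase scheme/host, sort query
// parameters, uppercase percent-encoded triplets, default an empty path to "/".
function normalize_url($url)
{
    $parts = parse_url($url);

    $normalized = strtolower($parts['scheme']) . '://' . strtolower($parts['host']);
    if (isset($parts['port'])) {
        $normalized .= ':' . $parts['port'];
    }
    $normalized .= isset($parts['path']) ? $parts['path'] : '/';

    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
        ksort($params);                               // sort query parameters
        $normalized .= '?' . http_build_query($params);
    }

    // Uppercase the hex digits of percent-encoded triplets, e.g. %c2 -> %C2
    return preg_replace_callback('/%[0-9a-f]{2}/i', function ($m) {
        return strtoupper($m[0]);
    }, $normalized);
}

echo normalize_url('http://www.example.com/query?id=111&cat=222'), PHP_EOL;
echo normalize_url('http://example.org/a%c2%b1b'), PHP_EOL;
// A static delay such as sleep(1) between requests covers point 4.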

If you don't want to deal with that, try Crawlzone and let me know your feedback. Also, check out the article I wrote a while back https://www.codementor.io/zstate/this-is-how-i-crawl-n98s6myxm

zstate