
I have a user ID and a password to log in to a web site via my program. Once logged in, the URL will change from http://localhost/Test/loginpage.html to http://www.4wtech.com/csp/web/Employee/Login.csp.

How can I "screen scrape" the data from the second URL using PHP?

GEOCHET

5 Answers


You would use cURL. cURL can log in to the page, then access the page it redirects to and download its entire contents.

Check out the PHP manual for cURL, as well as this tutorial: How to screen-scrape with PHP and Curl.
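Below is a minimal sketch of that flow. It assumes the form posts to the login URL and uses username/password as field names; inspect the actual login form's HTML for the real field names and target URL.

// Log in by POSTing credentials, keep the session cookie in a cookie jar,
// then fetch the protected page with the same handle. The field names
// "username" and "password" are assumptions, not taken from the actual site.
$cookieFile = tempnam(sys_get_temp_dir(), 'cookies');

$ch = curl_init('http://www.4wtech.com/csp/web/Employee/Login.csp');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(array(
        'username' => 'myuser',  // assumed field name
        'password' => 'mypass',  // assumed field name
    )),
    CURLOPT_COOKIEJAR      => $cookieFile,  // write session cookies here
    CURLOPT_COOKIEFILE     => $cookieFile,  // send them back on later requests
    CURLOPT_FOLLOWLOCATION => true,         // follow the post-login redirect
    CURLOPT_RETURNTRANSFER => true,         // return responses as strings
));
curl_exec($ch);

// Reuse the handle (and therefore its cookies) to fetch the logged-in page.
curl_setopt($ch, CURLOPT_HTTPGET, true);
curl_setopt($ch, CURLOPT_URL, 'http://www.4wtech.com/csp/web/Employee/Login.csp');
$html = curl_exec($ch);
curl_close($ch);

echo $html;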

Syntax

I'm not quite sure if I understood your question. But if you really do intend to do screen scraping in PHP, I recommend the simple_html_dom parser. It's a small library that lets you use CSS selectors in PHP. To me, screen scraping has never been easier in PHP. Here's an example:

// Requires the simple_html_dom library
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://stackoverflow.com/');

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
cg.
    What part of his question did you not understand? I think it is pretty clear. – GEOCHET Feb 10 '09 at 14:03
  • I mean no offense. When I read the first version of the question, I was not sure if Sakthivel actually meant screen scraping or URL rewriting. – cg. Feb 10 '09 at 14:12

Important!

Note that scraping isn't always allowed. If you decide to scrape a page, make sure the owners of that page allow it, or you might end up doing something illegal.


Assuming you are allowed to scrape a page, apply the following steps.

The HTTP Request

First, you make an HTTP request to get the content of the page. There are several ways to do that.

fopen

The most basic way to send an HTTP request is to use fopen. A main advantage is that you can set how many bytes are read at a time, which can be useful when reading very large files. It's not the easiest thing to do correctly, though, and it's not recommended unless you're reading very large files and fear running into memory issues.

$fp = fopen("http://www.4wtech.com/csp/web/Employee/Login.csp", "rb");
if (FALSE === $fp) {
    exit("Failed to open stream to URL");
}

$result = '';

// Read the response in 8 KiB chunks until the end of the stream
while (!feof($fp)) {
    $result .= fread($fp, 8192);
}
fclose($fp);
echo $result;

file_get_contents

The easiest way is just to use file_get_contents. It does more or less the same as fopen, but gives you fewer options to choose from. A main advantage here is that it requires only one line of code.

$result = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
echo $result;

sockets

If you need more control over which headers are sent to the server, you can use sockets via fsockopen.

$fp = fsockopen("www.4wtech.com", 80, $errno, $errstr, 30);
if (!$fp) {
    $result = "$errstr ($errno)<br />\n";
} else {
    $result = '';
    // Request the page by its path; the Host header takes only the hostname
    $out = "GET /csp/web/Employee/Login.csp HTTP/1.1\r\n";
    $out .= "Host: www.4wtech.com\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    while (!feof($fp)) {
        $result .= fgets($fp, 128);
    }
    fclose($fp);
}
echo $result;

streams

Alternatively, you can use stream contexts. They are similar to sockets and can be used in combination with both fopen and file_get_contents.

$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n" .
              "Cookie: foo=bar\r\n"
  )
);

$context = stream_context_create($opts);

$result = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp', false, $context);
echo $result;

cURL

If your server supports cURL (it usually does), using cURL is recommended. A key advantage of cURL is that it relies on a popular C library that is commonly used in other programming languages as well. It also provides a convenient way to create request headers, auto-parses response headers, and offers a simple interface for handling errors.

$defaults = array(
    CURLOPT_URL => "http://www.4wtech.com/csp/web/Employee/Login.csp",
    CURLOPT_HEADER => 0,
    CURLOPT_RETURNTRANSFER => true  // return the response rather than printing it
);

$ch = curl_init();
curl_setopt_array($ch, $defaults);
if( ! $result = curl_exec($ch)) { 
    trigger_error(curl_error($ch)); 
} 
curl_close($ch); 
echo $result; 

Libraries

Alternatively, you can use one of many PHP libraries. I wouldn't recommend using a library, though, as it's likely to be overkill. In most cases, you're better off writing your own HTTP class using cURL under the hood.
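If you do go that route, a minimal sketch of such a class might look like the following. The class name, method, and defaults here are illustrative assumptions, not an existing library.

// A home-grown HTTP helper wrapping cURL (illustrative sketch)
class HttpClient
{
    public function get($url, array $headers = array())
    {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,   // return the body as a string
            CURLOPT_FOLLOWLOCATION => true,   // follow redirects
            CURLOPT_HTTPHEADER     => $headers,
            CURLOPT_TIMEOUT        => 30,
        ));
        $body = curl_exec($ch);
        if ($body === false) {
            $error = curl_error($ch);
            curl_close($ch);
            throw new RuntimeException("Request failed: $error");
        }
        curl_close($ch);
        return $body;
    }
}

$client = new HttpClient();
echo $client->get('http://www.4wtech.com/csp/web/Employee/Login.csp');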


The HTML parsing

PHP has a convenient way to load any HTML into a DOMDocument.

$pagecontent = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
$doc = new DOMDocument();
$doc->loadHTML($pagecontent);
echo $doc->saveHTML();

Unfortunately, PHP's support for HTML5 is limited. If you run into errors trying to parse your page content, consider using a third-party library. For that, I can recommend Masterminds/html5-php. Parsing an HTML file with this library is very similar to parsing an HTML file with DOMDocument.

use Masterminds\HTML5;

$pagecontent = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
$html5 = new HTML5();
$dom = $html5->loadHTML($pagecontent);
echo $html5->saveHTML($dom);

Alternatively, you can use e.g. my library PHPPowertools/DOM-Query. It uses Masterminds/html5-php under the hood for parsing an HTML5 string into a DOMDocument, and symfony/DomCrawler for converting CSS selectors to XPath selectors. It always uses the same DOMDocument, even when passing one object to another, to ensure decent performance.

namespace PowerTools;

// Get file content
$pagecontent = file_get_contents( 'http://www.4wtech.com/csp/web/Employee/Login.csp' );

// Define your DOMCrawler based on file string
$H = new DOM_Query( $pagecontent );

// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query( $H->select('body') );

// Passing a string (CSS selector)
$s = $H->select( 'div.foo' );

// Passing an element object (DOM Element)
$s = $H->select( $documentBody );

// Passing a DOM Query object
$s = $H->select( $H->select('p + p') );

// Select the body tag
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function( $i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two new elements
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');
John Slegers

Apologies for the plug, but I've written JS_Extractor for screen scraping. It's actually just a very simple extension of the DOM extension, with some helper methods to make things a little easier, but it works very well.

Jack Sleight

The SimpleTest unit testing framework has a scriptable browser component that can be used on its own. I usually use it for screen scraping/bots, because it can emulate a browser.
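For illustration, a login-and-fetch session with the scriptable browser might look roughly like this; the include path and the form field names are assumptions, so check your SimpleTest install and the actual login form.

require_once 'simpletest/browser.php'; // path depends on your SimpleTest install

$browser = new SimpleBrowser();
$browser->get('http://localhost/Test/loginpage.html');

// Field names are assumptions; match them to the real login form.
$browser->setField('username', 'myuser');
$browser->setField('password', 'mypass');
$browser->clickSubmit('Login');

// The browser keeps cookies, so this request happens as the logged-in user.
$browser->get('http://www.4wtech.com/csp/web/Employee/Login.csp');
echo $browser->getContent();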

troelskn