An efficient way to scrape a webpage

Question

Possible Duplicate:
How to parse and process HTML with PHP?

I want to retrieve the header and footer of a webpage (the owners know this) and display it on a new page so I can add in different content. The page is structured pretty nicely with the content inside a div with an id of content so I figured I could do the following:

1. Use cURL to retrieve the HTML.
2. Take the HTML on either side of the content.
3. Echo it out onto a new page.

My problem is I'm not too PHP savvy so I'm not sure how to take the two lumps of html either side. I've used substring in Java before but the substr in PHP seems to work a little differently. Can anyone suggest an alternative?

Thanks

score 2 · Accepted Answer · answered Oct 22 '12 at 16:47

Substring and RegEx are not sufficient tools for handling HTML. It would be best (and much easier) to use a DOM parser.

Take a look at the DOMDocument class. It supports loading HTML, and allows you to easily traverse the document.
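
A minimal sketch of how this might look, assuming the page wraps its body in an element with an id of content, as the question describes. The helper name splitAroundContent, the marker text, and the sample markup are all made up for illustration; they are not from the original page.

```php
<?php
// Hypothetical helper: returns the HTML before and after the element with the given id.
function splitAroundContent(string $html, string $id = 'content'): ?array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $node = $xpath->query('//*[@id="' . $id . '"]')->item(0);
    if ($node === null) {
        return null; // no element with that id
    }

    // Swap the content element for a comment marker, then split the serialized page on it.
    $marker = 'CONTENT-PLACEHOLDER'; // arbitrary text, assumed not to occur in the page
    $node->parentNode->replaceChild($doc->createComment($marker), $node);

    list($before, $after) = explode('<!--' . $marker . '-->', $doc->saveHTML(), 2);
    return ['before' => $before, 'after' => $after];
}

// Invented sample page standing in for the HTML fetched with cURL.
$sample = '<html><body><div id="header">Top</div>'
        . '<div id="content"><p>Old body</p></div>'
        . '<div id="footer">Bottom</div></body></html>';

$parts = splitAroundContent($sample);
echo $parts['before'], '<p>My new content</p>', $parts['after'];
```

Replacing the node and splitting on a marker keeps everything either side of the content, including the head and any wrappers, without relying on string offsets in the raw HTML.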

Brad

I wouldn't even call substring/regex **IN**sufficient tools – Marc B Oct 22 '12 at 16:52
Thanks for the replies. I think the DOM parse would be my best bet for the footer anyway but I may need something else for the header as it isn't as well structured as I previously thought. – MillyMonster Oct 22 '12 at 17:19
@MillyMonster, The document will be parsed into a structured document. – Brad Oct 22 '12 at 18:04
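
To illustrate the point in the last comment, here is a small sketch showing that DOMDocument still produces a usable tree from poorly structured markup. The messy header snippet is invented for this example, not taken from the page in question.

```php
<?php
// Invented example of sloppy markup: unclosed <div>, <p> and <li> tags.
$messy = '<div id="header"><p>Site title<ul><li>Home<li>About</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($messy);
libxml_clear_errors();

// The parser closes the open tags itself, so the header is an ordinary element node.
$xpath = new DOMXPath($doc);
$header = $xpath->query('//*[@id="header"]')->item(0);
$items = $header->getElementsByTagName('li');

echo $items->length, "\n";            // number of <li> elements found
echo $doc->saveHTML($header), "\n";   // header re-serialized with closing tags
```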

score 1 · Answer 2 · edited May 23 '17 at 10:26

To scrape a webpage, I used an HTML DOM parser. That would be the easiest approach for you. You can find more tools in this post: How to parse and process HTML with PHP?

answered Oct 22 '12 at 16:49

kasp3r

WizzHead · Answer 3 · 2012-10-22T17:07:56.877

I did something very similar the other day. I chose to use jQuery, Ajax and PHP to collect the pages and break them down. I have included a simplified version of my code.

For PHP I used cURL (get-url.php):

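<?php
// get-url.php: fetch the page named in ?url= and echo its HTML.
// Accept only absolute http(s) URLs so the script cannot be used to read local files.
$requestURL = isset($_GET['url']) ? $_GET['url'] : '';
if (!preg_match('#^https?://#i', $requestURL)) {
    http_response_code(400);
    exit('Invalid URL');
}
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, $requestURL);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, TRUE);  // return the body instead of printing it
curl_setopt($curl_handle, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($curl_handle, CURLOPT_FRESH_CONNECT, TRUE);
curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($curl_handle, CURLOPT_MAXREDIRS, 10);
curl_setopt($curl_handle, CURLOPT_DNS_USE_GLOBAL_CACHE, FALSE);
curl_setopt($curl_handle, CURLOPT_FORBID_REUSE, TRUE);
$content = curl_exec($curl_handle);
curl_close($curl_handle);
if ($content === false) {
    http_response_code(502);
    exit('Could not fetch the page');
}
echo $content;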

Then for Ajax I used:

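var url = ""; // set this to the URL you want to retrieve
$.ajax({
    url: "get-url.php",
    data: { url: url },  // jQuery URL-encodes the parameter
    type: "GET",
    dataType: "html",
    cache: false,
    success: function (data, textStatus, jqXHR) {
        // data is a plain HTML string, so parse it into nodes before searching.
        // Wrapping the nodes in a detached <div> lets .find() reach top-level elements too.
        var page = $("<div>").append($.parseHTML(data));
        var header = page.find("#header").html();
        var footer = page.find("#footer").html();
        // header_DOM and footer_DOM are placeholders for your own target selectors.
        $(header_DOM).html(header);
        $(footer_DOM).html(footer);
    }
});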

This is just a guide; adapt it to suit your needs.
