2

I don't want to download the whole web page. It will take time and it needs a lot of memory.

How can I download only a portion of the web page? I will then parse that portion.

Suppose I need only the <div id="entryPageContent" class="cssBaseOne">...</div>. How can I do that?

cola
  • You can't. Not unless you have control over the server and can customize the response. In that case you can send back a partial view. – Rob Apr 10 '12 at 16:26
  • Perhaps: http://stackoverflow.com/questions/1538952/retrieve-partial-web-page – mikevoermans Apr 10 '12 at 16:27
  • @Rob: the server can't exactly force the client not to close the socket before it's read all of the data. – Wooble Apr 10 '12 at 16:31
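
Wooble's point can be sketched in code: a client may read the response in chunks and stop once it has seen enough. This is only a sketch. The helper name `readUntilMarker`, the marker string, and the size limits are all assumptions for illustration, and picking a marker that reliably follows the wanted element is left to the caller (a nested `</div>` makes the first closing tag an unreliable choice).

```php
<?php
// Hypothetical helper: read from an open stream in chunks and stop as soon as
// $marker has appeared or $maxBytes have been read, whichever comes first.
function readUntilMarker($stream, string $marker, int $maxBytes = 1048576, int $chunkSize = 8192): string
{
    $buffer = '';
    while (!feof($stream) && strlen($buffer) < $maxBytes) {
        $chunk = fread($stream, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break; // read error or no more data available
        }
        $buffer .= $chunk;
        // Search the whole buffer so a marker split across two chunks is still found.
        if (strpos($buffer, $marker) !== false) {
            break; // enough data; the caller can now close the stream
        }
    }
    return $buffer;
}

// Offline demonstration: an in-memory stream stands in for a network stream.
$stream = fopen('php://memory', 'w+');
fwrite($stream, str_repeat('a', 100) . '<!-- end -->' . str_repeat('b', 100000));
rewind($stream);
$data = readUntilMarker($stream, '<!-- end -->', 1048576, 64);
fclose($stream);
```

With a real page, the stream could come from `fopen('http://example.com/somepage.html', 'r')` (which assumes `allow_url_fopen` is enabled). The bytes after the marker are never read, although the server may already have sent some of them.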

2 Answers

5

You can't ask a server for "only this piece of HTML". HTTP supports partial downloads only as byte ranges and has no concept of HTML/XML document trees.

So you'll have to download the entire page, load it into a DOM parser, and then extract only the portion(s) you need.

e.g.

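// Fetch the full page; HTTP cannot return just one element.
$html = file_get_contents('http://example.com/somepage.html');
$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings from malformed real-world HTML
$dom->loadHTML($html);
$div = $dom->getElementById('entryPageContent');

// saveHTML() is a DOMDocument method; passing the node serializes only that element.
$content = $dom->saveHTML($div);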
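
A slightly more defensive variant of the extraction step might look like the sketch below. The function name `extractElementById` and the sample markup in `$page` are assumptions made for illustration; the point is that `getElementById()` returns null when the id is absent, and that case should be handled before serializing.

```php
<?php
// Hypothetical helper (not from the original answer): return the outer HTML
// of the element with the given id, or null if no such element exists.
function extractElementById(string $html, string $id): ?string
{
    $dom = new DOMDocument();
    // Real-world HTML is often malformed; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $dom->getElementById($id);
    if ($node === null) {
        return null; // id not present in the document
    }
    // Passing the node serializes only that element, not the whole document.
    return $dom->saveHTML($node);
}

// Sample input standing in for a downloaded page.
$page = '<html><body><div id="entryPageContent" class="cssBaseOne"><p>Hello</p></div><p>Other</p></body></html>';
$fragment = extractElementById($page, 'entryPageContent');
$missing = extractElementById($page, 'noSuchId');
```

Here `$fragment` holds the markup of the wanted div only, and `$missing` is null, so the caller can tell "element not found" apart from "element is empty".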
Marc B
  • For `$html = file_get_contents('http://example.com/somepage.html');`, where does it store the downloaded file temporarily? In memory, or somewhere on the hard disk? – cola Apr 14 '12 at 12:28
  • It'll go directly into $html. If you want to have it written to disk, you'll need to write it out yourself. – Marc B Apr 14 '12 at 22:24
  • So does it store the web page source in memory? – cola Apr 16 '12 at 17:18
  • PHP variables are by definition "in memory". You'll get the html of that url and nothing else. curl/file_get_contents are not browsers and will not 'spider' a page and download all the content in it. – Marc B Apr 16 '12 at 18:33
0

Using this:

curl_setopt($ch, CURLOPT_RANGE, "0-10000");

will make cURL request only bytes 0 through 10000 of the web page, roughly the first 10 KB. This only works if the server supports range requests. Many dynamically generated pages (CGI, PHP, ...) ignore the range and send the full response.
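
Because the server may ignore the range, it can help to check the status code of the response. The following is a sketch under assumptions: the helper names `firstBytesRange` and `fetchFirstBytes` are invented here, and only standard cURL options are used. A 206 (Partial Content) status means the range was honored, while a 200 means the whole page was sent anyway.

```php
<?php
// Hypothetical helper: build a CURLOPT_RANGE value covering the first $length bytes.
function firstBytesRange(int $length): string
{
    // Byte ranges are inclusive, so the first N bytes are 0 through N-1.
    return '0-' . ($length - 1);
}

// Hypothetical helper: fetch at most the first $length bytes of $url.
// Returns [body, honored], where honored is true only for a 206 response.
function fetchFirstBytes(string $url, int $length): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_RANGE, firstBytesRange($length));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // 206 Partial Content: the range was honored. Anything else (typically 200)
    // means the server sent the full page, so trim it locally.
    $honored = ($status === 206);
    return [$honored ? $body : substr($body, 0, $length), $honored];
}
```

A call such as `[$body, $honored] = fetchFirstBytes('http://example.com/somepage.html', 10000);` would then show whether any bandwidth was saved. When `$honored` is false, the full page was still transferred and only trimmed afterwards. Either way, the wanted div may lie beyond the range or be cut off in the middle, so the fragment is not guaranteed to parse cleanly.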

kuba