Get HTML content from another site

Question

I would like to dynamically retrieve the HTML contents of another website. I have the company's permission to do so.

Please, don't point me to JSONP, because I can't edit Site A, only Site B

define `permission of the company`. That means nothing unless they send `Access-Control-Allow-Origin` header — Esailija, Jul 11 '12 at 18:19
Do you use a server side language? You could get the page using your server side language and then display that on your page. — Stefan H, Jul 11 '12 at 18:20
It's a shipping company, they don't have an API, so they allowed us to use the index.php?trackingnumber=xxxxx query. — Souza, Jul 11 '12 at 18:20
Do you have access to any server-side language? If so, which one? Unfortunately, you will need to use a server-side solution, as cross-domain security will thwart any effort to retrieve data from a remote domain. JSONP is also not appropriate because the return data will be HTML, not javascript. Your only route here is server-side or iframes, and the latter is probably not adequate. — Chris Baker, Jul 11 '12 at 18:23
@StefanH I do use server-side language, php, how would i do it using it? — Souza, Jul 11 '12 at 18:36

Chris Baker · Accepted Answer · 2012-07-11T19:57:03.937

Because of cross-domain security issues, you won't be able to do this client-side, unless you're content with an iframe.

With PHP, you can use several methods of "scraping" the content. The approach you use depends on whether you need to use cookies in your requests (i.e. the data is behind a login).

Either way, to start things off on the client side you'll issue a standard AJAX request to your own server:

$.ajax({
  type: "POST",
  url: "localProxy.php",
  data: {url: "maybe_send_your_url_here.php?product_id=1"}
}).done(function( html ) {
   // do something with your HTML!
});
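
The localProxy.php endpoint that receives this request isn't shown here. As a minimal sketch, assuming the client sends an absolute URL and that you only want to proxy a known set of hosts, it might look like the following. ALLOWED_HOSTS and isAllowedUrl() are hypothetical names made up for illustration, and thirdpartydomain.internet is a placeholder host.

```php
<?php
// localProxy.php (sketch): fetch a remote page on behalf of the browser.
// ALLOWED_HOSTS and isAllowedUrl() are hypothetical names, not part of the answer above.
const ALLOWED_HOSTS = ['thirdpartydomain.internet'];

/** Return true only for http(s) URLs whose host is on the allowlist. */
function isAllowedUrl(string $url, array $allowedHosts): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false; // relative or malformed URLs are rejected
    }
    $scheme = strtolower($parts['scheme']);
    $host = strtolower($parts['host']);
    return in_array($scheme, ['http', 'https'], true)
        && in_array($host, $allowedHosts, true);
}

// Only runs when the script receives the POST request from the $.ajax call.
if (isset($_POST['url']) && is_string($_POST['url'])) {
    $url = $_POST['url'];
    if (!isAllowedUrl($url, ALLOWED_HOSTS)) {
        http_response_code(400);
        exit('URL not allowed');
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        http_response_code(502);
        exit('Upstream request failed');
    }
    echo $html;
}
```

The allowlist is a design choice rather than a requirement: without some check like it, anyone could use the script to make your server request arbitrary URLs.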

If you need cookies set (if the remote site requires login, you need 'em), you're going to use cURL. The full mechanics of logging in with post data and accepting cookies are a little beyond the scope of this answer, but your requests would look something like this:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://thirdpartydomain.internet/login_url.php');
// NOTE: disabling peer verification is insecure; leave it enabled for HTTPS URLs unless you know you need this
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// cookies set by the login response are saved to this file
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.jar');
curl_setopt($ch, CURLOPT_POST, true);
// http_build_query() URL-encodes the values; the field names depend on the remote login form
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('email' => $username, 'password' => $password)));
$result = curl_exec($ch);
curl_close($ch);

At that point, you can check the $result variable and make sure the login worked. If so, you'd then use cURL to issue another request to grab the page content. The second request won't have all the post junk, and you'd use the URL that you're trying to fetch. It does need CURLOPT_COOKIEFILE pointed at the same cookie file, so that the session cookies from the login are sent along. You'd end up with a large string full of HTML.
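
As a rough sketch of that second request, assuming the login request saved its cookies to a file called cookie.jar: fetchOptions() and fetchWithSession() are hypothetical helper names, and the URL in the usage comment is a placeholder.

```php
<?php
// Sketch of the follow-up request. fetchOptions() and fetchWithSession()
// are hypothetical helper names introduced for illustration.

/** Build cURL options for a plain GET that sends the cookies saved during login. */
function fetchOptions(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body as a string
        CURLOPT_TIMEOUT        => 60,
        CURLOPT_COOKIEFILE     => $cookieFile, // read the cookies written by CURLOPT_COOKIEJAR
        CURLOPT_COOKIEJAR      => $cookieFile, // keep the file updated if the site sets new cookies
    ];
}

/** Fetch $url with the stored session; returns the HTML string, or false on failure. */
function fetchWithSession(string $url, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, fetchOptions($url, $cookieFile));
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

// Usage (placeholder URL, same cookie file as the login request):
// $html = fetchWithSession('http://thirdpartydomain.internet/page_you_want.php', 'cookie.jar');
```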

If you only need a portion of that page's content, you can load the string into a DomDocument with the method shown below. Just use the loadHTML method instead of loadHTMLFile.
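
For instance, pulling a single element out of an HTML string could look roughly like this. It is only a sketch: extractById() is a hypothetical helper, the id "tracking" is made up, and the inline string stands in for whatever cURL returned.

```php
<?php
// extractById() is a hypothetical helper; 'tracking' is a made-up element id.

/** Return the outer HTML of the first element whose id matches, or null if there is none. */
function extractById(string $html, string $id): ?string
{
    if ($html === '') {
        return null; // nothing to parse
    }
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    // Compare ids in PHP rather than building an XPath string from user input.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@id]') as $node) {
        if ($node->getAttribute('id') === $id) {
            return $doc->saveHTML($node); // serialize just this node
        }
    }
    return null;
}

// Usage with an inline string standing in for the cURL result:
$html = '<html><body><h1>Header</h1><div id="tracking"><p>Delivered</p></div></body></html>';
$fragment = extractById($html, 'tracking');
```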

Speaking of DomDocument, if you don't need cookies, then you can use DomDocument directly to fetch the page, skipping cURL:

$doc = new DOMDocument('1.0', 'UTF-8');
// real-world HTML is rarely valid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
// fetch the remote page and parse it into the DOM (needs allow_url_fopen); see the documentation links below for more info
$doc->loadHTMLFile('http://third_party_url_here.php?query=string');

// since we are working with HTML fragments here, remove <!DOCTYPE
$doc->removeChild($doc->firstChild);

// remove <html></html> and any junk
$body = $doc->getElementsByTagName('body');
$doc->replaceChild($body->item(0), $doc->firstChild);

// now, you can get any portion of the html (target a div, for example) using familiar DOM methods

// echo the HTML (or desired portion thereof)
die($doc->saveHTML());

Documentation

HTML iframe on MDN - https://developer.mozilla.org/en/HTML/Element/iframe
jQuery.ajax() - http://api.jquery.com/jQuery.ajax/
PHP's cURL - http://php.net/manual/en/book.curl.php
curl_setopt (information about using cookies) - http://www.php.net/manual/en/function.curl-setopt.php
PHP's DomDocument - http://php.net/manual/en/class.domdocument.php
DomDocument::loadHTMLFile - http://www.php.net/manual/en/domdocument.loadhtmlfile.php
DomDocument::loadHTML - http://www.php.net/manual/en/domdocument.loadhtml.php

Thank you very much for the very well explained answer :), up voted! About the DOMDocument portion that i'm using, i think it's not working and it's the approach i prefer . The page says only on the title "Object moved" (I tried to put echo $doc->saveHTML(); didn't work) http://www.iloja.pt/ajaxload/urbanosapi.php — Souza, Jul 11 '12 at 19:29
That indicates the third-party site is returning that in response to your request. Try to add `die($url)` right before your `loadHTMLFile` call to debug exactly which URL is being used, then copy-paste that URL straight into the browser and verify that you actually get the content you expect. — Chris Baker, Jul 11 '12 at 19:35
Chris, this is the URL being printed out : http://expresso.urbanos.com/public/?ns=9000014294991 — Souza, Jul 11 '12 at 19:39
The problem is in the code that attempts to remove the `html` and `body` tags. I've updated the code... if you're going to be using a chunk of the site (like you grab a certain div and just use that HTML) then you don't need to worry about that part at all. If you are going to use it all, you'll need to dial in the code that seeks to extract the contents of the body. Just remember, `DomDocument` works very much like javascript's DOM manipulation, so if you do it in javascript you can do it there. — Chris Baker, Jul 11 '12 at 19:49
Hey Chris, thank you so much for all the quick feedback. Unfortunately i think the problem now is another: `Warning: DOMDocument::saveHTML() [domdocument.savehtml]: output conversion failed due to conv error, bytes 0x88 0xE4 0x61 0x09 in /home/iloja/public_html/ajaxload/urbanosapi.php on line 13` — Souza, Jul 11 '12 at 19:53
Try the example code again, it was missing a line. The error is caused by an encoding problem. When I run the sample on my local server, it functions correctly. Note the explicit declaration of utf-8 in the `DomDocument` constructor. — Chris Baker, Jul 11 '12 at 19:59
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/13755/discussion-between-souza-and-chris) — Souza, Jul 11 '12 at 20:02
