10

I've been trying to retrieve the contents of a webpage (http://3sk.tv) using file_get_contents. Unfortunately, the resulting output is missing many elements (images, formatting, styling, etc.) and looks nothing like the original page I'm trying to retrieve.

This has never happened with any other URL I have tried to retrieve using this same method, but for some reason this particular URL (http://3sk.tv) refuses to work properly.

The code I'm using is:

<?php
$homepage = file_get_contents('http://3sk.tv');
echo $homepage;
?>

Am I missing anything? All suggestions on how to get this working properly would be greatly appreciated. Thank you all for your time and consideration.

A5C1D2H2I1M1N2O1R2T1
jameslanvin
  • I would recommend using `cURL` for this. [See here for details](https://davidwalsh.name/curl-download). Also be wary: scraping is not always legal... – chriz Dec 16 '15 at 14:48
  • I tried the cURL implementation you referred to; unfortunately there was no change at all. Thanks for your input. – jameslanvin Dec 16 '15 at 19:59
  • 1
    Btw this is for a uni research paper, not scraping purposes – jameslanvin Dec 16 '15 at 20:09

4 Answers

6

That's normal behaviour: file_get_contents only fetches the HTML document itself, not the images, stylesheets, and other files it refers to.
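
To see which files the fetched HTML merely refers to, one option is to parse it and list the URLs it points at. The following is only a minimal sketch: `listReferencedAssets` is a hypothetical helper name, the sample markup is made up for illustration (only `lib/dropdown/dropdown.css` appears on this page), and it assumes PHP's DOM extension is enabled.

```php
<?php
// Hypothetical helper: collect the resource URLs an HTML document refers to.
function listReferencedAssets(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely well-formed; keep libxml warnings internal.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $assets = [];
    // Tag name => attribute that holds the resource URL.
    $sources = ['img' => 'src', 'script' => 'src', 'link' => 'href'];
    foreach ($sources as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            if ($node->hasAttribute($attr)) {
                $assets[] = $node->getAttribute($attr);
            }
        }
    }
    return $assets;
}

// Made-up sample markup standing in for the output of file_get_contents().
$sample = '<html><head><link rel="stylesheet" href="lib/dropdown/dropdown.css"></head>'
        . '<body><img src="images/logo.png"><script src="js/app.js"></script></body></html>';

print_r(listReferencedAssets($sample));
```

None of the listed files are downloaded by file_get_contents; the browser requests each of them separately when it renders the echoed HTML.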

RFLdev
  • You're absolutely right, it does not load images or CSS... Any ideas or suggestions on how to retrieve the entire content? – jameslanvin Dec 16 '15 at 20:09
4

I have one quick workaround to fix relative paths: the <base> tag.

http://www.w3schools.com/tags/tag_base.asp

Just add a <base> tag to the HTML before you output it.

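<?php
// Fetch the raw HTML of the page.
$homepage = file_get_contents('http://3sk.tv');
// Inject a <base> tag so relative URLs resolve against the original site.
echo str_replace(
    '<head>',
    '<head><base href="http://3sk.tv" target="_blank">',
    $homepage
);
?>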

That should help.

z1m.in
  • Hi @jQuery00, tried using your suggested method, there was some improvement in the final output (images in the body appeared) but still many elements of the CSS & styling are missing. Any suggestions would be highly appreciated. Thanks – jameslanvin Dec 16 '15 at 20:02
  • Hi @jameslanvin, good news for you. I found a problem and updated the question. Now it works like a charm! – z1m.in Dec 16 '15 at 20:15
  • 1
    Just tested it again. You, sir, are the file_get_contents whisperer! Awesome. Works almost perfectly! Thanks – jameslanvin Dec 16 '15 at 22:11
3

This is to be expected. If you look at the source code, you'll notice many places that do not use a full URL (e.g. lib/dropdown/dropdown.css). The browser resolves such a relative path against the URL of the page it loaded, so on the original site it becomes http://3sk.tv/lib/dropdown/dropdown.css. On your website, however, it becomes YOURURL.COM/lib/dropdown/dropdown.css, which does not exist. This will be the case for much of the content.

So you can't just print another website's source and expect it to work. The relative paths only resolve correctly when the page is served from its original URL.
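
As a rough illustration of that resolution step, here is a simplified sketch of how a relative path is combined with the URL of the page it came from. `resolveAgainstBase` is a hypothetical helper name, and the logic is deliberately reduced: it assumes the base URL has a scheme and host, and it does not handle `../` segments, query strings, or fragments the way a browser does.

```php
<?php
// Hypothetical helper: resolve a URL found in a page against that page's URL.
function resolveAgainstBase(string $url, string $base): string
{
    // Already absolute (has a scheme) or protocol-relative: leave untouched.
    if (preg_match('#^([a-z][a-z0-9+.-]*:|//)#i', $url)) {
        return $url;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Root-relative path: attach directly to the origin.
    if ($url !== '' && $url[0] === '/') {
        return $origin . $url;
    }

    // Otherwise resolve against the directory part of the base path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $url;
}

echo resolveAgainstBase('lib/dropdown/dropdown.css', 'http://3sk.tv'), "\n";
```

Calling the same function with your own site's address as the base yields a path on your server instead, which is why the stylesheet request fails there.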

The best way to embed another website is usually to just use an iframe or some alternative.

Community
Daniel Centore
2

The webpage is not generated entirely server-side; it relies heavily on JavaScript that runs after the HTML loads. If you want to render the page as it looks in a browser, you may need a headless browser instead. See, for example, this binding to PhantomJS: http://jonnnnyw.github.io/php-phantomjs/

Piskvor left the building
  • 1
    (as for "this never happened before" - brace yourself; you were lucky so far, this happens pretty much all the time) – Piskvor left the building Dec 16 '15 at 15:05
  • "brace yourself" <= good to know, thanks. I'm currently testing the solution you suggested (just waiting for PhantomJS to build... long process) and will let you know how it went as soon as it's done. – jameslanvin Dec 16 '15 at 20:00