
I'm trying to use file_get_contents($url) to scrape some content, but it doesn't return the right content. It only returns some scripts (I think they are responsible for location and language checking), and then it fails and doesn't continue scraping the whole page:

    $url = 'https://shop.bitmain.com/';
    $exists = false;
    $url_headers = get_headers($url);
    if (!$url_headers || $url_headers[0] == 'HTTP/1.1 404 Not Found') {
        $exists = false;
    } else {
        $exists = true;
    }

    if (filter_var($url, FILTER_VALIDATE_URL) == FALSE || $exists == false) {

        $error .= '<div class="alert alert-danger" role="alert">That city could not be found.</div>';

    } else if (filter_var($url, FILTER_VALIDATE_URL) == TRUE && $exists == true) {

        $html = file_get_contents($url);
        if ($html != FALSE && $html != NULL) {
            echo $html;
        }
    }
– sc0rp10n.my7h

1 Answer


Let's call file_get_contents() a "dumb" function when it comes to loading URL content: it returns the content as presented when the DOM is loaded for the first time.

To get the actual content of many websites, you also need to follow redirects, which you can achieve by using curl (refer to: How to get the real URL after file_get_contents if redirection happens?).
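
For example, a minimal curl sketch of just the redirect part (not a complete solution; the URL is the one from the question):

    // Minimal sketch: fetch the page with curl and follow redirects.
    $ch = curl_init('https://shop.bitmain.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 3xx redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);        // safety limit for redirect chains
    $html = curl_exec($ch);
    if ($html === false) {
        echo 'curl error: ' . curl_error($ch);
    }
    curl_close($ch);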

If the final page uses a lot of AJAX to post-load data, even curl will not deliver the desired content, but only a "naked" HTML page without the actual content.


So, nowadays, you need to take care of loading asynchronous content manually: parse the content of the initial URL, parse the JS files, obtain the ajax URLs, and call those again while passing along any cookies the target page might have generated for your request...
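
As a rough illustration of the cookie part (the ajax endpoint below is purely hypothetical - the real one has to be dug out of the page's JS or network traffic):

    // Sketch: store cookies from the initial request and send them along with an ajax call.
    $jar = tempnam(sys_get_temp_dir(), 'cookies');

    $ch = curl_init('https://shop.bitmain.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);      // write cookies set by the response
    curl_exec($ch);
    curl_close($ch);

    $ch = curl_init('https://shop.bitmain.com/api/products'); // hypothetical ajax URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);     // send the stored cookies back
    $json = curl_exec($ch);
    curl_close($ch);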

Or use a "native client" (e.g. a headless browser), which will execute the page just like a browser does and is able to return the final data.
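
For example (assuming a headless Chrome/Chromium binary is installed - the binary name varies per system), you can let the browser render the page and dump the final DOM:

    // Sketch: let headless Chromium execute the page's JS and print the resulting DOM.
    $url  = escapeshellarg('https://shop.bitmain.com/');
    $html = shell_exec("chromium --headless --disable-gpu --dump-dom $url");
    echo $html;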

Just calling file_get_contents("url"); and expecting the same source code as if you opened the URL in a browser won't work anymore for the majority of websites.

– dognose
  • Which, BTW - if you're linking to another SO answer, suggests you should have voted to close as duplicate, rather than answering... :) – random_user_name Oct 19 '18 at 22:28
  • @cale_b It's not a duplicate question or anything. The link is just meant to add some ideas about why the attempted approach does not work in many cases. – dognose Oct 19 '18 at 22:29
  • @dognose Thanks for your time. How do you think I should solve this problem? I've tried curl and unfortunately it didn't work. – sc0rp10n.my7h Oct 19 '18 at 22:35
  • @ibrahim.fathy See my update: https://shop.bitmain.com/ is heavily based on ajax - there's no easy solution with plain PHP. – dognose Oct 19 '18 at 22:38