
I'm trying to use file_get_contents($url) to scrape some content, but it doesn't return the right content. It only returns some scripts (I think they are responsible for location and language checking), and then it fails and doesn't continue scraping the whole page:

    $url = 'https://shop.bitmain.com/';
    $exists = false;
    $url_headers = get_headers($url);
    if (!$url_headers || $url_headers[0] == 'HTTP/1.1 404 Not Found') {
        $exists = false;
    } else {
        $exists = true;
    }

    if (filter_var($url, FILTER_VALIDATE_URL) == FALSE || $exists == false) {

        $error .= '<div class="alert alert-danger" role="alert">That city could not be found.</div>';

    } else if (filter_var($url, FILTER_VALIDATE_URL) == TRUE && $exists == true) {

        $html = file_get_contents($url);
        if ($html != FALSE && $html != NULL) {
            echo $html;
        }
    }
– sc0rp10n.my7h

1 Answer


Let's call file_get_contents() a "dumb" function when it comes to loading URL content: it returns the content as presented when the DOM is loaded for the first time.

To get the actual content of many websites, you also need to follow redirects, which you can achieve by using curl (refer to: How to get the real URL after file_get_contents if redirection happens?).
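
For example, a minimal curl sketch of just the redirect part (not a complete solution; the URL is the one from the question):

    // Minimal sketch: fetch the page with curl and follow redirects.
    $ch = curl_init('https://shop.bitmain.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 3xx redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);        // safety limit for redirect chains
    $html = curl_exec($ch);
    if ($html === false) {
        echo 'curl error: ' . curl_error($ch);
    }
    curl_close($ch);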

If the final page uses a lot of AJAX to post-load data, even curl will not deliver the desired content, but only a "naked" HTML page without the actual content.


So, nowadays, you need to take care of loading asynchronous content manually: parse the content of the initial URL, parse the JS files, obtain the ajax URLs, and call those again while passing along any cookies the target page might have generated for your request...
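
As a rough illustration of the cookie part (the ajax endpoint below is purely hypothetical - the real one has to be dug out of the page's JS or network traffic):

    // Sketch: store cookies from the initial request and send them along with an ajax call.
    $jar = tempnam(sys_get_temp_dir(), 'cookies');

    $ch = curl_init('https://shop.bitmain.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);      // write cookies set by the response
    curl_exec($ch);
    curl_close($ch);

    $ch = curl_init('https://shop.bitmain.com/api/products'); // hypothetical ajax URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);     // send the stored cookies back
    $json = curl_exec($ch);
    curl_close($ch);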

Or use a "native client" (e.g. a headless browser), which will execute the page just like a browser does and is able to return the final data.
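
For example (assuming a headless Chrome/Chromium binary is installed - the binary name varies per system), you can let the browser render the page and dump the final DOM:

    // Sketch: let headless Chromium execute the page's JS and print the resulting DOM.
    $url  = escapeshellarg('https://shop.bitmain.com/');
    $html = shell_exec("chromium --headless --disable-gpu --dump-dom $url");
    echo $html;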

Just calling file_get_contents("url"); and expecting the same source code as if you opened the URL in a browser won't work anymore for the majority of websites.

– dognose
  • Which, BTW - if you're linking to another SO answer, suggests you should have voted to close as duplicate, rather than answering... :) – random_user_name Oct 19 '18 at 22:28
  • @cale_b It's not a duplicate question or anything. The link is just meant to add some ideas about why the attempted approach does not work in many cases. – dognose Oct 19 '18 at 22:29
  • @dognose Thanks for your time. How do you think I should solve this problem? I've tried curl and unfortunately it didn't work. – sc0rp10n.my7h Oct 19 '18 at 22:35
  • @ibrahim.fathy See my update: https://shop.bitmain.com/ is heavily based on ajax - there's no easy solution with plain PHP. – dognose Oct 19 '18 at 22:38