0

I have a PHP script that loads this webpage to extract some data from its tables.
The following methods failed to get its table contents:

Using file_get_contents:

$document = file_get_contents("http://www.webpage.com/");
print_r($document);

Using cURL:

$document = curl_init('http://www.webpage.com/');
curl_setopt($document, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($document);
print_r($html);

Using loadHTMLFile:

$document = new DOMDocument();
$document->loadHTMLFile('http://www.webpage.com/');
print_r($document);

I'm not an expert in PHP, and except for the first method, the others are copied from Stack Overflow answers.
What am I doing wrong?
And how does the site block some of its content from loading?

3 Answers

1

Not the answer you're likely to want to hear, but none of the methods you describe will evaluate JavaScript and other browser resources as a normal browser client would. Instead, each of those methods retrieves the contents of only the file you've specified. A quick glance at the site you're targeting clearly shows this table in question being populated as the result of an AJAX call, which none of the methods you've tried are able to evaluate.

You'll need to lean on a library or script that has the capability for this type of emulation; namely laravel/dusk, the PHP bindings for Selenium webdriver, or something similar.
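
To make the point about static retrieval concrete, here is a minimal sketch. The HTML string is a made-up stand-in for what `file_get_contents` or cURL might return from a page like this, not the real site's markup, and `count_table_rows` is a hypothetical helper. It assumes PHP's DOM extension is enabled.

```php
<?php
// Hypothetical stand-in for the raw response: the table body is empty
// because the rows would be added later by JavaScript in a browser.
$staticHtml = <<<HTML
<html><body>
<table id="rates"><tbody></tbody></table>
<script>
  // In a browser this would request the data and fill in the rows.
  fetch('/rates.json').then(function (r) { return r.json(); });
</script>
</body></html>
HTML;

// Count the <tr> elements present in the markup exactly as delivered.
function count_table_rows(string $html): int {
    $doc = new DOMDocument();
    // Suppress warnings about imperfect real-world markup.
    @$doc->loadHTML($html);
    return $doc->getElementsByTagName('tr')->length;
}

echo count_table_rows($staticHtml), PHP_EOL;
```

The script tag is parsed as text but never executed, so no rows ever appear in the document PHP sees.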

esqew
  • I need to run my PHP script with cron. Is it possible to install these libraries/scripts on a server? I read the documentation and it talks about WebDrivers and browsers. I didn't understand a single word. :) – Mehdi Maazi Jun 23 '20 at 02:12
  • Mehdi, as blunt as this may sound: if you don't understand the documentation for these libraries or exactly what they do, you should pause to get a better understanding of them and of your PHP fundamentals. Your philosophy should be to never, under any circumstances, run code whose behavior you don't understand. For all you know, I could have linked you to malicious libraries that install security backdoors on your servers! – esqew Jun 23 '20 at 14:53
  • Thanks for the advice, sir. I am an electronics engineer and I have been assigned to build a billboard that shows live currency and gold rates. All I need to complete this project is to extract that data from the website. I know the risks of using these kinds of tools and libraries without fully understanding them. That's why I'm seeking your advice and help as professionals. Is there any tutorial or manual to guide me safely through this task? Implementing WebDrivers seems to be a professional task, and there is no documentation to help beginners like me. – Mehdi Maazi Jun 23 '20 at 19:54
  • Is your requirement specifically to scrape the data from the site? Is there an API you could use instead? That would likely require significantly less legwork to get up and running. – esqew Jun 23 '20 at 21:14
  • Unfortunately this site has the most reliable data & I don't have many options on data sources. – Mehdi Maazi Jun 24 '20 at 03:52
  • I was thinking about a crazy or dumb idea. Is it possible to make an html page that loads the target page's contents? If I call that html in php, is it possible to fetch static html resources from it? I don't know if I made my point. Instead of loading target webpage directly, is it possible to load it from an HTML page that has static content (like a snapshot) of the target? – Mehdi Maazi Jun 24 '20 at 18:12
  • By the way, you mentioned APIs. Could you please give me a hint about them? There is another website (http://www.tgju.org/) whose code I dug into to get my data, and it's horrible. It has an API: a div block that I can embed in my HTML. If I could use the API's data instead of a whole webpage, it would decrease my server's load. – Mehdi Maazi Jun 24 '20 at 18:24
0

This is what I did to scrape data from a webpage using PHP and cURL:

    // Fetch a URL with cURL and return the response body as a string
    function curl($url) {
        $ch = curl_init();  // Initialise cURL
        curl_setopt($ch, CURLOPT_URL, $url);    // Set the URL to request
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the response instead of printing it
        $data = curl_exec($ch); // Execute the request; $data is false on failure
        curl_close($ch);    // Close the cURL handle
        return $data;
    }

    // Return the text between $start and $end (case-insensitive), or '' if either marker is missing
    function scrape_between($data, $start, $end) {
        $data = stristr($data, $start); // Drop everything before $start
        if ($data === false) {
            return '';
        }
        $data = substr($data, strlen($start));  // Drop $start itself
        $stop = stripos($data, $end);   // Position of $end in the remaining text
        if ($stop === false) {
            return '';
        }
        return substr($data, 0, $stop); // Keep only the text before $end
    }

    $target_url = "https://www.somesite.com";

    $scraped_website = curl($target_url);

    $data_set_1 = scrape_between($scraped_website, "%before%", "%after%");
    $data_set_2 = scrape_between($scraped_website, "%before%", "%after%");

%before% and %after% are placeholders for text that always appears on the webpage immediately before and after the data you wish to grab. They could be div tags or other HTML tags that are unique to that data.
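
As a small illustration of choosing the markers, the sketch below runs the helper on a made-up HTML fragment instead of a live page. The element ids `usd-price` and `gold-price` are hypothetical, and the helper is repeated here (with guards for missing markers) only so the snippet runs on its own.

```php
<?php
// The helper from above (with guards for missing markers),
// repeated so this snippet is self-contained.
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);         // Drop everything before $start
    if ($data === false) {
        return '';
    }
    $data = substr($data, strlen($start));  // Drop $start itself
    $stop = stripos($data, $end);           // Position of $end in what remains
    if ($stop === false) {
        return '';
    }
    return substr($data, 0, $stop);         // Keep only the text before $end
}

// Hypothetical page fragment; the ids are invented for illustration.
$html = '<div id="usd-price">42.10</div><div id="gold-price">1750.25</div>';

// The opening tag is unique to the value we want, so it works as the start marker.
$gold = scrape_between($html, '<div id="gold-price">', '</div>');
echo $gold, PHP_EOL;
```

This only works when the value is already present in the HTML the server returns; data filled in later by JavaScript will not be found.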

David
  • Not sure what you mean by "Load the tables." What this code will do is put the HTML of the page into a variable, then you can extract the data that resides between two defined data sets. Are you attempting to display a table from another site on your site? – David Jun 23 '20 at 20:49
  • I just want to get the raw data and extract some information from it. This site uses AJAX to load that information, and your method doesn't retrieve it. – Mehdi Maazi Jun 24 '20 at 03:54
0

So maybe look into using cURL to imitate the same AJAX request that the site makes? When I searched for that, this is what I found: Mimicking an ajax call with Curl PHP
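
A rough sketch of that idea follows, under several assumptions: the endpoint URL has to be found in the browser's network tab, the response is assumed to be JSON, and the function names `fetch_ajax_json` and `decode_rates` and the sample payload are all invented for illustration. The real site may need different headers, cookies, or a POST request.

```php
<?php
// Decode a JSON response body into an array, or null if it is not a JSON object/array.
function decode_rates(string $json): ?array {
    $data = json_decode($json, true);
    return is_array($data) ? $data : null;
}

// Request an endpoint the way the page's own script might.
// $endpoint is hypothetical: look up the real URL in the browser's network tab.
function fetch_ajax_json(string $endpoint): ?array {
    $ch = curl_init($endpoint);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        'X-Requested-With: XMLHttpRequest', // header many AJAX libraries send
        'Accept: application/json',
    ]);
    $body = curl_exec($ch); // false on failure
    return $body === false ? null : decode_rates($body);
}

// Made-up sample of what such an endpoint might return.
$sample = '{"usd": 42.10, "gold": 1750.25}';
$rates = decode_rates($sample);
echo $rates['gold'], PHP_EOL;
```

Splitting the decoding from the request keeps the parsing step easy to check against a saved response before pointing the script at the live endpoint.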

David