
I am trying to crawl data from a website, and it mostly works, but the page has a load-more button. I can only crawl the data that is visible initially; the data that appears after clicking the load-more button I can't crawl.

Using preg_match_all:

$page = file_get_contents('https://www.healthfrog.in/chemists/medical-store/gujarat/surat');

preg_match_all(
    '/<h3><a href="(.*?)">(.*?)<\/a><\/h3><p><i class="fa fa-map-marker"><\/i>(.*?)<\/p>/s',
    $page,
    $retailers, // will contain the article data
    PREG_SET_ORDER // formats data into an array of posts
);

foreach ($retailers as $post) {
    $retailer['name'] = $post[2]; 
    $retailer['address'] = $post[3]; 
    echo "<b>".$retailer['name']."</b><br/>".$retailer['address']."<br/><br/>";
}

Using DOMDocument:

$html = new DOMDocument();
@$html->loadHtmlFile('https://www.healthfrog.in/chemists/medical-store/gujarat/surat');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query('//*[@id="setrecord"]/div[@class="listing "]');

foreach ($nodelist as $n){
    $retailer = $xpath->query('h3/a', $n)->item(0)->nodeValue;
    $address = $xpath->query('p', $n)->item(0)->nodeValue;
    echo "<b>".$retailer."</b><br/>".$address."<br/><br/>";
}

Any idea how to grab the whole data at once?

  • Get the url that the load-more button loads? – jeroen Sep 12 '17 at 09:57
  • https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 seems relevant. – Niet the Dark Absol Sep 12 '17 at 09:57
  • @Niet the Dark Absol: that link is not relevant here; I request you to please read the question and understand it first –  Sep 12 '17 at 10:00
  • You are parsing HTML with regex, that's never a good start. Especially when you don't have control over what you're processing. – Niet the Dark Absol Sep 12 '17 at 10:01
  • @Niet the Dark Absol: my question is not about parsing; I wanted to know how to get the data that appears after the load-more click –  Sep 12 '17 at 10:03
  • @Niet the Dark Absol, as per your suggestion I did the same example using DOMDocument and DOMXPath –  Sep 12 '17 at 11:00
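The first comment above points at the usual fix: a load-more button typically fires an AJAX request for the next chunk of results, and you can fetch that URL directly. A minimal PHP sketch of the idea — note the `?page=N` parameter is an assumption for illustration, not the site's real endpoint; substitute whatever request shows up in the browser's network tab when you click the button:

```php
<?php
// Hypothetical paginated endpoint: the real URL and parameter name must
// be taken from the request the load-more button actually makes
// (visible in the browser dev tools' network tab).
function buildPageUrl(string $base, int $page): string {
    return $base . '?page=' . $page; // assumed parameter name
}

$base = 'https://www.healthfrog.in/chemists/medical-store/gujarat/surat';
for ($page = 1; $page <= 3; $page++) {
    $url = buildPageUrl($base, $page);
    // $html = file_get_contents($url);
    // ...run the same preg_match_all()/DOMXPath extraction on $html...
    echo $url, "\n";
}
```

Each fetched chunk can then be parsed with exactly the regex or DOMXPath code already shown in the question.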

1 Answer


I think you need to crawl your web page in a more efficient way.

My first suggestion is to use PhantomJS as a full web engine on the command line. That means you can run PhantomJS operations (written in JavaScript) that load web pages, trigger DOM events, and print the data you need, and invoke those scripts from PHP with the exec command.

PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

// Simple JavaScript example (run with: phantomjs script.js)

console.log('Loading a web page');
var page = require('webpage').create();
var url = 'http://phantomjs.org/';
page.open(url, function (status) {
  if (status !== 'success') {
    console.log('Failed to load ' + url);
    phantom.exit(1);
  }
  // Do your DOM operations here (click the load-more button or anything
  // else), then console.log() the data you need so PHP can read it.
  phantom.exit();
});
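Invoking such a script from PHP can be sketched as follows — this assumes the `phantomjs` binary is installed and on the PATH, and that `scrape.js` is a hypothetical script like the one above, adapted to take the target URL as an argument and print one record per line:

```php
<?php
// Sketch: build and run a PhantomJS command from PHP, capturing stdout.
// `scrape.js` is an assumed filename; escapeshellarg() keeps the script
// path and URL safe to pass on the shell command line.
function buildPhantomCommand(string $script, string $url): string {
    return 'phantomjs ' . escapeshellarg($script) . ' ' . escapeshellarg($url);
}

$cmd = buildPhantomCommand('scrape.js',
    'https://www.healthfrog.in/chemists/medical-store/gujarat/surat');
// exec($cmd, $output, $status); // $output: array of console.log() lines
echo $cmd, "\n";
```

Each line PhantomJS prints with console.log() lands in the `$output` array, ready for further processing in PHP.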

To get the data back into PHP, you need a PHP driver for PhantomJS.

Here is an example PHP client for PhantomJS: https://github.com/jonnnnyw/php-phantomjs

Actually, I have a PHP driver for PhantomJS that I developed as a side project, and I'm planning to publish it on my GitHub account in the next few days.

The second way (frankly, in my opinion the right way for complex projects) that I'm suggesting is to use a scraping framework like Scrapy. You can take a look at the documentation to see how to scrape data from web pages with Scrapy.

Scrapy is a powerful Python framework for extracting the data you need from websites.

You can take a look at this tutorial for using Scrapy: https://docs.scrapy.org/en/latest/intro/tutorial.html