
I am trying to scrape comments that are loaded through an iframe from another domain. When I try, I get either no content or a message saying the application is not registered. I understand that this is due to cross-domain issues. I have written the following code in PHP using cURL. When I pass the parent URL, the page loads but the content inside the iframes is missing, and when I pass the child URL, it returns a message saying "application not registered".

Code:

<?php

// 1. initialize

$ch = curl_init();

// 2. The URL containing the iframe

$url = "http://www.ndtv.com/india-news/1993-mumbai-blasts-convict-yakub-memons-final-mercy-plea-rejected-783656?pfrom=home-lateststories";

// 3. set the options, including the url

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// 4. execute and fetch the resulting HTML output by putting into $output
$output = curl_exec($ch);

// 5. free up the curl handle  
curl_close($ch);

// 6. Check whether the response contains at least one <p> tag
if (preg_match("~</?p[^>]*>~", $output, $match)) {
    // 7. Display the fetched HTML
    echo $output;
}
?>

The child URL for the iframe is:

http://social.ndtv.com/static/Comment/Widget/?&key=68a2a311a51a713dad2e777d65ec4db4&link=http%3A%2F%2Fwww.ndtv.com%2Findia-news%2F1993-mumbai-blasts-convict-yakub-memons-final-mercy-plea-rejected-783656&title=Yakub+Memon+to+Hang+On+July+30+for+India%27s+Deadliest+Terror+Attack&ctype=story-news&identifier=story-news-783656&enableCommentsSubscription=1&ver=1&reply=1&sorted_by=likes

Is there any way I can access the iframe content? I want this data for analysis, not for any illegal use.

Thanks for the help in advance.

user3818862
  • If the comments are being loaded dynamically using JavaScript, then cURL or PHP won't be able to magically load them. You'll need to use something like [PhantomJS](http://phantomjs.org/) to emulate a browser loading the page, then extract the results from it. – Mr. Llama Jul 21 '15 at 17:08
  • That's not totally the case here. You can get the first 20 comments; after that, yes, you can't just use cURL. – PHPhil Jul 21 '15 at 17:38
  • @PHPhil Thanks for replying. Can you help me get the first 20 comments by modifying my code? That would be a great temporary solution. – user3818862 Jul 21 '15 at 18:58
  • @Mr.Llama If I use PhantomJS as suggested, will I be able to navigate into the child iframe, or might I be denied due to cross-domain issues? – user3818862 Jul 21 '15 at 18:59

2 Answers


You need to actually parse the HTML. Regular expressions are not meant for HTML.

See: RegEx match open tags except XHTML self-contained tags
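
As a rough sketch of what parsing (rather than regex matching) could look like, the following uses PHP's built-in `DOMDocument` and `DOMXPath` classes to pull the text of every `<p>` element out of an HTML string. The function name `extractParagraphs` and the sample markup are illustrative assumptions, not part of the question's code; in practice the string would be the `$output` returned by `curl_exec()`.

```php
<?php
// Hypothetical helper: returns the trimmed text content of every <p> element.
function extractParagraphs(string $html): array
{
    // An empty string cannot be parsed, so there is nothing to extract.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $paragraphs = [];
    foreach ($xpath->query('//p') as $node) {
        $paragraphs[] = trim($node->textContent);
    }
    return $paragraphs;
}

// Made-up sample markup standing in for the HTML fetched with cURL.
$sample = '<html><body><p>First comment</p><div><p>Second <b>comment</b></p></div></body></html>';
print_r(extractParagraphs($sample));
```

This only helps once the HTML that actually contains the comments has been fetched; it does not by itself solve the iframe problem.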

Rob W
  • That's not the issue here. I am unable to navigate into the iframe due to cross-domain issues. Any suggestions? – user3818862 Jul 21 '15 at 19:02
  • Ah. Misunderstood. What if you curl the iframe URL? – Rob W Jul 21 '15 at 22:11
  • Sorry for that. When I curl the iframe URL it says "application not registered"; that's because the iframe is located on another domain. – user3818862 Jul 22 '15 at 01:57
  • I'm able to access the URL. I don't see why a cURL request couldn't. – Rob W Jul 22 '15 at 02:05
  • Hi @Half Crazed, thanks for trying. That's what I meant: you cannot access it because of security checks on the child domain, i.e. the iframe URL's domain, which is not the same as the domain of the parent URL (main page). Is there any hack by which I can bypass this security authentication? – user3818862 Jul 22 '15 at 05:15
  • You could forge the headers, but that's a little immoral. Maybe there's an API you could use instead? – Rob W Jul 22 '15 at 11:07

If you want the discussion comments, then you need to fetch the comment section's iframe URL, not the page that contains the iframe. cURL simply returns the source code at the URL you give it; it doesn't recursively follow iframe links and embed their content as well.
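
To illustrate the point, here is a minimal sketch that lists the `src` URLs of any `<iframe>` elements found in a fetched page, so that each one could then be requested separately. The helper name `extractIframeUrls` and the `comments.example.com` sample URL are made up for the example. It assumes the iframe tag is present in the static HTML; if the page injects the iframe with JavaScript, the list will be empty, and whether the remote server answers a direct request is up to that server.

```php
<?php
// Hypothetical helper: collects the src attribute of every <iframe> in the HTML.
function extractIframeUrls(string $html): array
{
    // An empty string cannot be parsed, so there are no iframes to report.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        $src = $iframe->getAttribute('src');
        if ($src !== '') {
            $urls[] = $src;
        }
    }
    return $urls;
}

// Made-up sample markup standing in for the parent page fetched with cURL.
$sample = '<html><body><p>Article text</p>'
        . '<iframe src="http://comments.example.com/widget?id=1"></iframe>'
        . '</body></html>';

foreach (extractIframeUrls($sample) as $iframeUrl) {
    // Each URL printed here would need its own cURL request.
    echo $iframeUrl, PHP_EOL;
}
```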

Mr. Llama
  • Llama, I did try passing the iframe URL, but it returns a message saying the application is not registered. Please help. – user3818862 Jul 21 '15 at 18:52