I am trying to learn how to build a Facebook group crawler that gets information from a group: a list of the group's posts, with who wrote each post, the post ID, the post date, etc.

It's important for me to state that I am at the very beginning of my research into page crawling!

I found a nice tutorial on this page: http://www.oooff.com/php-scripts/basic-curl-scraping-php/basic-scraping-with-curl.php

When running this code:

<?php
    $url = "http://www.oooff.com/";

    // Initialize a cURL session; $ch is the handle the following calls operate on.
    $ch = curl_init($url);
    // curl_setopt() sets an option on the handle. CURLOPT_RETURNTRANSFER makes
    // curl_exec() return the response as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Run the request with the options set above and store the scraped data.
    $curl_scraped_page = curl_exec($ch);
    curl_close($ch);

    echo $curl_scraped_page;
?>

It works, but sometimes when I run it I get a blank page.
When I run it on Facebook (or, more specifically, on a FB group, because that's what I need) I get a blank page. I tried running it on yahoo.com and got the same result.

  • Why is that happening?
  • What is the right way to get a page content?
Imnotapotato
  • Pages aren't just HTML anymore. They're JavaScript. They're CSS. They're content from multiple different domains/sites. And since you're just starting out, here's a very, very important tip: NEVER assume success when dealing with external resources. ALWAYS assume failure, check for failure, and treat success as a pleasant surprise. In your case `if ($curl_scraped_page === false) { die(curl_error($ch)); }` – Marc B Nov 06 '14 at 14:14
  • If it's not boolean false, then curl succeeded but didn't return anything. A `var_dump($curl_scraped_page)` will show what's in it: possibly an empty string, or something non-printable. – Marc B Nov 06 '14 at 14:25
  • Debugging this is a good idea. I tried using the error function before without any success; now it worked. I ran this script on a FB group and got this message: "SSL certificate problem: unable to get local issuer certificate" – Imnotapotato Nov 06 '14 at 14:26
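
A minimal sketch of the failure check suggested in the comments above, written as a small helper. The function name `fetchPage`, its return shape, and the `$caBundle` parameter are hypothetical, not part of the original script. The error text has to be read with `curl_error()` while the handle is still open, which is why the check comes right after `curl_exec()`. The "unable to get local issuer certificate" message usually means cURL could not find a CA certificate bundle to verify the HTTPS site against; pointing `CURLOPT_CAINFO` at such a bundle is one common fix, assuming you have downloaded one. Following redirects is included because a redirect response with an empty body is another possible cause of a blank page.

```php
<?php
// Hypothetical helper: fetch a URL and report failure explicitly
// instead of silently echoing an empty result.
function fetchPage(string $url, ?string $caBundle = null): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects

    if ($caBundle !== null) {
        // Assumption: $caBundle is the path to a CA bundle file you supply yourself.
        curl_setopt($ch, CURLOPT_CAINFO, $caBundle);
    }

    $body = curl_exec($ch);

    if ($body === false) {
        // cURL itself failed (DNS, SSL, unsupported protocol, ...).
        return ['body' => null, 'error' => curl_error($ch), 'status' => 0];
    }

    // cURL succeeded; the body may still be an empty string.
    return [
        'body'   => $body,
        'error'  => '',
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    ];
}
```

It could be called as `$result = fetchPage($url, '/path/to/cacert.pem');` (the path is a placeholder), followed by `var_dump($result);` to see whether the request failed, returned an empty string, or returned an unexpected HTTP status.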

1 Answer


If you are mainly interested in Facebook content, you might use the Facebook API for PHP: https://developers.facebook.com/docs/reference/php/

cURL only loads the content of the file at the URL; it does not run the web page's JavaScript.
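
To illustrate the point, here is a small sketch that needs no network access. The HTML string stands in for what cURL would return from a page that fills its content in with JavaScript; the markup and the helper name `textOfElement` are made up for this example. Parsing it with PHP's `DOMDocument` shows that the script is just text in the response and the element it would populate is still empty.

```php
<?php
// Hypothetical helper: return the text inside the element with the given id,
// or null if no such element exists in the markup.
function textOfElement(string $html, string $elementId): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(sprintf('//*[@id="%s"]', $elementId));

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Made-up page: the "posts" container is filled in by a script in the browser.
$html = '<html><body><div id="posts"></div>'
      . '<script>document.getElementById("posts").textContent = "Hello";</script>'
      . '</body></html>';

// The script was never executed, so the container holds no text.
var_dump(textOfElement($html, 'posts'));
```

A browser (or a headless one) would show "Hello" in that container, while the markup fetched by cURL only contains the empty `div` and the unexecuted script.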

According to Vivin Paliath's answer, PhantomJS might be a good solution for getting content from a JavaScript-driven web page:

[...] PhantomJS is a headless WebKit browser. It has its own API that lets you "script" behavior. So you can tell PhantomJS to load the page and dump out the data you need.

Community
Henrik
  • I know about this option, but I don't know how to use the Graph API. Should I run the script on my server? I can't work that out from the FB documentation. Either way, it seems very restrictive for some reason (am I right?). – Imnotapotato Nov 06 '14 at 14:33
  • @Hatul, I suggest looking for tutorials: http://bit.ly/1x6NzVq . For sure, in your case using the Facebook API is the best way to get the info that you need. – ivstas Nov 06 '14 at 14:54
  • The thing is that I will also want to crawl other pages that are not Facebook and don't have the Graph API, so either way I have to learn how to do this and become a real spiderman ;) Thanks for the information, it interests me a lot and I will take a look at it later. +1 for you :) ! – Imnotapotato Nov 06 '14 at 14:57