
For example, I want to scrape https://stackoverflow.com/privileges/user/3 and get the value inside <div class="summarycount al">6,525</div>, so that I can store the reputation in a local DB along with the user number. I think I can use file_get_contents:

 $data = file_get_contents('https://stackoverflow.com/privileges/user/3');

How do I extract the required data, i.e., 6,525 in the above example?

abel

1 Answer

  1. You'll need to log in (through PHP) to see the relevant information. This isn't straightforward and will require some work.

  2. You can use *shrugs* regex to parse the data, or use an HTML parser like PHP Simple HTML DOM Parser. With regex:

    // $contents holds the page HTML, e.g. the result of file_get_contents()
    if (preg_match('!<div class="summarycount al">(.+?)</div>!', $contents, $matches)) {
        $rep = $matches[1];
    }
    
  3. If you are scraping SO, you can use the SO API instead.

Code:

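$url = 'http://api.stackoverflow.com/1.0/users/3';

$tuCurl = curl_init();
curl_setopt($tuCurl, CURLOPT_URL, $url);
curl_setopt($tuCurl, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
curl_setopt($tuCurl, CURLOPT_ENCODING, 'gzip');  // accept gzip and let cURL decompress it

$data = curl_exec($tuCurl);
curl_close($tuCurl);

// Decode the JSON into an associative array and read the first user's reputation.
$parse = json_decode($data, true);
$rep = $parse['users'][0]['reputation'];

echo $rep;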
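
For the parser route mentioned in point 2, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath rather than the third-party Simple HTML DOM Parser. The function name `extractReputation` and the sample HTML string are assumptions made for illustration; in practice the input would be the HTML of the fetched page.

```php
<?php
// Hypothetical helper: returns the text of the first <div> whose class list
// includes "summarycount", or null if no such element is found.
function extractReputation(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely well-formed, so collect libxml warnings
    // internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//div[contains(concat(" ", normalize-space(@class), " "), " summarycount ")]'
    );
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Sample input mirroring the markup from the question.
$sample = '<html><body><div class="summarycount al">6,525</div></body></html>';
$rep = extractReputation($sample);

echo $rep, "\n";                              // the text as it appears on the page
echo (int) str_replace(',', '', $rep), "\n";  // as an integer, ready for a DB column
```

Matching on the class token instead of the exact attribute string should keep this working if the page adds or reorders classes, which the regex above would not tolerate.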
999999
  • Thanks for the attempt. I am really bad at regex; I will go through it. The current page does not need a login, so no worries. And this was a generic question with SO as an example. The code works! Thanks – abel Oct 07 '10 at 16:36
  • Time taken: 2.11 seconds. At that rate, getting 10,000 users will take about 5.9 hrs. Can I complete the entire thing in one script without timeouts? – abel Oct 07 '10 at 16:42
  • @abel Yes, you can change the `max_execution_time` setting. I would strongly recommend using the SO API though, or downloading a [data-dump](http://blog.stackoverflow.com/2010/10/creative-commons-data-dump-oct-10/) and getting info from there. – 999999 Oct 07 '10 at 16:46
  • This isn't about SO per se. I have played with the execution time setting. Can I get burstable output? More here: http://stackoverflow.com/questions/3884008/burstable-output-to-long-running-scripts – abel Oct 08 '10 at 08:59
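
The timeout and incremental-output points raised in the comments can be sketched roughly as follows. This is an illustration, not something posted in the thread: `collectReputations` and the stub fetcher are hypothetical names, and the sketch assumes `set_time_limit()` is not disabled on the host and that output buffering is off (otherwise `ob_flush()` would also be needed before `flush()`).

```php
<?php
// Hypothetical batch runner: $fetch is any callable that maps a user id to a
// reputation value, for example a wrapper around the cURL code in the answer.
function collectReputations(array $userIds, callable $fetch): array
{
    set_time_limit(0); // lift max_execution_time for this script run

    $results = [];
    foreach ($userIds as $id) {
        $results[$id] = $fetch($id);
        echo "user $id: {$results[$id]}\n";
        flush(); // push output to the client as each user completes
    }
    return $results;
}

// Stub fetcher standing in for a real HTTP request.
$stub = function (int $id): int {
    return $id * 100;
};

$all = collectReputations([1, 2, 3], $stub);
```

Passing the fetcher in as a callable keeps the loop independent of whether the data comes from scraping, the API, or a data dump.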