
Possible Duplicate:
How do I write a web scraper in Ruby?

I need to scrape the source code of many websites that are listed in my app's database. I'm checking to see if they're linking back to my site.

Is it possible to do this using Ruby on Rails, or should I use PHP?

Ian Mason
  • there are easier ways of finding who is linking to you – Dagon Aug 21 '12 at 03:09
  • @Dagon: Perhaps you could elaborate about these other ways. Maybe write an answer? – icktoofay Aug 21 '12 at 03:11
  • Dagon, I'm not trying to find out who is linking to me - there are tons of APIs out there for that. I already know who I bought links from, but I need to know if they delete my link. I'd like my app to alert me. – Ian Mason Aug 21 '12 at 03:40

5 Answers


You could just grab the list of websites and run curl against each of them.

Edit: Alternatively, you could try this awesome library, Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net):

<?php

require 'simple_html_dom.php';

// The constant name must be quoted when defined.
define('MYWEBSITE', 'google.com');
$html = file_get_html('http://www.google.com/');

foreach ($html->find('a') as $link) {
  $url = $link->href;
  // strpos() returns 0 for a match at the start of the string,
  // so compare against false rather than negating the result.
  if (strpos($url, MYWEBSITE) !== false) {
    // Do whatever you need to do here; we'll simply echo out
    // the link URL that contains your site.
    echo $url . " contains " . MYWEBSITE . "\n";
  }
}

?>

Just a simple hack, but it does the job.

Andrew G.
  • Can I check the source of those urls for the existence of a string like "mywebsite.com" using curl? – Ian Mason Aug 21 '12 at 04:05
  • `curl` just grabs the source code, you still need to parse it. There's a lib that can do both. Check out my edit. – Andrew G. Aug 21 '12 at 04:59

It is really simple to scrape with Ruby. There are lots of libraries for it, but I have found that the best all-around choice is mechanize (which uses nokogiri for parsing). It is smart about cookies, can easily manipulate forms, and has an easy-to-use, flexible API.

Also, if you don't want to use CSS selectors and whatnot, you can download the page and parse the data yourself (as in, look for certain substrings).
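
For the asker's use case, a minimal mechanize sketch might look like this (the URL and the 'mywebsite.com' string are placeholders for illustration, not anything from the question):

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.example.com/') # one of the sites from your database

# Scan every anchor on the page for a link back to your site.
if page.links.any? { |link| link.href.to_s.include?('mywebsite.com') }
  puts 'still linked'
else
  puts 'link removed'
end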

I've used both PHP and Ruby extensively, and personally I prefer Ruby because it is much more elegant to code in and your code is typically shorter. That said, PHP may be slightly easier for someone with limited programming experience.

Matt Wolfe
  • Thank you very much Matt. I and someone from oDesk are working together on something using Mechanize :) – Ian Mason Aug 21 '12 at 09:39

I have used both Ruby and PHP to scrape sites.

One thing that I really like about Ruby is that you can easily make your scraping multi-threaded. This way, you run your script and scrape 10-100 websites simultaneously (making PHP multithreaded is a big pain).

I found a lot of great tools for scraping in Ruby; PHP has others.

My vote is Ruby. Thanks to the ease of threading, you can populate your database and find issues with your code quickly, instead of having to wait ages with PHP.
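
As a rough sketch of that threading point, using Ruby's standard net/http (the site list and target string here are made-up placeholders):

require 'net/http'
require 'uri'

target = 'mywebsite.com'
sites = ['http://www.example.com/', 'http://www.example.org/'] # from your database

threads = sites.map do |site|
  Thread.new do
    begin
      body = Net::HTTP.get(URI(site)) # fetch the page source
      puts "#{site} still links back" if body.include?(target)
    rescue StandardError => e
      puts "#{site} failed: #{e.message}"
    end
  end
end
threads.each(&:join) # wait for every fetch to finish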

John Ballinger

Ruby on Rails is a framework for building web applications, not scraping them. PHP is a language typically used to build web sites/applications.

There are web scraping libraries for both; Google will tell you what they are.

This looks like a decent step-by-step post about scraping using Ruby: http://www.andrewsturges.com/2011/09/how-to-harvest-web-data-using-ruby-and.html
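
For the plain substring check the asker describes (no CSS selectors needed), a bare-bones sketch using Ruby's standard open-uri would do, with a placeholder URL and target string:

require 'open-uri'

html = URI.open('http://www.example.com/').read # page source as one string
puts 'link found' if html.include?('mywebsite.com')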

Nick Veys
  • Agreed - I did find Nokogiri just now, but honestly I'm still not sure if I should just use PHP instead. Then some people swear by using jQuery to scrape. Do you have any opinion on that? – Ian Mason Aug 21 '12 at 03:41
  • Everything I find on Nokogiri is about scraping using CSS selectors. What I'm trying to do is much simpler, yet nobody covers it... of course :/ – Ian Mason Aug 21 '12 at 04:11
  • If you're not scraping based on selectors, you're probably what, using regular expressions? In which case the language you use to do it is completely up to you, I'd pick whatever you're best at. – Nick Veys Aug 22 '12 at 20:47

PHP would make this really easy, as its cURL usage is very straightforward: http://www.php.net/manual/en/function.curl-exec.php

And there are some libraries supporting advanced usage already:
http://simplehtmldom.sourceforge.net/
http://electrokami.com/coding/simple-html-dom-baked-cakephp-component/

<?php
$mySite = "http://www.mysite.com";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// grab URL and save data into variable
$response = curl_exec($ch);

// case-insensitive search for your URL anywhere in the page source
if (stripos($response, $mySite) !== false) {
    echo "site still linked";
}

// close cURL resource
curl_close($ch);
?>
Cameron