
I would like to crawl a popular site (say Quora) that doesn't have an API, pull some specific information out of it, and dump that into a file: say a CSV, a .txt, or a nicely formatted .html :)

E.g. return a list of the bios of all Quora users whose publicly available information lists the occupation 'UX designer'.

How would I do that in Ruby?

I have a moderate understanding of how Ruby & Rails work. I just completed a Rails app, written mostly by myself, but I am no guru by any stretch of the imagination.

I understand regexes, etc.

marcamillion
    Protip: Don't use regexes to parse HTML - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Andrew Grimm May 10 '11 at 23:09
  • @AndrewGrimm When you say this, you mean I should use something like Nokogiri, right? I have started using Nokogiri but I am also using RegExes in some of the matches of tags and content on the pages. Is it safe to assume that is not what you meant? – marcamillion May 23 '12 at 22:24
  • Regexes are a bad idea for dealing with HTML. Stick with Nokogiri to do your work for you. – Andy Lester Aug 17 '13 at 23:29
  • There's a library called Mechanize that sits above Nokogiri and offers most of what you need. – arnt Apr 20 '18 at 20:19

5 Answers


Your best bet would be to use Mechanize. It can follow links, submit forms, anything you will need, web-client-wise. By the way, don't use regexes to parse HTML. Use an HTML parser.
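
For a concrete starting point, here is a minimal sketch using Mechanize plus Ruby's standard CSV library. The URL and the .bio selector are placeholders (Quora's real markup will differ), so treat this as a shape to adapt, not a drop-in script:

require 'mechanize'
require 'csv'

agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'

# Hypothetical listing page; substitute the real URL and selector
page = agent.get('https://example.com/topic/ux-designer')

# Mechanize pages expose Nokogiri, so CSS selectors work directly
bios = page.search('.bio').map { |node| node.text.strip }

# Dump the results into a CSV file, as the question asks
CSV.open('bios.csv', 'w') do |csv|
  csv << ['bio']
  bios.each { |bio| csv << [bio] }
end

# To paginate, follow a link by its visible text:
# next_page = page.link_with(text: 'Next').click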

Geo
  • Btw, is there anything I should know about this before I do it? E.g., is it easy for me to create something simple that will just scrape the entire site, hog lots of bandwidth and get me IP banned? Or would I have to engineer it to do that? – marcamillion May 10 '11 at 08:33
  • You can implement your crawler any way you see fit. Since Mechanize allows you to automate almost anything that your browser can do, you have full freedom. – Geo May 10 '11 at 08:57

If you want something more high-level, try Wombat, a gem I built on top of Mechanize and Nokogiri. It can parse pages and follow links using a really simple, high-level DSL.
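
As a rough illustration of the DSL (the site, path and selectors below are invented for the example; check the gem's README for the exact, current syntax):

require 'wombat'

results = Wombat.crawl do
  base_url 'https://example.com'   # placeholder site
  path '/designers'                # placeholder path

  page_title 'xpath=//title'
  bios 'css=.bio', :list           # :list gathers every match into an array
end

# results comes back as a plain Hash, e.g.
# { "page_title" => "...", "bios" => ["...", "..."] }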

Felipe Lima

I know the answer has been accepted, but Hpricot is also very popular for parsing HTML.

All you have to do is take a look at the HTML source of the pages, find an XPath or CSS expression that matches the desired elements, and then use something like:

doc.search("//p[@class='posted']")
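
Putting that line in context, a small sketch (the URL is a placeholder, and note that Hpricot is unmaintained nowadays, so Nokogiri is the safer long-term choice):

require 'hpricot'
require 'open-uri'

# open-uri lets Kernel#open fetch a URL (placeholder address)
doc = Hpricot(open('https://example.com/blog'))

# doc / "..." is shorthand for doc.search("...")
(doc / "//p[@class='posted']").each do |p|
  puts p.inner_text
end
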
Filipe Miguel Fonseca

Mechanize is awesome. If you're looking to learn something new though, you could take a look at Scrubyt: https://github.com/scrubber/scrubyt. It looks like Mechanize + Hpricot. I've never used it, but it seems interesting.

RubeOnRails

Nokogiri is great, but I find the output messy to work with. I wrote a Ruby gem to easily create classes off HTML: https://github.com/jassa/hyper_api

The HyperAPI gem uses Nokogiri to parse HTML with CSS selectors.

E.g.

Post = HyperAPI.new_class do
  string title: 'div#title'
  string body: 'div#body'
  string author: '#details .author'
  integer comments_count: '#extra .comment' do
    size
  end
end
# => Post

post = Post.new(html_string)
# => #<Post title: 'Hi there!', body: 'This blog post will talk about...', author: 'Bob', comments_count: 74>
jassa