
I would like to crawl a popular site (say Quora) that doesn't have an API, pull some specific information out of it, and dump that into a file: say a CSV, a .txt, or a nicely formatted .html :)

E.g. return a list of the bios of all Quora users whose publicly available information lists the occupation 'UX designer'.

How would I do that in Ruby?

I have a moderate understanding of how Ruby & Rails work. I just completed a Rails app, written mostly by myself, but I am no guru by any stretch of the imagination.

I understand regexes, etc.

marcamillion
    Protip: Don't use regexes to parse HTML - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Andrew Grimm May 10 '11 at 23:09
  • @AndrewGrimm When you say this, you mean I should use something like Nokogiri, right? I have started using Nokogiri but I am also using RegExes in some of the matches of tags and content on the pages. Is it safe to assume that is not what you meant? – marcamillion May 23 '12 at 22:24
  • Regexes are a bad idea for dealing with HTML. Stick with Nokogiri to do your work for you. – Andy Lester Aug 17 '13 at 23:29
  • There's a library called Mechanize that sits above Nokogiri and offers most of what you need. – arnt Apr 20 '18 at 20:19

5 Answers


Your best bet would be to use Mechanize. It can follow links, submit forms, anything you will need, web-client-wise. By the way, don't use regexes to parse HTML. Use an HTML parser.
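
For a concrete starting point, here is a minimal sketch using Mechanize plus Ruby's standard CSV library. The URL and the .bio selector are placeholders (Quora's real markup will differ), so treat this as a shape to adapt, not a drop-in script:

require 'mechanize'
require 'csv'

agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'

# Hypothetical listing page; substitute the real URL and selector
page = agent.get('https://example.com/topic/ux-designer')

# Mechanize pages expose Nokogiri, so CSS selectors work directly
bios = page.search('.bio').map { |node| node.text.strip }

# Dump the results into a CSV file, as the question asks
CSV.open('bios.csv', 'w') do |csv|
  csv << ['bio']
  bios.each { |bio| csv << [bio] }
end

# To paginate, follow a link by its visible text:
# next_page = page.link_with(text: 'Next').click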

Geo
  • Btw, is there anything I should know about this before I do it? E.g., is it easy for me to create something simple that will just scrape the entire site, hog lots of bandwidth and get me IP banned? Or would I have to engineer it to do that? – marcamillion May 10 '11 at 08:33
  • You can implement your crawler any way you see fit. Since Mechanize allows you to automate almost anything that your browser can do, you have full freedom. – Geo May 10 '11 at 08:57

If you want something more high-level, try Wombat, a gem I built on top of Mechanize and Nokogiri. It can parse pages and follow links using a really simple, high-level DSL.
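
As a rough illustration of the DSL (the site, path and selectors below are invented for the example; check the gem's README for the exact, current syntax):

require 'wombat'

results = Wombat.crawl do
  base_url 'https://example.com'   # placeholder site
  path '/designers'                # placeholder path

  page_title 'xpath=//title'
  bios 'css=.bio', :list           # :list gathers every match into an array
end

# results comes back as a plain Hash, e.g.
# { "page_title" => "...", "bios" => ["...", "..."] }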

Felipe Lima

I know the answer has been accepted, but Hpricot is also very popular for parsing HTML.

All you have to do is take a look at the HTML source of the pages, find an XPath or CSS expression that matches the desired elements, and then use something like:

doc.search("//p[@class='posted']")
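
Putting that line in context, a small sketch (the URL is a placeholder, and note that Hpricot is unmaintained nowadays, so Nokogiri is the safer long-term choice):

require 'hpricot'
require 'open-uri'

# open-uri lets Kernel#open fetch a URL (placeholder address)
doc = Hpricot(open('https://example.com/blog'))

# doc / "..." is shorthand for doc.search("...")
(doc / "//p[@class='posted']").each do |p|
  puts p.inner_text
end
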
Filipe Miguel Fonseca

Mechanize is awesome. If you're looking to learn something new though, you could take a look at Scrubyt: https://github.com/scrubber/scrubyt. It looks like Mechanize + Hpricot. I've never used it, but it seems interesting.

RubeOnRails

Nokogiri is great, but I find the output messy to work with. I wrote a Ruby gem to easily create classes off HTML: https://github.com/jassa/hyper_api

The HyperAPI gem uses Nokogiri to parse HTML with CSS selectors.

E.g.

Post = HyperAPI.new_class do
  string title: 'div#title'
  string body: 'div#body'
  string author: '#details .author'
  integer comments_count: '#extra .comment' do
    size
  end
end
# => Post

post = Post.new(html_string)
# => #<Post title: 'Hi there!', body: 'This blog post will talk about...', author: 'Bob', comments_count: 74>
jassa