I'm currently building a website using Ruby on Rails (Ruby 2.2.1, Rails 4.2.1) and would like to extract data from a specific website and then display it. I use Nokogiri to get the content of a web page. What I'm looking for is to fetch all pages of that website and get their content.

Below is my code:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.google.com").read)
puts doc.at_css('title').text
puts doc.to_html
Anand S Kumar
  • The code you need is quite complex, and you wrote something like 1% of it. You basically need to go through all links on each page you fetch, filter out external links, and store an array of already-fetched pages to avoid duplicate calls. – Yury Lebedev Jul 28 '15 at 10:31
  • You should search Stack Overflow. There are many questions along this line. Here are some pointers: http://stackoverflow.com/a/4981595/128421 – the Tin Man Jul 28 '15 at 21:10

1 Answer

This is a very approximate gist of what you need:

require 'nokogiri'
require 'open-uri'

class Parser
  attr_accessor :pages

  def initialize
    @pages = []
  end

  def fetch_all(host)
    @host = host

    fetch(@host)
  end

  private

  def fetch(url)
    # Skip URLs that were already fetched to avoid duplicate calls and loops
    return if pages.any? { |page| page.url == url }
    parse_page(url, Nokogiri::HTML(open(url).read))
  end

  def parse_page(url, document)
    links = extract_links(document)

    pages << Page.new(
      url: url,
      title: document.at_css('title').text,
      html_content: document.to_html,
      links: links
    )

    # Recursively fetch every internal link found on this page
    links.each { |link| fetch(@host + link) }
  end

  def extract_links(document)
    # Keep only internal links (relative paths on the same host)
    document.css('a').map do |link|
      href = link['href'].to_s.gsub(@host, '')
      href if href.start_with?('/')
    end.compact.uniq
  end
end

class Page
  attr_accessor :url, :title, :html_content, :links

  def initialize(url:, title:, html_content:, links:)
    @url = url
    @title = title
    @html_content = html_content
    @links = links
  end
end
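
A minimal usage sketch, assuming the classes above; the http://example.com URL is just a placeholder for whatever site you want to crawl:

parser = Parser.new
parser.fetch_all('http://example.com')

parser.pages.each do |page|
  puts page.url
  puts page.title
end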
Yury Lebedev