I'm currently building a website using Ruby on Rails (Ruby 2.2.1, Rails 4.2.1) and would like to extract data from a specific website and then display it. I use Nokogiri to get the content of a web page. What I'm looking for is to fetch all pages of that website and get their content.

Below is my code:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.google.com").read)
puts doc.at_css('title').text
puts doc.to_html
Anand S Kumar
  • The code you need is quite complex, and you wrote something like 1% of it. You basically need to go through all links on each page you fetch, filter out external links, and store an array of already-fetched pages to avoid duplicate calls. – Yury Lebedev Jul 28 '15 at 10:31
  • You should search Stack Overflow. There are many questions along this line. Here are some pointers: http://stackoverflow.com/a/4981595/128421 – the Tin Man Jul 28 '15 at 21:10

1 Answer

This is a very approximate gist of what you need:

require 'nokogiri'
require 'open-uri'

class Parser
  attr_accessor :pages

  def initialize
    @pages = []
  end

  def fetch_all(host)
    @host = host

    fetch(@host)
  end

  private

  def fetch(url)
    # Skip URLs that were already fetched to avoid duplicate calls and loops
    return if pages.any? { |page| page.url == url }
    parse_page(url, Nokogiri::HTML(open(url).read))
  end

  def parse_page(url, document)
    links = extract_links(document)

    pages << Page.new(
      url: url,
      title: document.at_css('title').text,
      html_content: document.to_html,
      links: links
    )

    # Recursively fetch every internal link found on this page
    links.each { |link| fetch(@host + link) }
  end

  def extract_links(document)
    # Keep only internal links (relative paths on the same host)
    document.css('a').map do |link|
      href = link['href'].to_s.gsub(@host, '')
      href if href.start_with?('/')
    end.compact.uniq
  end
end

class Page
  attr_accessor :url, :title, :html_content, :links

  def initialize(url:, title:, html_content:, links:)
    @url = url
    @title = title
    @html_content = html_content
    @links = links
  end
end
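
A minimal usage sketch, assuming the classes above; the http://example.com URL is just a placeholder for whatever site you want to crawl:

parser = Parser.new
parser.fetch_all('http://example.com')

parser.pages.each do |page|
  puts page.url
  puts page.title
end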
Yury Lebedev