I'm trying to grab the full HTML of a webpage when saving a URL. Here is my model, where I've had a go at writing the method.

class Page < ActiveRecord::Base
  def processPages(page_url)
    open(page_url) do |uri|
      html = uri.read
      create!( html => page.html )
    end
  end
end

I'm trying to put the raw HTML held in the html variable into an attribute of my Page object, but I can't get my head around how to save the content.

I'm also struggling to call processPages from my controller's create action, which at the moment is basic scaffolding.

calabi
  • What does your model look like? – Justin Wood Mar 22 '14 at 17:43
  • Hello @JustinWood, thanks for the reply. The model is above Page with two properties :url and :html – calabi Mar 22 '14 at 17:56
  • I'm not clear on your question. Do you want to enter a page URL through a form and save that page's HTML in the database, or something else? Please explain. – neo-code Mar 22 '14 at 18:07
  • @Mayank Sorry I should have explained better, in my view I have scaffolding to enter a URL into the database but on save of the Page entry I'd like to run this method to grab the HTML of the page and put it in the html field in the database. But I'm not sure how to get this working. I think I might need to go back and get a better understanding on the models. – calabi Mar 22 '14 at 18:13

1 Answer

There are many ways to do that. I would do it using an after_save model callback, so the HTML fetching lives in the model and the controller stays clean.

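require 'open-uri'

class Page < ActiveRecord::Base
  after_save :process_pages

  def process_pages
    # URI.open needs Ruby 2.5+; older versions use Kernel#open(url) instead.
    fetched = URI.open(url).read
    # update_column skips validations and callbacks, so this after_save hook
    # is not triggered again (calling save here would re-run it endlessly).
    update_column(:html, fetched)
  end
end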

Since url and html are both Page attributes, there is no need to pass anything to the method, and from this SO question you can find more about HTML fetching.

Ah, and processPages really does not look like Ruby, so I changed it to process_pages instead.

Update:

If you need to parse the page contents, you can use Nokogiri; if you need to submit a form or something, you can use Mechanize. For simple HTML fetching, open-uri will do the job.

Nimir
  • @calabi this is a nice solution, go with it! – neo-code Mar 22 '14 at 18:18
  • Thank you very much! Really helpful and the explanation as well helps to understand. Will nokogiri make the request more efficient? – calabi Mar 22 '14 at 18:54
  • Glad it worked. `open-uri`, `Net::HTTP` and tens of other libraries will all do the job; you can read the discussion here (http://stackoverflow.com/questions/929652/equivalent-of-curl-for-ruby) .. at least (open-uri and Net::HTTP) are not recommended. Performance-wise, I am sure you will find a benchmark for each, but if your task is simple one-page fetching, I would say keep it simple – Nimir Mar 23 '14 at 06:22
  • @Nimir Sorry I should have tested this better yesterday but I'm getting an error undefined method `html' for # any ideas? – calabi Mar 23 '14 at 13:06
  • @calabi sorry, my bad. No `html` function needed and the http://nokogiri.org/Nokogiri/HTML.html class will parse the html content read by `open-uri`. Test and let me know if it works! – Nimir Mar 23 '14 at 14:40
  • @Nimir Ah that silences the error but doesn't save the output from nokogiri into the html attribute in my database? Do I need to call a save as well? Thanks for all your help on this! – calabi Mar 23 '14 at 14:46
  • @calabi oops, my bad again, you are right, you need to call save. It's a `before_save` callback that does not need to call save. I will edit my answer – Nimir Mar 23 '14 at 14:54
  • @Nimir I now get an error :( TypeError: can't cast Nokogiri::HTML::Document to text: UPDATE "pages" SET "html" = ?, "updated_at" = ? WHERE "pages"."id" = 1 – calabi Mar 23 '14 at 15:03