Possible Duplicate:
simple parsing in ruby

I am trying to verify the title of a website, and after some trial and error I have found that this can be done in Ruby using nokogiri and rest-client:

 require 'nokogiri'
 require 'rest-client'

 # user.username comes from the surrounding application
 page = Nokogiri::HTML(RestClient.get("http://#{user.username}.domain.com/"))
 simian = page.at_css("title").text
 if simian == "Welcome to"
   puts "default monkey"
 else
   puts "website updated"
 end

Unfortunately, for a large number of websites this doesn't always work: it raises RestClient::InternalServerError at /admin/users/list 500 Internal Server Error

I was wondering whether the same can be achieved by simply using mycurl = %x(curl http://........). What would be an efficient way to parse the title from that output without using any gem? Or can the curl output be passed directly to nokogiri? Thanks

devnull

2 Answers

After reading your question I wasn't sure whether you are set on those two gems or not, so here is another way that may prove simpler.

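require 'open-uri'

url = "http://google.com"
# On Ruby 3.0 and later, use URI.open(url) instead of open(url).
source = open(url).read
# Capture group 1 holds the text between the title tags.
source[/<title>(.*)<\/title>/, 1]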
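The question also asks whether curl output can be parsed without any gem. The following is only a rough sketch of that idea, not part of the original answer: `curl_fetch` and `extract_title` are hypothetical helper names, and it assumes a `curl` binary is available on the PATH. It shells out through `Open3` rather than `%x(...)` so the URL is never interpreted by a shell.

```ruby
require 'open3'

# Hypothetical helper: fetch a page body by running curl.
# -s silences progress output, -L follows redirects, and -f makes curl
# exit non-zero on HTTP errors (such as a 500) instead of returning the
# server's error page.
def curl_fetch(url)
  body, status = Open3.capture2('curl', '-sfL', url)
  status.success? ? body : nil
end

# Hypothetical helper: pull the title text out of an HTML string with a
# regex. Returns nil when there is no input or no title element.
def extract_title(html)
  return nil if html.nil?
  # scrub replaces invalid bytes so the regex match cannot raise on
  # pages served in an unexpected encoding.
  html.scrub[/<title[^>]*>(.*?)<\/title>/mi, 1]&.strip
end
```

Under these assumptions, `extract_title(curl_fetch("http://example.com/"))` would yield the title or `nil` for a page that could not be fetched. Since `Nokogiri::HTML` also accepts a String, the curl output could be handed to it in place of the regex.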
Rummy

There are two parts to this: fetching the page and parsing it. For fetching, you don't really need the rest-client gem when open-uri from the standard library will do. Nokogiri does the parsing, and it is unlikely to be your problem. Try this:

require 'open-uri'
require 'nokogiri'

# On Ruby 3.0 and later, use URI.open instead of open.
page = Nokogiri::HTML(open('http://example.com/'))
puts page.at('title').text
Mark Thomas
  • Hello, this works, many thanks. The only problem is that when I try to open a page that has no index and the server responds with an internal server error, the script dies with OpenURI::HTTPError at /admin/users/ 500 Internal Server Error /ruby-1.9.3-p125/lib/ruby/1.9.1/open-uri.rb: in open_http raise OpenURI::HTTPError.new(io.status.join(' '), io)... Any hint on how to skip these pages would be much appreciated! – devnull Sep 07 '12 at 21:08
  • This does not work: https://gist.github.com/3670329. URI cannot handle a simple _ in subdomains, so I have to use curl. – devnull Sep 07 '12 at 22:49
  • @devnull In your first comment, the server is responding with the error; it's not coming from the open-uri code. As for your second comment, `_` is an invalid character in a domain name. – Mark Thomas Sep 08 '12 at 13:17
  • Re-reading your first comment, I guess you want to gracefully skip when open-uri throws an exception? Put the code in a `rescue` block. – Mark Thomas Sep 09 '12 at 21:17
  • _ is not an invalid character in a subdomain like me_me.1.ai, and open-uri doesn't work with it even though it is a completely legitimate subdomain – devnull Sep 09 '12 at 21:35
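The `rescue` suggestion in the comments could look roughly like the sketch below. It is an illustration under assumptions rather than the answerer's own code: `fetch_title` and its `fetcher:` parameter are hypothetical names, `URI.open` assumes Ruby 2.5 or later (older versions would call `open` as in the answer), and the title is pulled out with a regex to keep the example to the standard library, where Nokogiri's `at('title')` could be used instead.

```ruby
require 'open-uri'

# Hypothetical helper: return the page title, or nil when the page
# cannot be fetched. The fetcher is injectable so the error handling
# can be exercised without touching the network.
def fetch_title(url, fetcher: ->(u) { URI.open(u).read })
  html = fetcher.call(url)
  html[/<title[^>]*>(.*?)<\/title>/mi, 1]&.strip
rescue OpenURI::HTTPError, SocketError, Errno::ECONNREFUSED => e
  # A 500 from the server surfaces as OpenURI::HTTPError; log it and
  # return nil so the caller can skip this page and move on.
  warn "skipping #{url}: #{e.message}"
  nil
end
```

A loop over many sites could then call `fetch_title(url)` for each one and simply skip the entries that come back `nil`, instead of dying on the first 500 response.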