I am building a script to parse the titles of multiple pages. Thanks to another question on Stack Overflow, I now have this working bit:
curl = %x(curl http://odin.1.ai)
simian = curl.match(/<title>(.*)<\/title>/)[1]
puts simian
But if you try the same on a page that has no title, for example:
curl = %x(curl http://zales.1.ai)
it dies with an undefined method error for nil:NilClass, because match returns nil when the page has no title. I can't simply check whether curl is nil, because it is not nil in this case (it contains another line).
Do you have any solution that keeps this working even when the title is not present, so the script can move on to the next page? I would appreciate it if we stick to this code. I did try other solutions with nokogiri and uri (Nokogiri::HTML(open("http:/.....") but they do not work either, because subdomains like byname_meee.1.ai do not work with the default open-uri. So I would be thankful if we can stick to this code that uses curl.
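One minimal way to avoid the nil error while keeping curl, sketched under assumptions: String#[] called with a regexp and a capture index returns nil instead of raising when nothing matches. The helper name extract_title and the slightly looser pattern (non-greedy, case-insensitive, spanning newlines) are illustrative choices, not part of the code above, and canned strings stand in for the curl output.

```ruby
# Hypothetical helper: returns the page title, or nil when there is none.
def extract_title(html)
  # String#[] with (regexp, capture) returns nil on no match, so there is
  # no NoMethodError to rescue.
  title = html[/<title[^>]*>(.*?)<\/title>/im, 1]
  title && title.strip
end

# Canned strings stand in for the output of %x(curl ...).
pages = {
  "with title"    => "<html><head><title> Odin </title></head></html>",
  "without title" => "<html><body>just another line</body></html>"
}

pages.each do |name, html|
  title = extract_title(html)
  next if title.nil? # no title: move on to the next page
  puts "#{name}: #{title}"
end
```

The `next if title.nil?` line is where a real loop over 300-400 URLs would skip a page without a title.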
UPDATE
I realize that I probably left out some specific cases that ought to be clarified. This is for parsing 300-400 pages. In the first run I noticed at least two cases where nokogiri, hpricot, and even the more basic open-uri do not work:
1) open-uri simply fails on a domain containing an underscore, like http://levant_alejandro.1.ai. This is a valid domain and works with curl, but not with open-uri or with nokogiri using open-uri.
2) A page that has no title, like http://zales.1.ai
3) A page with an image and no valid HTML, like http://voldemortas.1.ai/
4) A page that has nothing but an internal server error or a passenger/rack error.
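For the error-page case, one possible approach (a sketch only, not tested against these hosts) is to have curl append the HTTP status code to its output with its -w '%{http_code}' option and split it off before looking for a title. The names fetch_with_status and split_body_and_status are hypothetical, and the example runs on canned output rather than a real request.

```ruby
require "open3"

# Hypothetical helper: runs curl without a shell and asks it to print the
# HTTP status code on a final line after the body (-w / --write-out).
def fetch_with_status(url)
  raw, _process_status = Open3.capture2("curl", "-s", "-w", "\n%{http_code}", url)
  split_body_and_status(raw)
end

# Splits "<body>\n<code>" into [body, integer status].
def split_body_and_status(raw)
  body, _newline, code = raw.rpartition("\n")
  [body, code.to_i]
end

# Canned output standing in for a real curl run against an error page.
body, status = split_body_and_status("<h1>Internal Server Error</h1>\n500")
puts "skipping page, HTTP #{status}" if status >= 400
```

In a real run, `body, status = fetch_with_status(url)` would replace the canned string, and pages with a status of 400 or above could be skipped before any title matching.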
The first three cases can be sorted with this solution (thanks to Havenwood in the #ruby IRC channel):
curl = %x(curl http://voldemortas.1.ai/)
begin
  simian = curl.match(/<title>(.*)<\/title>/)[1]
rescue NoMethodError
  simian = "" # match returned nil: no <title> in the page
rescue ArgumentError
  simian = "" # not HTML? (e.g. invalid bytes in binary data)
end
puts simian
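A possible alternative to the two rescue clauses, assuming Ruby 2.1 or later (where String#scrub is available): strip invalid bytes before matching, so a binary response such as an image no longer raises, and fall back to an empty string when no title is found. title_or_empty is a hypothetical name, the pattern is slightly looser than the one above, and canned strings stand in for the curl output.

```ruby
# Hypothetical helper: always returns a String, "" when no title is found.
def title_or_empty(raw)
  # scrub removes invalid byte sequences, so matching against binary
  # data (e.g. an image) does not raise ArgumentError.
  clean = raw.scrub("")
  # String#[] returns nil on no match; to_s turns that nil into "".
  clean[/<title[^>]*>(.*?)<\/title>/im, 1].to_s.strip
end

puts title_or_empty("<title>Odin</title>")   # normal page
puts title_or_empty("<p>no title here</p>")  # no title
puts title_or_empty("\xFF\xD8\xFF\xE0 JFIF") # image bytes, not HTML
```

With this shape the caller always gets a string back, so the loop over pages needs no begin/rescue block for the first three cases.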
I am aware that this is neither elegant nor optimal.
REPHRASED QUESTION
Do you have a better way to achieve the same with nokogiri or another gem that handles these cases (no title, no valid HTML page, or even a 404 page)? Given that the pages I am parsing have a fairly simple title structure, is the above solution suitable? For the sake of knowledge, it would be useful to know why using an extra gem such as nokogiri for the parsing would be a better option (note: I try to keep gem dependencies to a minimum, as over time they often tend to break).