
I am building a script to parse the titles of multiple pages. Thanks to another Stack Overflow question, I now have this working bit:

curl = %x(curl http://odin.1.ai)
simian = curl.match(/<title>(.*)<\/title>/)[1]
puts simian

but if you try the same where a page has no title for example

 curl = %x(curl http://zales.1.ai)

it dies with `undefined method for nil:NilClass`, since the page has no title. I can't just check whether curl is nil, because in this case it isn't (it contains another line).
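For context, a minimal self-contained sketch of why this raises: `String#match` returns nil when the pattern misses, and calling `[1]` on nil is the NoMethodError (the HTML strings below are made-up stand-ins, not fetched from the real domains):

```ruby
html_with_title = '<html><head><title>Hi</title></head></html>'
html_no_title   = '<html><body>just an image here</body></html>'

# String#match returns a MatchData on success and nil on failure,
# so blindly calling [1] blows up for the no-title page.
def title_of(html)
  m = html.match(/<title>(.*)<\/title>/)
  m && m[1] # only index the MatchData when the match succeeded
end

puts title_of(html_with_title) # prints "Hi"
p title_of(html_no_title)      # prints nil
```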

Is there a way to make this work even when the title is not present, and then move on to the next page to check? I'd appreciate it if we could stick with this code: I did try other solutions with Nokogiri and open-uri (Nokogiri::HTML(open("http:/.....")), but those don't work either, since subdomains like byname_meee.1.ai fail with the default open-uri. So I'd be grateful if we can stick with this code that uses curl.

UPDATE

I realize I probably left out some specific cases that ought to be clarified. This is for parsing 300-400 pages. In the first run I noticed a few cases where Nokogiri and Hpricot, and even the more basic open-uri, do not work:

1) open-uri simply fails on a domain containing an underscore, like http://levant_alejandro.1.ai. This is a valid domain and works with curl, but not with open-uri or with Nokogiri via open-uri.

2) The second case is a page with no title, like http://zales.1.ai.

3) Third is a page serving an image with no valid HTML, like http://voldemortas.1.ai/.

A fourth case would be a page that returns nothing but an internal server error or a Passenger/Rack error.

The first three cases can be sorted with this solution (thanks to Havenwood in the #ruby IRC channel):

curl = %x(curl http://voldemortas.1.ai/)
begin
   simian = curl.match(/<title>(.*)<\/title>/)[1]
rescue NoMethodError
   simian = "" # curl.match was nil
rescue ArgumentError
   simian = "" # not HTML?
end
puts simian

I'm aware that this is neither elegant nor optimal.

REPHRASED QUESTION

Do you have a better way to achieve the same with Nokogiri or another gem that handles these cases (no title, no valid HTML, or even a 404 page)? Given that the pages I am parsing have a fairly simple title structure, is the above solution suitable? For the sake of knowledge, it would be useful to know why an extra parsing gem like Nokogiri would be the better option (note: I try to keep gem dependencies few, as they often tend to break over time).

devnull

3 Answers


You're making it much too hard on yourself.

Nokogiri doesn't care where you get the HTML; it just wants the body of the document. You can use Curb, Open-URI, or a raw Net::HTTP connection, and it'll parse the content returned.

Try Curb:

require 'curb'
require 'nokogiri'

doc = Nokogiri::HTML(Curl.get('http://odin.1.ai').body_str)
doc.at('title').text
=> "Welcome to Dotgeek.org * 1.ai"

If you don't know whether you'll have a <title> tag, then don't try to do it all at once:

title = doc.at('title')
next unless title
puts title.text

Take a look at "equivalent of curl for Ruby?" for more ideas.

the Tin Man
  • Tin Man: many thanks for this. I hate to sound like such a pain, but this does not work if the page serves non-HTML at the index, e.g. http://voldemortas.1.ai, so curl is more appropriate in my humble opinion (and I do mean %x(curl...)). I am just a bit puzzled about why it would be a better solution to include two gems (one for curl, which shouldn't be needed, and one for Nokogiri, which is certainly very appropriate for more complex parsing). I would love to use it, but it simply doesn't work in some cases, like the one mentioned above. – devnull Sep 08 '12 at 02:02
  • It doesn't make sense to look for a title tag in anything but HTML (or maybe XML). So why not just check the returned content type before trying to parse? – Thilo Sep 08 '12 at 06:34

You just need to check for the match before accessing it. If curl.match is nil, then you can't access the grouping:

curl = %x(curl http://odin.1.ai)
simian = curl.match(/<title>(.*)<\/title>/)
simian &&= simian[1] # only access the matched group if available
puts simian

Do heed the Tin Man's advice and use Nokogiri. Your regexp is really only suitable for a brittle solution -- it fails when the title element is spread over multiple lines.
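To illustrate that brittleness with a standalone sketch (the HTML snippet is made up): without the `/m` modifier, `.` does not match newlines, so a title element spread over several lines is missed entirely:

```ruby
multiline_html = "<html><head><title>\nMy Site\n</title></head></html>"

# '.' stops at newlines by default, so this regexp fails to match:
p multiline_html.match(/<title>(.*)<\/title>/) # prints nil

# With /m, '.' also matches newlines, so the title is captured
# (surrounding whitespace included, hence the strip):
p multiline_html.match(/<title>(.*)<\/title>/m)[1].strip # prints "My Site"
```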

Update

If you really don't want to use an HTML parser and if you promise this is for a quick script, you can use OpenURI (wrapper around net/http) in the standard library. It's at least a little cleaner than parsing curl output.

require 'open-uri'

def extract_title_content(line)
  title = line.match(%r{<title>(.*)</title>})
  title &&= title[1]
end

def extract_title_from(uri)
  title = nil

  open(uri) do |page|
    page.lines.each do |line|
      return title if title = extract_title_content(line)
    end
  end
rescue OpenURI::HTTPError => e
  STDERR.puts "ERROR: Could not download #{uri} (#{e})"
end

puts extract_title_from 'http://odin.1.ai'
jmdeldin
  • I have rephrased the question for clarity but your solution is pretty close to the one I posted on the revised question. – devnull Sep 08 '12 at 05:50
  • @devnull: If this is a quick job, I'd go for just parsing the title manually. If you need to do this for more general web pages, then you'll get the immense benefit of *not* having to write regexps for edge-case `TITLE` elements. A rule of thumb I try to follow: use an HTML parser if the job takes more than 15 min OR if your code will be maintained by someone else. If the last statement is false... see the updated answer. – jmdeldin Sep 08 '12 at 06:19

What you're really looking for, it seems, is a way to skip non-HTML responses. That's much easier with a curl wrapper like Curb, as the Tin Man suggested, than dropping to the shell and using curl there:

1.9.3p125 :001 > require 'curb'
 => true 
1.9.3p125 :002 > response = Curl.get('http://odin.1.ai')
 => #<Curl::Easy http://odin.1.ai?> 
1.9.3p125 :003 > response.content_type
 => "text/html" 
1.9.3p125 :004 > response = Curl.get('http://voldemortas.1.ai')
 => #<Curl::Easy http://voldemortas.1.ai?> 
1.9.3p125 :005 > response.content_type
 => "image/png" 
1.9.3p125 :006 > 

So your code could look something like this:

response = Curl.get(url)
if response.content_type == "text/html" # or more fuzzy: =~ /text/
  match = response.body_str.match(/<title>(.*)<\/title>/)
  title = match && match[1] 
  # or use Nokogiri for heavier lifting
end

No more exceptions.
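The same guard can be exercised without the Curb gem or a live network. Below is a stdlib-only sketch, where `title_from_response` is a hypothetical helper name mirroring the content-type check above:

```ruby
# Hypothetical helper mirroring the Curb version's logic: only attempt
# the title regexp when the response claims to be HTML; return nil otherwise.
def title_from_response(content_type, body)
  return nil unless content_type =~ %r{text/html}
  match = body.match(%r{<title>(.*)</title>})
  match && match[1]
end

p title_from_response('text/html', '<title>Hello</title>') # prints "Hello"
p title_from_response('image/png', 'not html at all')      # prints nil
```

With Curb, the two arguments would come from `response.content_type` and `response.body_str`; with Net::HTTP, from `res['Content-Type']` and `res.body`.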

Thilo