I am trying to get the contents of the title tag, but I can't get it to work. I am following some answers on Stack Overflow that are supposed to work, but they don't work for me.

This is what I am doing:

require "open-uri"
require "uri"

def browse startpage, depth, block
    if depth > 0
        begin 
            open(startpage){ |f|
                block.call startpage, f
            }
        rescue
            return
        end
    end
end

browse("https://www.ruby-lang.org/es/", 2, lambda { |page_name, web|
    puts "Header information:"
    puts "Title: #{web.to_s.scan(/<title>(.*?)<\/title>/)}"
    puts "Base URI: #{web.base_uri}"
    puts "Content Type: #{web.content_type}"
    puts "Charset: #{web.charset}"
    puts "-----------------------------"
})

The title output is just [], why?

dabadaba
  • Are you trying to do this using open-uri only? Why not use something like Nokogiri? – daremkd Nov 06 '14 at 12:00
  • @daremkd yeah I saw Nokogiri serves this purpose, but I want to do it this way and I want to know why I am getting an empty list as the title. After all that is solved, a Nokogiri solution as an extra tip could be great as well. – dabadaba Nov 06 '14 at 12:14
  • Using regex to parse HTML tags is highly discouraged. There could be thousands of nuances on ANY web page that could cause your regex not to work. – daremkd Nov 06 '14 at 12:16

2 Answers

open returns a File object, or passes it to the block if one is given (actually a Tempfile, but that doesn't matter). Calling to_s on it just returns a string containing the object's class and its id:

open('https://www.ruby-lang.org/es/') do |f|
  f.to_s
end
#=> "#<File:0x007ff8e23bfb68>"

Scanning that string for a title is obviously useless:

"#<File:0x007ff8e23bfb68>".scan(/<title>(.*?)<\/title>/)
#=> []

Instead, you have to read the file's content:

open('https://www.ruby-lang.org/es/') do |f|
  f.read
end
#=> "<!DOCTYPE html>\n<html>\n...</html>\n"
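
The difference between to_s and read can also be reproduced without a network request. The following is a minimal sketch that uses a StringIO as a stand-in for the object open yields; the helper names (sample_page, describe, content) and the sample HTML are made up for illustration:

```ruby
require "stringio"

# Hypothetical stand-in for the object open-uri yields; no network needed.
def sample_page
  StringIO.new("<!DOCTYPE html>\n<html><head><title>Example</title></head></html>\n")
end

# to_s is inherited from Object and only names the class and object id.
def describe(io)
  io.to_s
end

# read returns the actual content as a String.
def content(io)
  io.read
end

puts describe(sample_page)  # something like #<StringIO:0x...>
puts content(sample_page)   # the HTML itself
```

Only the second string contains a <title> tag, which is why scanning the result of to_s finds nothing.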

You can now scan the content for a <title> tag:

open('https://www.ruby-lang.org/es/') do |f|
  str = f.read
  str.scan(/<title>(.*?)<\/title>/)
end
#=> [["Lenguaje de Programaci\xC3\xB3n Ruby"]]
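
Note that scan returns an array of arrays (one inner array per match), and the escaped bytes above suggest the string is not tagged as UTF-8. If only the first title is needed as a plain string, something like the sketch below might do. It assumes the page bytes really are UTF-8; first_title and TITLE_PATTERN are hypothetical names, and the sample input stands in for the downloaded page:

```ruby
# The /m flag lets the title span more than one line.
TITLE_PATTERN = /<title>(.*?)<\/title>/m

# Returns the first title as a UTF-8 String, or nil when no title is found.
# Assumption: the page bytes are UTF-8 encoded.
def first_title(html)
  title = html[TITLE_PATTERN, 1]  # String#[] with a capture group index
  title && title.dup.force_encoding(Encoding::UTF_8).strip
end

# Sample input standing in for the downloaded page (binary-encoded on purpose).
html = "<html><head><title>Lenguaje de Programaci\xC3\xB3n Ruby</title></head></html>".b

puts first_title(html)
p first_title("<html></html>")
```

This is still regex-based, so it shares the limitations mentioned in the comments on the question.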

Or, using Nokogiri (because you can't parse [X]HTML with regex):

require 'nokogiri'

open('https://www.ruby-lang.org/es/') do |f|
  doc = Nokogiri::HTML(f)
  doc.at_css('title').text
end
#=> "Lenguaje de Programación Ruby"
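
Applied to the code in the question, the fix is to call read on the object passed to the lambda. The sketch below is one way this might look, not a definitive version: page_report is a hypothetical helper, and it assumes a Ruby version where URI.open is available (as of Ruby 3.0, plain open no longer accepts URLs).

```ruby
require "open-uri"

# Builds the report lines for one fetched page.
# `web` is the IO-like object yielded by open-uri (responds to read, base_uri, ...).
def page_report(web)
  html = web.read  # read the body; to_s would only describe the object
  title = html[/<title>(.*?)<\/title>/m, 1]
  [
    "Header information:",
    "Title: #{title}",
    "Base URI: #{web.base_uri}",
    "Content Type: #{web.content_type}",
    "Charset: #{web.charset}",
    "-----------------------------"
  ]
end

def browse(startpage, depth, block)
  return unless depth > 0

  URI.open(startpage) { |f| block.call(startpage, f) }
rescue StandardError => e
  # The original swallowed errors silently; printing them helps debugging.
  warn "Could not fetch #{startpage}: #{e.message}"
end

# Usage (performs a network request):
#   browse("https://www.ruby-lang.org/es/", 2,
#          lambda { |_page_name, web| puts page_report(web) })
```

Keeping the report in its own method means it can be tried against any object that responds to read, base_uri, content_type and charset, without fetching a page.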
Stefan

If you insist on using open-uri, this one-liner can get you the page title:

2.1.4 :008 > puts open('https://www.ruby-lang.org/es/').read.scan(/<title>(.*?)<\/title>/)
Lenguaje de Programación Ruby
 => nil

If you need something more complicated than this, please use Nokogiri or Mechanize.

CuriousMind
  • `Nokogiri::HTML(open('https://www.ruby-lang.org/es/')).css('title').text` isn't that complicated – Stefan Nov 06 '14 at 12:34
  • Actually, I prefer the way I was doing it because with `URL.extract` I get links in absolute form, not relative. With Nokogiri I get the `href`, and I want the whole link because the page will be processed. However, your solution with `open().read.scan()` is the same as what I was doing: `read` is the same as the `web.to_s` I was using. And I still have the same problem: the title is an empty list. – dabadaba Nov 06 '14 at 12:44