I'm using Nokogiri to extract links from a page, but I would like to get the absolute URL even when the link on the page is a relative one. How can I accomplish this?

Mridang Agarwalla

3 Answers

Nokogiri is unrelated, other than the fact that it gives you the link anchor to begin with. Use Ruby's URI library to manage paths:

absolute_uri = URI.join( page_url, href ).to_s

Seen in action:

require 'uri'

# The URL of the page with the links
page_url = 'http://foo.com/zee/zaw/zoom.html'

# A variety of links to test.
hrefs = %w[
  http://zork.com/             http://zork.com/#id
  http://zork.com/bar          http://zork.com/bar#id
  http://zork.com/bar/         http://zork.com/bar/#id
  http://zork.com/bar/jim.html http://zork.com/bar/jim.html#id
  /bar                         /bar#id
  /bar/                        /bar/#id
  /bar/jim.html                /bar/jim.html#id
  jim.html                     jim.html#id
  ../jim.html                  ../jim.html#id
  ../                          ../#id
  #id
]

hrefs.each do |href|
  root_href = URI.join(page_url,href).to_s
  puts "%-32s -> %s" % [ href, root_href ]
end
#=> http://zork.com/                 -> http://zork.com/
#=> http://zork.com/#id              -> http://zork.com/#id
#=> http://zork.com/bar              -> http://zork.com/bar
#=> http://zork.com/bar#id           -> http://zork.com/bar#id
#=> http://zork.com/bar/             -> http://zork.com/bar/
#=> http://zork.com/bar/#id          -> http://zork.com/bar/#id
#=> http://zork.com/bar/jim.html     -> http://zork.com/bar/jim.html
#=> http://zork.com/bar/jim.html#id  -> http://zork.com/bar/jim.html#id
#=> /bar                             -> http://foo.com/bar
#=> /bar#id                          -> http://foo.com/bar#id
#=> /bar/                            -> http://foo.com/bar/
#=> /bar/#id                         -> http://foo.com/bar/#id
#=> /bar/jim.html                    -> http://foo.com/bar/jim.html
#=> /bar/jim.html#id                 -> http://foo.com/bar/jim.html#id
#=> jim.html                         -> http://foo.com/zee/zaw/jim.html
#=> jim.html#id                      -> http://foo.com/zee/zaw/jim.html#id
#=> ../jim.html                      -> http://foo.com/zee/jim.html
#=> ../jim.html#id                   -> http://foo.com/zee/jim.html#id
#=> ../                              -> http://foo.com/zee/
#=> ../#id                           -> http://foo.com/zee/#id
#=> #id                              -> http://foo.com/zee/zaw/zoom.html#id

The more convoluted answer here previously used URI.parse(root).merge(URI.parse(href)).to_s.
Thanks to @pguardiario for the improvement.

Phrogz
    Nokogiri could be related to this. Here is how: if an HTML document contains a base tag, the solution above won't work correctly. In that case, the value of the base tag's href attribute should be used instead of page_url. See the more detailed explanation by @david-thomas here: http://stackoverflow.com/questions/5559578/havling-links-relative-to-root – draganstankovic Sep 15 '12 at 21:21
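
To illustrate the comment above, here is a minimal sketch of honoring a base tag. The helper name resolve_link and the example hosts are hypothetical. It assumes the base href has already been pulled out of the document (with Nokogiri, something like doc.at_css('base[href]') would locate the element), and that a base href may itself be relative to the page URL.

```ruby
require 'uri'

# Hypothetical helper: resolve href against the document's <base href>, if any.
# base_href is nil when the page has no <base> tag.
def resolve_link(page_url, href, base_href = nil)
  # URI.join merges its arguments left to right, so a relative base_href
  # is first resolved against page_url, and href against that result.
  URI.join(*[page_url, base_href, href].compact).to_s
end

page_url = 'http://foo.com/zee/zaw/zoom.html'

puts resolve_link(page_url, 'jim.html')
#=> http://foo.com/zee/zaw/jim.html
puts resolve_link(page_url, 'jim.html', 'http://static.foo.com/assets/')
#=> http://static.foo.com/assets/jim.html
puts resolve_link(page_url, 'jim.html', '/docs/')
#=> http://foo.com/docs/jim.html
```

Without a base tag the result is the same as plain URI.join(page_url, href), so the helper can be used unconditionally.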

Phrogz's answer is fine, but more simply:

URI.join(base, url).to_s
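
One thing to keep in mind when feeding scraped hrefs into URI.join: it raises URI::InvalidURIError for strings it cannot parse, such as an href containing a raw space. Below is a minimal sketch of guarding against that. The name safe_join is hypothetical, and returning nil for unparseable links is only one possible choice.

```ruby
require 'uri'

# Hypothetical wrapper: returns the absolute URL as a String,
# or nil when the href cannot be parsed as a URI.
def safe_join(base, url)
  URI.join(base, url).to_s
rescue URI::InvalidURIError
  nil
end

base = 'http://foo.com/zee/zaw/zoom.html'

p safe_join(base, '../jim.html')   #=> "http://foo.com/zee/jim.html"
p safe_join(base, 'my page.html')  #=> nil
```
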
pguardiario

You need to check whether the URL is absolute or relative by checking whether it begins with http:. If the URL is relative, you need to add the host to it. Nokogiri can't do that for you. You need to process every URL in the document to render it as absolute.
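
The absolute-or-relative check described here does not have to be a string comparison against http:. As a sketch, Ruby's URI objects respond to absolute? and relative?, which depend on whether the URI carries a scheme, so https: links are covered too. The helper name absolutize is hypothetical, and the sample URLs are only illustrative.

```ruby
require 'uri'

# Hypothetical helper: leave absolute links alone, resolve relative ones
# against the URL of the page they were found on.
def absolutize(page_url, href)
  link = URI.parse(href)
  return link.to_s if link.absolute?  # already has a scheme such as http or https

  URI.join(page_url, href).to_s
end

page_url = 'http://foo.com/zee/zaw/zoom.html'

puts absolutize(page_url, 'https://zork.com/bar')  #=> https://zork.com/bar
puts absolutize(page_url, '/bar')                  #=> http://foo.com/bar
puts absolutize(page_url, 'jim.html')              #=> http://foo.com/zee/zaw/jim.html
```

In practice the explicit check is optional, since URI.join already returns an absolute href unchanged.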

shingara