How can I get the absolute URL when extracting links using Nokogiri?

Question

I'm using Nokogiri to extract links from a page but I would like to get the absolute path even though the one on the page is a relative one. How can I accomplish this?

Phrogz · Accepted Answer · 2012-10-15T18:05:27.793

Nokogiri is unrelated, other than the fact that it gives you the link anchor to begin with. Use Ruby's URI library to manage paths:

absolute_uri = URI.join( page_url, href ).to_s

Seen in action:

require 'uri'

# The URL of the page with the links
page_url = 'http://foo.com/zee/zaw/zoom.html'

# A variety of links to test.
hrefs = %w[
  http://zork.com/             http://zork.com/#id
  http://zork.com/bar          http://zork.com/bar#id
  http://zork.com/bar/         http://zork.com/bar/#id
  http://zork.com/bar/jim.html http://zork.com/bar/jim.html#id
  /bar                         /bar#id
  /bar/                        /bar/#id
  /bar/jim.html                /bar/jim.html#id
  jim.html                     jim.html#id
  ../jim.html                  ../jim.html#id
  ../                          ../#id
  #id
]

hrefs.each do |href|
  root_href = URI.join(page_url,href).to_s
  puts "%-32s -> %s" % [ href, root_href ]
end
#=> http://zork.com/                 -> http://zork.com/
#=> http://zork.com/#id              -> http://zork.com/#id
#=> http://zork.com/bar              -> http://zork.com/bar
#=> http://zork.com/bar#id           -> http://zork.com/bar#id
#=> http://zork.com/bar/             -> http://zork.com/bar/
#=> http://zork.com/bar/#id          -> http://zork.com/bar/#id
#=> http://zork.com/bar/jim.html     -> http://zork.com/bar/jim.html
#=> http://zork.com/bar/jim.html#id  -> http://zork.com/bar/jim.html#id
#=> /bar                             -> http://foo.com/bar
#=> /bar#id                          -> http://foo.com/bar#id
#=> /bar/                            -> http://foo.com/bar/
#=> /bar/#id                         -> http://foo.com/bar/#id
#=> /bar/jim.html                    -> http://foo.com/bar/jim.html
#=> /bar/jim.html#id                 -> http://foo.com/bar/jim.html#id
#=> jim.html                         -> http://foo.com/zee/zaw/jim.html
#=> jim.html#id                      -> http://foo.com/zee/zaw/jim.html#id
#=> ../jim.html                      -> http://foo.com/zee/jim.html
#=> ../jim.html#id                   -> http://foo.com/zee/jim.html#id
#=> ../                              -> http://foo.com/zee/
#=> ../#id                           -> http://foo.com/zee/#id
#=> #id                              -> http://foo.com/zee/zaw/zoom.html#id

The more convoluted answer here previously used URI.parse(root).merge(URI.parse(href)).to_s.
Thanks to @pguardiario for the improvement.

Nokogiri could be related to this. Here is how: if an HTML document contains a base tag, the solution above won't work correctly. In that case the value of the base tag's href attribute should be used instead of page_url. Take a look at the more detailed explanation by @david-thomas here: http://stackoverflow.com/questions/5559578/havling-links-relative-to-root – draganstankovic, Sep 15 '12 at 21:21
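
To illustrate the base tag point, here is a minimal sketch using only Ruby's URI library. The helper name effective_base and the '/assets/' value are made up for this example; base_href stands for whatever the page's base tag holds (nil when there is none). With Nokogiri it could presumably be read from the node returned by doc.at_css('base'), but that lookup is left out so the snippet runs without the gem.

```ruby
require 'uri'

# Hypothetical helper (not part of the answer above): choose the URL that
# relative links should be resolved against.
# base_href is the href of the page's <base> tag, or nil if the page has none.
def effective_base(page_url, base_href)
  return page_url if base_href.nil? || base_href.strip.empty?
  # The <base> href may itself be relative, so resolve it against the page URL.
  URI.join(page_url, base_href.strip).to_s
end

page_url = 'http://foo.com/zee/zaw/zoom.html'

# Assumed example values: a page with <base href="/assets/"> and one without.
with_base    = URI.join(effective_base(page_url, '/assets/'), 'jim.html').to_s
without_base = URI.join(effective_base(page_url, nil), 'jim.html').to_s

puts with_base     # http://foo.com/assets/jim.html
puts without_base  # http://foo.com/zee/zaw/jim.html
```

The only change from the accepted answer is which URL gets passed as the first argument to URI.join.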

score 15 · Answer 2 · answered Jan 04 '12 at 06:50

Phrogz' answer is fine but more simply:

URI.join(base, url).to_s

pguardiario

Can you give an example of what base and url are? – lulalala Jun 28 '13 at 02:56
`base = "http://www.google.com/somewhere"; url = '/over/there';` I believe pguardiario's variable names are a little imprecise – dancow Dec 08 '13 at 21:18
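
Spelled out as a runnable snippet, using the values from the comment above (the variable name absolute is just for illustration):

```ruby
require 'uri'

base = 'http://www.google.com/somewhere'  # URL of the page the link came from
url  = '/over/there'                      # href as it appears in that page

absolute = URI.join(base, url).to_s
puts absolute  # http://www.google.com/over/there
```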

score 1 · Answer 3 · answered Feb 01 '11 at 11:08

You need to check whether the URL is absolute or relative by checking whether it begins with http:. If the URL is relative, you need to add the host to it. Nokogiri can't do that for you. You need to process every URL in the page to turn it into an absolute one.
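
As a rough sketch of that check, Ruby's URI library can tell whether a link has a scheme, which is more reliable than comparing string prefixes (it also covers https:, for example). The helper name absolutize is made up for this example.

```ruby
require 'uri'

# Hypothetical helper: return href unchanged when it is already absolute,
# otherwise resolve it against the URL of the page it came from.
def absolutize(page_url, href)
  uri = URI.parse(href)
  return href if uri.absolute?  # true when the URI has a scheme (http:, https:, ...)
  URI.join(page_url, href).to_s
end

page_url = 'http://foo.com/zee/zaw/zoom.html'

puts absolutize(page_url, 'http://zork.com/bar')  # http://zork.com/bar
puts absolutize(page_url, '/bar/jim.html')        # http://foo.com/bar/jim.html
```

The explicit check is mostly illustrative: URI.join already returns an absolute href unchanged, so calling it directly on every link gives the same results.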

shingara
