I'm using Nokogiri to extract links from a page but I would like to get the absolute path even though the one on the page is a relative one. How can I accomplish this?
3 Answers
60
Nokogiri is unrelated, other than the fact that it gives you the link anchor to begin with. Use Ruby's URI library to manage paths:
absolute_uri = URI.join( page_url, href ).to_s
Seen in action:
require 'uri'
# The URL of the page with the links
page_url = 'http://foo.com/zee/zaw/zoom.html'
# A variety of links to test.
hrefs = %w[
http://zork.com/ http://zork.com/#id
http://zork.com/bar http://zork.com/bar#id
http://zork.com/bar/ http://zork.com/bar/#id
http://zork.com/bar/jim.html http://zork.com/bar/jim.html#id
/bar /bar#id
/bar/ /bar/#id
/bar/jim.html /bar/jim.html#id
jim.html jim.html#id
../jim.html ../jim.html#id
../ ../#id
#id
]
hrefs.each do |href|
  root_href = URI.join(page_url, href).to_s
  puts "%-32s -> %s" % [ href, root_href ]
end
#=> http://zork.com/ -> http://zork.com/
#=> http://zork.com/#id -> http://zork.com/#id
#=> http://zork.com/bar -> http://zork.com/bar
#=> http://zork.com/bar#id -> http://zork.com/bar#id
#=> http://zork.com/bar/ -> http://zork.com/bar/
#=> http://zork.com/bar/#id -> http://zork.com/bar/#id
#=> http://zork.com/bar/jim.html -> http://zork.com/bar/jim.html
#=> http://zork.com/bar/jim.html#id -> http://zork.com/bar/jim.html#id
#=> /bar -> http://foo.com/bar
#=> /bar#id -> http://foo.com/bar#id
#=> /bar/ -> http://foo.com/bar/
#=> /bar/#id -> http://foo.com/bar/#id
#=> /bar/jim.html -> http://foo.com/bar/jim.html
#=> /bar/jim.html#id -> http://foo.com/bar/jim.html#id
#=> jim.html -> http://foo.com/zee/zaw/jim.html
#=> jim.html#id -> http://foo.com/zee/zaw/jim.html#id
#=> ../jim.html -> http://foo.com/zee/jim.html
#=> ../jim.html#id -> http://foo.com/zee/jim.html#id
#=> ../ -> http://foo.com/zee/
#=> ../#id -> http://foo.com/zee/#id
#=> #id -> http://foo.com/zee/zaw/zoom.html#id
The more convoluted answer here previously used URI.parse(root).merge(URI.parse(href)).to_s.
Thanks to @pguardiario for the improvement.

Phrogz
6 Nokogiri could be related to this: if an HTML document contains a base tag, the solution above won't work correctly. In that case the value of the base tag's href attribute should be used instead of page_url. See the more detailed explanation by @david-thomas here: http://stackoverflow.com/questions/5559578/havling-links-relative-to-root – draganstankovic Sep 15 '12 at 21:21
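A minimal sketch of the base-tag point raised in this comment, using only the standard library. The helper names `effective_base` and `absolutize` are hypothetical, and the `base_href` argument is assumed to be the value of the page's base tag href (or nil when there is none); with Nokogiri it could presumably be read with something like `doc.at_css('base[href]')`, which is left as a comment so the snippet runs without the gem.

```ruby
require 'uri'

# Hypothetical helper: choose the URL that relative links resolve against.
# base_href is the href of the page's <base> tag, or nil when there is none.
# With Nokogiri it might be obtained as:
#   node = doc.at_css('base[href]')
#   base_href = node && node['href']
def effective_base(page_url, base_href)
  return page_url if base_href.nil? || base_href.strip.empty?
  # The base href can itself be relative, so resolve it against the page URL.
  URI.join(page_url, base_href.strip).to_s
end

# Hypothetical helper: resolve one link, honouring the base tag if present.
def absolutize(page_url, base_href, href)
  URI.join(effective_base(page_url, base_href), href).to_s
end

page_url = 'http://foo.com/zee/zaw/zoom.html'
puts absolutize(page_url, nil, 'jim.html')
puts absolutize(page_url, 'http://cdn.foo.com/assets/', 'jim.html')
puts absolutize(page_url, '/other/', 'jim.html')
```

Without a base tag the result matches the `URI.join(page_url, href)` approach in the answer; with one, the same link resolves against the base href instead.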
15
Phrogz' answer is fine but more simply:
URI.join(base, url).to_s
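Links scraped from real pages are not always valid URIs, and `URI.join` raises `URI::InvalidURIError` for ones it cannot parse. The following is only a sketch of one way to guard the call; `safe_join` is a hypothetical name, and the sample values are illustrative.

```ruby
require 'uri'

# Hypothetical wrapper around URI.join: returns nil instead of raising
# when the href cannot be parsed as a URI.
def safe_join(base, url)
  URI.join(base, url.to_s.strip).to_s
rescue URI::InvalidURIError
  nil
end

base = 'http://www.google.com/somewhere'
puts safe_join(base, '/over/there')      # http://www.google.com/over/there
p    safe_join(base, 'not a valid href') # nil
```

Returning nil lets the caller skip or log bad links, for example with `compact` after mapping over all hrefs.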

pguardiario
2 `base = "http://www.google.com/somewhere"; url= '/over/there';` I believe pguardiario's variable names are a little imprecise – dancow Dec 08 '13 at 21:18
1
You need to check whether the URL is absolute or relative, for example by checking whether it begins with http:
If the URL is relative, you need to add the host to it. Nokogiri can't do that for you; you need to process every URL in the page yourself to make it absolute.
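The check described above can be sketched as follows. Note that this swaps the string-prefix test for `URI#absolute?` from the standard library, which is true when the URI has a scheme; that is an assumption about what is wanted, not part of the original suggestion. The helper names `absolute_link?` and `to_absolute` are hypothetical.

```ruby
require 'uri'

# Hypothetical helper: true when href carries its own scheme (http:, https:, ...).
def absolute_link?(href)
  URI.parse(href).absolute?
rescue URI::InvalidURIError
  false
end

# Hypothetical helper: leave absolute links alone, resolve the rest
# against the URL of the page they were found on.
def to_absolute(page_url, href)
  absolute_link?(href) ? href : URI.join(page_url, href).to_s
end

page_url = 'http://foo.com/zee/zaw/zoom.html'
%w[http://zork.com/bar /bar jim.html ../jim.html].each do |href|
  puts "%-20s -> %s" % [href, to_absolute(page_url, href)]
end
```

Resolving with `URI.join` rather than prepending the host keeps links such as `../jim.html` correct. The explicit check is optional, since `URI.join` already returns absolute links unchanged.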

shingara