
I'm trying to grab the href value in <a> HTML tags using Nokogiri.

I want to identify whether they are a path, file, URL, or even a <div> id.

My current work is:

hrefvalue = []
html.css('a').each do |atag|
  hrefvalue << atag['href']
end

The possible values in an href might be:

somefile.html
http://www.someurl.com/somepath/somepath
/some/path/here
#previous

Is there a mechanism to identify whether the value is a valid full URL, or file, or path or others?

the Tin Man
d3t0n4t0r

3 Answers

3

Try URI:

require 'uri'

URI.parse('somefile.html').path
# => "somefile.html"

URI.parse('http://www.someurl.com/somepath/somepath').path
# => "/somepath/somepath"

URI.parse('/some/path/here').path
# => "/some/path/here"

URI.parse('#previous').path
# => ""
2

Nokogiri is often used alongside Ruby's URI or open-uri, so if that's the case in your situation you'll have access to their methods. You can use URI.parse to attempt to parse the href. You can also generally use URI.join(base_uri, retrieved_href) to construct the full URL, provided you've stored the base_uri.

(Edit/side-note: further details on using URI.join are available here: https://stackoverflow.com/a/4864170/624590 ; do note that URI.join takes strings as parameters, not URI objects, so coerce where necessary)

Basically, to answer your question

Is there a mechanism to identify whether the value is a valid full url, or file, or path or others?

If the retrieved_href and the base_uri are well formed, and retrieved_href == the joined pair, then it's an absolute URL. Otherwise it's relative to the base (again, assuming well formed inputs).
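
As a rough sketch of that comparison: `absolute_url?` and the base URL below are made-up names for illustration, not part of URI. It assumes both inputs parse cleanly, and because URI may normalize what it parses (for example the case of the scheme), a plain string comparison could report false for some hrefs that are in fact absolute.

```ruby
require 'uri'

# Hypothetical helper: true when joining href onto base_uri leaves href
# unchanged, which is what happens when href is already a full URL.
def absolute_url?(base_uri, href)
  URI.join(base_uri, href).to_s == href
end

# Assumed base: the URL of the page the hrefs were scraped from.
base_uri = 'http://www.someurl.com/index.html'

puts absolute_url?(base_uri, 'http://www.someurl.com/somepath/somepath') # true
puts absolute_url?(base_uri, 'somefile.html')                            # false
puts absolute_url?(base_uri, '/some/path/here')                          # false
puts absolute_url?(base_uri, '#previous')                                # false
```

If you only need the yes/no answer, `URI.parse(href).absolute?` checks for a scheme directly and doesn't need the base.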

Community
DRobinson
  • Nokogiri doesn't use OpenURI. If a file-handle is passed into a ["parsing"](http://nokogiri.org/Nokogiri/HTML/Document.html#method-c-parse) method, Nokogiri will read from that handle. If you give it a string instead it'll parse that string. OpenURI's `open` method returns something that acts like a file handle. – the Tin Man Oct 22 '12 at 21:35
  • Eek, you're absolutely right, I wrote that poorly. I meant to imply something more along the lines of "people generally use OpenURI alongside nokogiri for retrieving xml/html-to-be-parsed", however in retrospect that isn't a good assumption to make. – DRobinson Oct 23 '12 at 13:11
1

If you use URI to parse the href values, then apply some heuristics to the results, you can figure out what you want to know. This is basically what a browser has to do when it's about to send a request for a page or a resource.

Using your sample strings:

require 'uri'

%w[
  somefile.html
  http://www.someurl.com/somepath/somepath
  /some/path/here
  #previous
].each do |u|
  puts URI.parse(u).class
end

Results in:

URI::Generic
URI::HTTP
URI::Generic
URI::Generic

The only one that URI recognizes as a true HTTP URI is "http://www.someurl.com/somepath/somepath". All the others are missing the scheme "http://". (There are many more schemes you could encounter. See the specification for more information.)

Of the generic URIs, you can use some rules to sort through them so you'd know how to react if you have to open them.

If you gathered the HREF strings by scraping a page, you can assume it's safe to use the same scheme and host if the URI in question doesn't supply one. So, if you initially loaded "http://www.someurl.com/index.html", you could use "http://www.someurl.com/" as your basis for further requests.
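
To make that concrete, here is a minimal sketch that resolves each sample href against the page it came from. `resolve_all` and `BASE_URL` are names invented for this example, and it leans on URI.join rather than hand-rolled string handling.

```ruby
require 'uri'

# Assumed base: the URL of the page the hrefs were scraped from.
BASE_URL = 'http://www.someurl.com/index.html'

# Hypothetical helper: resolve every href against the base URL.
def resolve_all(base, hrefs)
  hrefs.map { |href| URI.join(base, href).to_s }
end

hrefs = %w[
  somefile.html
  http://www.someurl.com/somepath/somepath
  /some/path/here
  #previous
]

puts resolve_all(BASE_URL, hrefs)
# http://www.someurl.com/somefile.html
# http://www.someurl.com/somepath/somepath
# http://www.someurl.com/some/path/here
# http://www.someurl.com/index.html#previous
```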

From there, look inside the strings to determine whether they are anchors, absolute or relative paths. If the string:

  1. Starts with #, it's an anchor and would be applied to the current page without any need to reload it.
  2. Doesn't contain a path delimiter /, it's a filename and would be added to the currently retrieved URL, substituting the file name, and retrieved. A nice way to do the substitution is to use File.dirname, File.basename and File.join against the string.
  3. Begins with a path delimiter, it's an absolute path and is used to replace the path in the original URL. URI::split and URI::join are your friends here.
  4. Contains a path delimiter but doesn't begin with one, it's a relative path and is added to the current URI similarly to #2.
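
The rules above could be sketched roughly like this. `classify_href` and the symbols it returns are invented for this example, and it's only a heuristic: anything with a scheme (including things like mailto:) lands in `:absolute_url`, and protocol-relative hrefs such as //host/path would be reported as absolute paths.

```ruby
require 'uri'

# Hypothetical helper that applies the heuristics in order.
def classify_href(href)
  uri = URI.parse(href)
  return :absolute_url  if uri.scheme              # has a scheme such as http
  return :anchor        if href.start_with?('#')   # rule 1
  return :absolute_path if href.start_with?('/')   # rule 3
  return :relative_path if href.include?('/')      # rule 4
  :filename                                        # rule 2
rescue URI::InvalidURIError
  :invalid                                         # URI could not parse it
end

%w[
  somefile.html
  http://www.someurl.com/somepath/somepath
  /some/path/here
  #previous
].each do |href|
  puts "#{href} => #{classify_href(href)}"
end
```

The scheme check comes first so that a full URL is never mistaken for a relative path just because it contains slashes.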

Regarding:

hrefvalue = []
html.css('a').each do |atag|
  hrefvalue << atag['href']
end

I'd use this instead:

hrefvalue = html.search('a').map { |a| a['href'] }

But that's just me.

A final note: URI is showing its age and needs an update. It's a useful library but, for heavy-duty URI rippin' apart, I highly recommend looking into using Addressable::URI.

Community
the Tin Man
  • The problem arises when scraping a random web page; I never know what output I'll get. But hey, the search() method looks more convenient. Thank you – d3t0n4t0r Nov 02 '12 at 03:20
  • That's right, every page is different. Even pages within one site can be radically different so, to scrape or analyze pages successfully, you often have to have pretty intimate knowledge of the page's internal layout. I used to write spiders and analyzers for my job and did some huge sites and I can't begin to guess how many times I ranted to my boss over the design of some Fortune 50 sites. – the Tin Man Nov 02 '12 at 14:50