
I'm trying to retrieve every external link of a webpage using Ruby. I'm using String.scan with this regex:

/href="https?:[^"]*|href='https?:[^']*/i

Then, I can use gsub to remove the href part:

str.gsub(/href=['"]/, "")

This works fine, but I'm not sure how efficient it is. Is this OK to use, or should I work with a dedicated parser (Nokogiri, for example)? Which way is better?
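
Putting both steps together, this is roughly what I have (the sample HTML below is made up just for illustration):

html = %q{<a href="https://example.com/a">A</a> <a href='http://example.com/b'>B</a>}

# scan grabs the href=... fragments, gsub then strips the leading href=" or href='
links = html.scan(/href="https?:[^"]*|href='https?:[^']*/i)
            .map { |m| m.gsub(/href=['"]/, "") }
# => ["https://example.com/a", "http://example.com/b"]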

Thanks!

Fábio Perez
  • Please don't try to parse HTML with regular expressions, an HTML parser will serve you better. – mu is too short Jul 14 '11 at 22:19
  • Because HTML parsing is more complicated than you probably think it is and there is a lot of broken HTML out there that simple regular expressions won't handle: http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491 – mu is too short Jul 15 '11 at 02:42
  • Actually, in this simple case I would expect the regex solution to be more robust than the parsing solution. I would replace [^"] with [^" >], though. I would also expect it to be quite a bit faster. But it depends a bit on the goal: if this goes into a production system that has to work for years, I'd go for a parser; if it's a script for your own use, definitely regex. – markijbema Jul 16 '11 at 00:36
  • In all cases I'd expect a simple parsing solution to be more robust than a simple regex solution. :) – Mark Thomas Jul 16 '11 at 17:27
  • @markijbema, non-contrived issues we often see in HTML are a space around the `=`, missing quotes, or a mix of single and double quotes. Even within a single document from one author those things occur often. A more complex regex can be written to handle that, but a parser will do it without a problem. – the Tin Man Jul 16 '11 at 22:57

5 Answers


Using regular expressions is fine for a quick and dirty script, but Nokogiri is very simple to use:

require 'nokogiri'
require 'open-uri'

fail("Usage: extract_links URL [URL ...]") if ARGV.empty?

ARGV.each do |url|
  doc = Nokogiri::HTML(open(url))
  # Collect the href of every <a> tag, skipping blanks and resolving
  # relative links against the page URL.
  hrefs = doc.css("a").map do |link|
    if (href = link.attr("href")) && !href.empty?
      URI::join(url, href)
    end
  end.compact.uniq
  STDOUT.puts(hrefs.join("\n"))
end

If you want just a method, refactor it a bit to fit your needs:

def get_links(url)
  Nokogiri::HTML(open(url).read).css("a").map do |link|
    if (href = link.attr("href")) && href.match(/^https?:/)
      href
    end
  end.compact
end
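
For instance, you could then call it like this (the URL is only a placeholder):

require 'open-uri'

puts get_links("http://example.com")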
tokland
  • Can you explain the advantages? The code looks more complicated than with regex and scan. I'm also curious to know which solution is faster. – Fábio Perez Jul 14 '11 at 22:22
  • @tokland, I think you want Nokogiri::HTML. Also note the requirement to extract only absolute links. – Mark Thomas Jul 14 '11 at 22:29

I'm a big fan of Nokogiri, but why reinvent the wheel?

Ruby's URI module already has the extract method to do this:

URI::extract(str[, schemes][,&blk])

From the docs:

Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.

require "uri"

URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
# => ["http://foo.example.com/bla", "mailto:test@example.com"]

You could use Nokogiri to walk the DOM and pull all the tags that have URLs, or have it retrieve just the text and pass it to URI.extract, or just let URI.extract do it all.
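
As a quick sketch of the middle option, here's one way to pull the href attributes with Nokogiri and hand the document text to URI.extract; the sample markup is contrived:

require 'nokogiri'
require 'uri'

html = '<p>See <a href="http://example.com/a">this</a> or http://example.com/b in the text.</p>'
doc  = Nokogiri::HTML(html)

# href attributes from the DOM, plus any URLs embedded in the plain text.
urls = doc.css('a[href]').map { |a| a['href'] } + URI.extract(doc.text, %w[http https])
urls.uniq
# => ["http://example.com/a", "http://example.com/b"]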

And, why use a parser, such as Nokogiri, instead of regex patterns? Because HTML, and XML, can be formatted in a lot of different ways and still render correctly on the page or effectively transfer the data. Browsers are very forgiving when it comes to accepting bad markup. Regex patterns, on the other hand, work in very limited ranges of "acceptability", where that range is defined by how well you anticipate the variations in the markup, or, conversely, how well you anticipate the ways your pattern can go wrong when presented with unexpected patterns.

A parser doesn't work like a regex. It builds an internal representation of the document and then walks through that. It doesn't care how the file/markup is laid out, it does its work on the internal representation of the DOM. Nokogiri relaxes its parsing to handle HTML, because HTML is notorious for being poorly written. That helps us because with most non-validating HTML Nokogiri can fix it up. Occasionally I'll encounter something that is SO badly written that Nokogiri can't fix it correctly, so I'll have to give it a minor nudge by tweaking the HTML before I pass it to Nokogiri; I'll still use the parser though, rather than try to use patterns.

the Tin Man

Mechanize uses Nokogiri under the hood but has built-in niceties for parsing HTML, including links:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://example.com/')

page.links_with(:href => /^https?/).each do |link|
  puts link.href
end

Using a parser is almost always better than using regular expressions for parsing HTML. This is an often-asked question here on Stack Overflow, with this being the most famous answer. Why is that the case? Because constructing a robust regular expression that can handle real-world variations of HTML, some valid and some not, is very difficult, and ultimately more complicated than a simple parsing solution that will work for just about every page that will render in a browser.

Mark Thomas
  • I agree that when you need to parse HTML you don't want to use regexes. But in this case I think a regex would suffice, since you don't get into trouble with the non-regularity of HTML (there's no recursion involved). Can you think of a (non-contrived) example where this regex (with the improvement mentioned in my comment on the question) would fail? – markijbema Jul 16 '11 at 00:38
  • I do like your solution better, btw; it's short and readable. But I don't really like those over-absolute truths, like 'thou shalt not touch HTML with regexes'. – markijbema Jul 16 '11 at 00:39
  • @markijbema I've added a bit to explain. Here's one case I've seen: `foo`. Also sometimes there are newlines in there. – Mark Thomas Jul 16 '11 at 15:34

Why don't you use groups in your pattern? E.g.

/http[s]?:\/\/(.+)/i

The first group will then already be the link you searched for.
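
For example, reusing the regex from the question with a capture group (the sample HTML is made up):

html = %q{<a href="https://example.com/a">A</a> <a href='http://example.com/b'>B</a>}

# With a capture group, scan returns just the URL, so no follow-up gsub is needed.
html.scan(/href=["'](https?:[^"']*)/i).flatten
# => ["https://example.com/a", "http://example.com/b"]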

gorootde

Can you put groups in your regex? That would reduce your regular expressions from two to one.

RobotRock