
In one of my projects, the application has to check that a link to a given URL exists in a given page. Today a user reported an error. This was the link that the application was not detecting:

  <a\nhref="http://hello.com"...

I tried to test why it was not working, and here is where the strange behavior appeared. This Regexp matches the link:

/\<a.*\nhref=\"http:\/\/hello.com/

But this does not:

/\<a.*href=\"http:\/\/hello.com/

I guess it is related to the Ruby version I'm using (1.9.3), since Rubular matches the second regexp.

Wayne Conrad
Gawyn
  • Do not use regular expressions to parse an HTML document. – KARASZI István May 07 '12 at 15:37
  • @KARASZIIstván: This is one occasion where regex is actually (possibly) OK. Ill-advised, but [Zalgo](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) won't consume you just for using regex to check if a link exists in a document. – Li-aung Yip May 07 '12 at 15:42
  • @Li-aungYip Unless, you know, it's in an HTML comment. – Phrogz May 07 '12 at 15:45

2 Answers

Why It Is Broken

In Ruby (as with most regex implementations) the . matches any character except a newline unless you turn on the "multiline" mode:

irb(main):003:0> "foo\nbar"[/.+/]
#=> "foo"

irb(main):004:0> "foo\nbar"[/.+/m]
#=> "foo\nbar"

As the official Ruby 1.9 regex documentation states:

The following metacharacters also behave like character classes:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)

When your regex explicitly consumed the \n, everything worked. When you switched to just .*, the dot could not match the \n, so the regex could never continue on to match href.
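To see this with the patterns from the question, here is a minimal sketch. The sample string and the variable names are made up for illustration, and the dot in hello.com is escaped here, which the original patterns did not do:

```ruby
# Hypothetical sample: the anchor tag has a newline between "<a" and "href".
html = "<a\nhref=\"http://hello.com\">hello</a>"

# Same three shapes as discussed above (names are illustrative only).
explicit_newline = /<a.*\nhref="http:\/\/hello\.com/   # consumes \n explicitly
dot_only         = /<a.*href="http:\/\/hello\.com/     # relies on . to cross the \n
dot_multiline    = /<a.*href="http:\/\/hello\.com/m    # m flag lets . match \n

# =~ returns the match position or nil, so compare against nil.
explicit_hit  = !(explicit_newline =~ html).nil?  # true
dot_only_hit  = !(dot_only =~ html).nil?          # false
multiline_hit = !(dot_multiline =~ html).nil?     # true

puts explicit_hit, dot_only_hit, multiline_hit
```

Only the middle pattern fails, because it is the only one that needs the dot to match the newline without the m flag.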

Fixing it Better

Instead of using a regex to approximately parse and consume HTML, it's better to use a real HTML parser:

require 'nokogiri' # gem install nokogiri
doc = Nokogiri.HTML( my_html_string )

# Find it using XPath...
first_hello_link = doc.at('//a[starts-with(@href,"http://hello.com")]')

# ...or using CSS
first_hello_link = doc.at('a[href^="http://hello.com"]')

With this your code can robustly handle HTML with:

  • spaces before or after the equals sign
  • additional attributes appearing before href
  • quoting with either " or '
  • mixed capitalization
  • things that look like links but aren't (e.g. in a comment or script block)
Phrogz
  • Yes, I probably should use Nokogiri. But it's a classic: you start with an extremely simple parser and it starts to grow more and more until Nokogiri starts to be almost mandatory. – Gawyn May 07 '12 at 16:05
  • @Christian - It won't take too many of these experiences before you just pull out Nokogiri any time you have xml or html in your hands, no matter how simple the parsing seems. It _always_ grows to be too complex for a regexp. – Wayne Conrad May 08 '12 at 03:38
Regexps in Ruby don't match newline characters by default; you must add the m modifier:

/pat/m - Treat a newline as a character matched by .

Take a look at the options section:

http://www.ruby-doc.org/core-1.9.3/Regexp.html
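As a rough illustration of that option, the sketch below sets the same flag two ways: with the m suffix on a literal and with the Regexp::MULTILINE constant passed to Regexp.new. The pattern and variable names are made up for the example:

```ruby
# Two equivalent ways to enable multiline mode (names are illustrative only).
literal = /a.b/m
built   = Regexp.new("a.b", Regexp::MULTILINE)
plain   = /a.b/

text = "a\nb"

literal_hit = !(literal =~ text).nil?  # true
built_hit   = !(built =~ text).nil?    # true
plain_hit   = !(plain =~ text).nil?    # false

# The flag is visible in the options bitmask of the compiled regexp.
flag_set = (built.options & Regexp::MULTILINE) != 0  # true

puts literal_hit, built_hit, plain_hit, flag_set
```

The constant form is mostly useful when the pattern is built from a string at runtime and a literal with a suffix is not an option.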

josepjaume
  • It's only the dot metacharacter (`.`) that doesn't match newlines by default. If there were no dots in the regex, you wouldn't need the `m` flag. – Alan Moore May 07 '12 at 20:52
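
A small sketch of that last point, using a made-up sample string and hypothetical pattern names: character classes such as \s or [^>] match a newline without the m flag, while the bare dot does not.

```ruby
# Hypothetical sample with a newline between "<a" and "href".
html = "<a\nhref=\"http://hello.com\">hello</a>"

# No dot before href in the first two patterns, so no m flag is needed.
whitespace_class = /<a\s+href="http:\/\/hello\.com/
negated_class    = /<a[^>]*href="http:\/\/hello\.com/
dot_without_m    = /<a.*href="http:\/\/hello\.com/

whitespace_hit = !(whitespace_class =~ html).nil?  # true
negated_hit    = !(negated_class =~ html).nil?     # true
dot_hit        = !(dot_without_m =~ html).nil?     # false

puts whitespace_hit, negated_hit, dot_hit
```

This is still a regex check on HTML, so it shares the limits raised in the comments on the question (for example, it would also match inside an HTML comment).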