Strange behavior of regular expressions in Ruby

Question

In one of my projects, the application has to check that a link to a given URL exists in a given page. Today a user reported an error. This was the link that the application was not detecting:

  <a\nhref="http://hello.com"...

I tried to test why it was not working, and here is where the strange behavior appeared. This Regexp matches the link:

/\<a.*\nhref=\"http:\/\/hello.com/

But this does not:

/\<a.*href=\"http:\/\/hello.com/

I guess it is related to the Ruby version I'm using (1.9.3), since Rubular matches the last regexp.

@KARASZIIstván: This is one occasion where regex is actually (possibly) OK. Ill-advised, but [Zalgo](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) won't consume you just for using regex to check if a link exists in a document. — Li-aung Yip, May 07 '12 at 15:42

Phrogz · Accepted Answer · 2012-05-07T16:46:27.750

Why It Is Broken

In Ruby (as with most regex implementations) the . matches any character except a newline unless you turn on the "multiline" mode:

irb(main):003:0> "foo\nbar"[/.+/]
#=> "foo"

irb(main):004:0> "foo\nbar"[/.+/m]
#=> "foo\nbar"

As the official Ruby 1.9 regex documentation states:

The following metacharacters also behave like character classes:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)

When your code explicitly consumed the \n all worked well, but when you switched it to just .* it could not match the \n and thus could not continue on to match href.
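A minimal sketch of the difference, using the markup from the question. The patterns are illustrative rewrites rather than the asker's exact regexes (the dot in the host name is escaped), and `matches?` is a hypothetical helper. The third pattern is an alternative not mentioned above: it avoids the dot entirely, so it needs no `m` flag.

```ruby
# The reported markup: a newline sits between "<a" and "href".
html = %(<a\nhref="http://hello.com">hello</a>)

# Illustrative patterns (assumptions, not the asker's exact regexes).
dot_default   = %r{<a.*href="http://hello\.com}      # . cannot cross "\n"
dot_multiline = %r{<a.*href="http://hello\.com}m     # /m lets . match "\n"
no_dot        = %r{<a\s[^>]*href="http://hello\.com} # \s and [^>] match "\n" without /m

# Hypothetical helper: true when the pattern matches anywhere in the text.
def matches?(pattern, text)
  !(text =~ pattern).nil?
end

puts matches?(dot_default, html)   # => false
puts matches?(dot_multiline, html) # => true
puts matches?(no_dot, html)        # => true
```

With `m`, a greedy `.*` can run across many lines, so it may pair an `<a` from one tag with an `href` from a later one. The `[^>]*` variant stays inside a single tag.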

Fixing it Better

Instead of using regex to parse and consume HTML, it's better to use a real HTML parser:

require 'nokogiri' # gem install nokogiri
doc = Nokogiri::HTML(my_html_string) # my_html_string: the page source as a String

# Find it using XPath...
first_hello_link = doc.at('//a[starts-with(@href,"http://hello.com")]')

# ...or using CSS
first_hello_link = doc.at('a[href^="http://hello.com"]')

With this your code can robustly handle HTML with:

spaces before or after the equals sign
additional attributes appearing before href
quoting with either " or '
mixed capitalization
things that look like links but aren't (e.g. in a comment or script block)

Yes, I probably should use Nokogiri. But it's a classic: you start with an extremely simple parser and it starts to grow more and more until Nokogiri starts to be almost mandatory. — Gawyn, May 07 '12 at 16:05
@Christian - It won't take too many of these experiences before you just pull out Nokogiri any time you have xml or html in your hands, no matter how simple the parsing seems. It _always_ grows to be too complex for a regexp. — Wayne Conrad, May 08 '12 at 03:38

score 1 · Answer 2 · answered May 07 '12 at 15:43

Regexps in Ruby don't match newline characters by default; you must add the m modifier:

/pat/m - Treat a newline as a character matched by .

Take a look at the options section:

http://www.ruby-doc.org/core-1.9.3/Regexp.html

josepjaume

It's only the dot metacharacter (`.`) that doesn't match newlines by default. If there were no dots in the regex, you wouldn't need the `m` flag. – Alan Moore May 07 '12 at 20:52
