Why It Is Broken
In Ruby (as with most regex implementations) the . matches any character except a newline unless you turn on the "multiline" mode:
irb(main):003:0> "foo\nbar"[/.+/]
#=> "foo"
irb(main):004:0> "foo\nbar"[/.+/m]
#=> "foo\nbar"
As the official Ruby 1.9 regex documentation states:
The following metacharacters also behave like character classes:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
When your code explicitly consumed the \n all worked well, but when you switched it to just .* it could not match the \n and thus could not continue on to match href.
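To see that failure in isolation, here is a minimal sketch. The html string and the three patterns are made up for illustration (they are not the code from the question); the only assumption is that the tag's attributes are split across two lines.

```ruby
# Hypothetical input: an <a> tag whose attributes span two lines.
html = "<a class=\"x\"\nhref=\"http://hello.com/page\">Hi</a>"

# Without /m, .* stops at the newline, so the pattern never reaches href.
no_multiline = html[/<a.*href="([^"]+)"/, 1]

# With /m, . also matches "\n", so the match can continue onto the next line.
with_multiline = html[/<a.*?href="([^"]+)"/m, 1]

# A negated character class matches newlines without needing any modifier.
with_negated_class = html[/<a[^>]*href="([^"]+)"/, 1]

p no_multiline        # nil
p with_multiline      # "http://hello.com/page"
p with_negated_class  # "http://hello.com/page"
```

Either variant gets past the newline, but both still break on the other HTML variations listed further down.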
Fixing it Better
Instead of using regex to ~parse and consume HTML, it's better to use a real HTML parser:
require 'nokogiri' # gem install nokogiri
doc = Nokogiri.HTML( my_html_string )
# Find it using XPath...
first_hello_link = doc.at('//a[starts-with(@href,"http://hello.com")]')
# ...or using CSS
first_hello_link = doc.at('a[href^="http://hello.com"]')
With this your code can robustly handle HTML with:
- spaces before or after the equals sign
- additional attributes appearing before href
- quoting with either " or '
- mixed capitalization
- things that look like links but aren't (e.g. in a comment or script block)