How do I make allowance for multiple lines when doing a string.scan?

Question

Here is an excerpt from the HTML I want to scan through.

<div class="text">
 <h3>
  <a href="http://www.faith.co.uk/">
   Rodeo Sinclair
  </a>
 </h3>

And here is my Ruby code.

@doc = open(url) { |f| f.read }

output = @doc.scan(/<h3><a href=(.*?)>/)

This does not work because of the newlines and spaces in the HTML file. Is there any way I can get around this?

score 2 · Accepted Answer · edited May 23 '17 at 12:03

I could easily create a regular expression that would parse your HTML fragment.

However, I would like to encourage you to get in the habit of using an XML/HTML parser to interact with HTML.

require 'nokogiri'
require 'open-uri'

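# On Ruby 3.0+, open-uri no longer patches Kernel#open; use URI.open.
doc = Nokogiri::HTML(URI.open(url))

# Print the href of every <a> inside an <h3> inside a <div>.
doc.css('div h3 a').each do |link|
  puts link.attr("href")
end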

See "RegEx match open tags except XHTML self-contained tags" for a compelling argument against using regular expressions to parse HTML.

==EDIT== changed to an each loop

edited May 23 '17 at 12:03 by Community

answered Feb 10 '12 at 05:16 by ironchefpython

Thanks, would this grab all instances of that pattern in the HTML? – bolshevik Feb 10 '12 at 10:29
@bolshevik I changed it to an each loop to show how you'd get the href of each matching link – ironchefpython Feb 10 '12 at 15:49

score 1 · Answer 2 · answered Feb 10 '12 at 05:11

Allow optional whitespace between the tags in the match. In Ruby, \s matches newlines as well as spaces and tabs:

@doc.scan(/<h3>\s*<a href=(.*?)>/)
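
For what it's worth, here is a minimal sketch of that pattern run against an inline copy of the fragment from the question. The `html` string and the second `example.com` link are made up for illustration, to show that `scan` returns every match. One small change from the line above, which is my own assumption about what is wanted: the quotes are matched outside the capture group, so the captured value is the bare URL rather than the URL with its surrounding quotes.

```ruby
# Hypothetical sample input, modelled on the fragment in the question,
# with a second link added to show that scan returns every match.
html = <<~HTML
  <div class="text">
   <h3>
    <a href="http://www.faith.co.uk/">
     Rodeo Sinclair
    </a>
   </h3>
   <h3><a href="http://example.com/">Example</a></h3>
  </div>
HTML

# \s* matches any run of whitespace, including newlines, between the tags.
# The quotes sit outside the group, so only the URL itself is captured.
# scan returns one array per match; flatten turns that into a list of strings.
hrefs = html.scan(/<h3>\s*<a href="(.*?)">/).flatten

puts hrefs
```

This should be fine for a quick one-off, but it still breaks as soon as the markup changes shape (extra attributes before `href`, single quotes, and so on), which is the usual argument for a parser.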

answered Feb 10 '12 at 05:11 by Sophie Alpert