Here is an excerpt from the HTML I want to scan through.

<div class="text">
 <h3>
  <a href="http://www.faith.co.uk/">
   Rodeo Sinclair
  </a>
 </h3>

And here is my Ruby code.

require 'open-uri'

# URI.open returns the block's value, so a single assignment is enough.
@doc = URI.open(url) { |f| f.read }

output = @doc.scan(/<h3><a href=(.*?)>/) 

This does not work because of the newlines and spaces in the HTML file. Is there any way I can get around this?

bolshevik

2 Answers

I could easily create a regular expression that would parse your HTML fragment.

However, I would like to encourage you to get in the habit of using an XML/HTML parser to interact with HTML.

require 'nokogiri'
require 'open-uri'

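# On Ruby 3.0+ open-uri no longer patches Kernel#open, so call URI.open.
doc = Nokogiri::HTML(URI.open(url))

# Print the href of every link inside an h3 within a div.
doc.css('div h3 a').each do |link|
  puts link.attr("href")
end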

See RegEx match open tags except XHTML self-contained tags for a compelling argument against using regular expressions to parse HTML.

==EDIT== changed to an each loop

Community
ironchefpython

Allow optional whitespace (`\s*` also matches newlines) between the tags:

@doc.scan(/<h3>\s*<a href=(.*?)>/) 
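
A small self-contained check of this idea, using the fragment from the question in place of the downloaded page. The heredoc and the names `html`, `original`, and `hrefs` are assumptions for the example, and the pattern is a slight variant that captures inside the quotes so the result is the bare URL.

```ruby
# Sample markup copied from the question; it stands in for the fetched page.
html = <<~HTML
  <div class="text">
   <h3>
    <a href="http://www.faith.co.uk/">
     Rodeo Sinclair
    </a>
   </h3>
  </div>
HTML

# The pattern from the question finds nothing, because a newline and
# indentation sit between <h3> and <a>.
original = html.scan(/<h3><a href=(.*?)>/)

# \s* matches any run of whitespace, newlines included.
# Capturing inside the quotes keeps them out of the result.
hrefs = html.scan(/<h3>\s*<a\s+href="(.*?)"/).flatten

puts hrefs
```

`scan` returns one array per match when the pattern has a capture group, hence the `flatten`. This stays fragile if the markup changes (extra attributes, single quotes), which is where a parser tends to hold up better.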
Sophie Alpert