0

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I tested my regex in rubular.com and it works, but when I run the code it behaves differently.

I want to parse whole paragraphs out of some HTML code.

Here is my regex:

description = ad_page.body.scan(/(?<=<span id="preview-local-desc">).+(?=<\/span>)/m)

Here is some of the HTML source:

<span id="preview-local-desc"> I want to pick up everything typed here.
Paragraphs, everything.
</span>

The match begins where I need it to but then it keeps matching all the way to the end of the document.

dewet

2 Answers

4

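Aside from the fact that you shouldn't parse HTML with regex, you want non-greedy matching. With the /m flag the dot also matches newlines, so a greedy .+ runs on to the last </span> in the document, while .+? stops at the first one: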

/(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m
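
A minimal sketch of the difference in plain Ruby. The sample html string below, including the second <span>, is made up for illustration and stands in for ad_page.body:

```ruby
# Hypothetical sample page: the target span followed by an unrelated span.
html = <<~HTML
  <span id="preview-local-desc"> I want to pick up everything typed here.
  Paragraphs, everything.
  </span>
  <p>Other content</p>
  <span>footer</span>
HTML

greedy = /(?<=<span id="preview-local-desc">).+(?=<\/span>)/m
lazy   = /(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m

# String#[] with a regex returns the first match, or nil if there is none.
greedy_match = html[greedy]
lazy_match   = html[lazy]

# Greedy: consumes everything, then backtracks to the LAST </span>.
puts greedy_match.include?("footer")   # prints true
# Lazy: grows one character at a time and stops at the FIRST </span>.
puts lazy_match.include?("footer")     # prints false
```

Note that the lazy version still breaks if the description itself contains a nested <span>, since it stops at the first closing tag it sees.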
Eric
0

Parsing XML or HTML with a regex is marginally OK for trivial tasks, if you own or control the file's format. If you don't, then a simple change to the file could break your regex.
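
As a rough illustration of how small such a change can be, the sketch below runs the non-greedy pattern from this thread against three spellings of the same element. The variant strings are hypothetical, invented only to show the failure mode:

```ruby
# The pattern is tied to one exact spelling of the opening tag.
pattern = /(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m

# Hypothetical variations that still describe the same element.
original      = %q(<span id="preview-local-desc">text</span>)
single_quotes = %q(<span id='preview-local-desc'>text</span>)
extra_attr    = %q(<span class="desc" id="preview-local-desc">text</span>)

# String#[] returns the matched text, or nil when the pattern does not match.
results = [original, single_quotes, extra_attr].map { |html| html[pattern] }
p results   # => ["text", nil, nil]
```

Only the first spelling matches; a change in quoting or an added attribute makes the regex silently return nothing.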

Using a parser will avoid that problem; I've parsed some horrible XML with Nokogiri and it didn't even notice. After writing an RSS aggregator that was handling 1000+ feeds, I was hooked on using a parser.

require 'nokogiri'

html = '<span id="preview-local-desc"> I want to pick up everything typed here.
    Paragraphs, everything.
    </span>'

doc = Nokogiri.HTML(html)
doc.at('span').text
# => " I want to pick up everything typed here.\n    Paragraphs, everything.\n    "

If there are multiple <span> tags you want:

doc.search('span').map(&:text)
# => [" I want to pick up everything typed here.\n    Paragraphs, everything.\n    "]

If there are multiple <span> tags and you only want this one:

doc.at('span#preview-local-desc').text
# => " I want to pick up everything typed here.\n    Paragraphs, everything.\n    "
the Tin Man