Why is my Ruby lookahead regex not working

Question

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I tested my regex in rubular.com and it works, but when I run the code it behaves differently.

I want to extract whole paragraphs from some HTML.

Here is my regex:

description = ad_page.body.scan(/(?<=<span id="preview-local-desc">).+(?=<\/span>)/m)

Here is some of the HTML source:

<span id="preview-local-desc"> I want to pick up everything typed here.
Paragraphs, everything.
</span>

The match begins where I need it to but then it keeps matching all the way to the end of the document.

score 4 · Accepted Answer · edited May 23 '17 at 12:14

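Aside from the fact that you shouldn't parse HTML with regex, you want non-greedy matching. A greedy .+ with the /m flag consumes as much as it can and only stops at the last </span> in the document, while .+? stops at the first one: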

/(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m
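
To see the difference side by side, here is a minimal sketch. The sample HTML is hypothetical: it extends the snippet from the question with a second, made-up span so that the two patterns give different results.

```ruby
# Hypothetical sample: the question's snippet plus a second span further down.
html = <<~HTML
  <span id="preview-local-desc"> I want to pick up everything typed here.
  Paragraphs, everything.
  </span>
  <p>Other content</p>
  <span id="footer">Footer text</span>
HTML

greedy = /(?<=<span id="preview-local-desc">).+(?=<\/span>)/m
lazy   = /(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m

# String#[] with a Regexp returns the first match as a String (or nil).
greedy_match = html[greedy]
lazy_match   = html[lazy]

# Greedy runs on to the last </span>, so it swallows the footer span too.
puts greedy_match.include?("Footer text") # => true

# Lazy stops at the first </span> after the opening tag.
puts lazy_match.include?("Footer text")   # => false
p lazy_match
# => " I want to pick up everything typed here.\nParagraphs, everything.\n"
```

Note that the lazy version still stops too early if the description itself contains a nested </span>, which is one more reason to prefer a parser.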

answered Nov 17 '12 at 17:15 by Eric

That worked perfectly and immediately. Thank you very much. – dewet Nov 17 '12 at 17:18
Use an HTML parser. I've never used Ruby, but I guarantee that one exists, probably in the standard library. – Eric Nov 17 '12 at 17:18
I will go look for one. Thank you very much I really appreciate the help. – dewet Nov 17 '12 at 17:19
Nokogiri is your friend here: http://nokogiri.org – Brian Nov 17 '12 at 17:20
+1 for link to most awesome SO thread ever. – FK82 Nov 17 '12 at 17:46

score 0 · Answer 2 · answered Nov 17 '12 at 23:30

Parsing XML or HTML with a regex is marginally OK for trivial tasks, if you own or control the file's format. If you don't, then a simple change to the file could break your regex.
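
As a rough illustration of how little it takes, the sketch below runs the non-greedy pattern from the accepted answer against two hypothetical variations of the markup that a browser would treat as equivalent. The variations are made up for this example and are not from the original page.

```ruby
pattern = /(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m

original      = %(<span id="preview-local-desc">Some text</span>)
# Hypothetical edits: same element, slightly different serialization.
single_quotes = %(<span id='preview-local-desc'>Some text</span>)
extra_attr    = %(<span class="desc" id="preview-local-desc">Some text</span>)

# String#[] returns the matched text, or nil when the pattern does not match.
from_original      = original[pattern]
from_single_quotes = single_quotes[pattern]
from_extra_attr    = extra_attr[pattern]

p from_original      # => "Some text"
p from_single_quotes # => nil
p from_extra_attr    # => nil
```

The pattern could be loosened to tolerate each of these cases, but every new variation needs another special case.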

Using a parser will avoid that problem; I've parsed some horrible XML with Nokogiri and it didn't even notice. After writing an RSS aggregator that handled 1,000+ feeds, I was hooked on using a parser.

require 'nokogiri'

html = '<span id="preview-local-desc"> I want to pick up everything typed here.
    Paragraphs, everything.
    </span>'

doc = Nokogiri.HTML(html)
doc.at('span').text
# => " I want to pick up everything typed here.\n    Paragraphs, everything.\n    "

If there are multiple <span> tags you want:

doc.search('span').map(&:text)
# => [" I want to pick up everything typed here.\n    Paragraphs, everything.\n    "]

If there are multiple <span> tags and you only want this one:

doc.at('span#preview-local-desc').text
# => " I want to pick up everything typed here.\n    Paragraphs, everything.\n    "
