0

Could anyone tell me how I can match everything from an opening <div> tag to its closing </div> tag with a regular expression in Ruby?

For example, let's say I have:

<div>
<p>test content</p>
</div>

So far I have this:

< div [^>]* > [^<]*<\/div>

but it doesn't seem to work.

the Tin Man
user486174
  • 5
    [Are you sure you want to do this?](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) An html/xml parser is probably a better tool for the job... – PinnyM Nov 15 '12 at 22:10
  • I was just about to post that! – Zajn Nov 15 '12 at 22:12
  • 1
    Yeah, I am sure. I am just learning how to use regular expressions in Ruby; I might not even use this in real-life work. – user486174 Nov 15 '12 at 22:14
  • Unless you are dealing with HTML you own, it will never change, and it's a simple use, you should seriously consider using a parser. Regex isn't capable of processing even reasonably complex HTML. – the Tin Man Nov 15 '12 at 22:19
  • How do you identify what you want to match in the context of the page? What do you actually want as output? Are there possible IDs or classes on the div or its contents? – Mark Thomas Nov 15 '12 at 23:13

3 Answers

2

Nokogiri is great but, imho, there are situations when it cannot be used.

For your simple case you can use this:

puts str.scan(/<div>(.*)<\/div>/im).flatten.first

<p>test content</p>
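
As a minimal, self-contained sketch of the line above: the sample string is an assumption taken from the question's snippet, and the second input with two divs is made up to show how the greedy `.*` differs from a lazy `.*?`.

```ruby
# Sample input assumed from the question's snippet.
str = "<div>\n<p>test content</p>\n</div>"

# /m lets . match new-lines; /i ignores case. Group 1 is the div's content.
inner = str[/<div>(.*)<\/div>/im, 1]
puts inner.strip

# Hypothetical input with two divs, to compare greedy and lazy matching.
two = "<div>a</div><div>b</div>"
greedy = two.scan(/<div>(.*)<\/div>/im).flatten   # one match spanning both divs
lazy   = two.scan(/<div>(.*?)<\/div>/im).flatten  # one match per div

p greedy
p lazy
```

Neither form copes with nested divs or with attributes on the tag, so this is only suitable for markup as simple as the example.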
  • Too brittle. There are *so* many ways this can break if his actual page differs even slightly from his given simplified example. – Mark Thomas Nov 15 '12 at 23:08
  • 1
    Also, I've done a lot of scraping, and I have never encountered a situation where Nokogiri cannot be used. – Mark Thomas Nov 15 '12 at 23:14
  • +1 @MarkThomas, I've used Nokogiri for years and never found a situation where it was inadequate. I've written big spiders and RSS aggregators with it and it handled broken XML that caused other XML parsers to crash badly. – the Tin Man Nov 16 '12 at 00:43
  • Sure, it can be used on any HTML. I just meant that sometimes it cannot be installed, so it cannot be used on some systems. –  Nov 16 '12 at 05:11
1

To match just the opening <div> tag, use:

/<div[^>]*>/

But that is only the start. Because [^>] also matches a new-line, a line break inside the tag is fine, but the pattern stops early if an attribute value contains a >, and it also matches any other tag whose name merely starts with div.
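
A quick check against a few inputs shows how this pattern actually behaves. All of the sample strings and the constant name below are made up for illustration; none come from the question.

```ruby
OPEN_DIV = /<div[^>]*>/

# Hypothetical inputs covering a few ways a tag can be written.
samples = {
  plain:      '<div>',
  attrs:      '<div class="box" id="main">',
  newline:    "<div\nclass=\"box\">",   # [^>] matches the new-line too
  gt_in_attr: '<div title="a > b">',    # match stops at the first >
  other_tag:  '<divider>'               # not a div, but still matches
}

samples.each do |name, html|
  puts "#{name}: #{html[OPEN_DIV].inspect}"
end
```

The last two cases are the kind of extra checks that keep piling up once the pattern meets real markup.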

Eventually, after you've added all the extra checks for the possible ways a tag can be written, you'll want to consider a better way: using a parser like Nokogiri, which makes working with HTML and XML much easier.

For instance, since you're trying to tear apart the HTML:

<div>
<p>test content</p>
</div>

it's pretty easy to guess you really want to get to "test content". What if the HTML changed to:

<div><p>test content</p></div>

or worse:

<div
><p>
test
content
</div>

A browser won't care, nor will a good parser, but a regex will get upset and require rework.

require 'nokogiri'
require 'pp'

doc = Nokogiri.HTML(<<EOT)
    <div
    ><p>
    test
    content
    </div>
EOT
pp doc.at('p').text.strip.gsub(/\s+/, ' ')
# => "test content"

That's why we recommend parsers.
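
For comparison, here is a rough sketch of how a simple pattern fares against the three variants shown earlier. The pattern is the one suggested elsewhere in this thread; the constant and variable names are only for illustration.

```ruby
PATTERN = /<div>(.*)<\/div>/im

# The three HTML variants from this answer.
variants = [
  "<div>\n<p>test content</p>\n</div>",
  "<div><p>test content</p></div>",
  "<div\n><p>\ntest\ncontent\n</div>"
]

# Capture group 1 for each variant; nil means the pattern did not match.
results = variants.map { |html| html[PATTERN, 1] }

results.each_with_index do |captured, i|
  puts "variant #{i + 1}: #{captured.inspect}"
end
```

The first two variants match, but the third yields nil because the opening tag is split across lines, so the regex would need rework while the parser code above would not.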

the Tin Man
0

An HTML parser such as Nokogiri would probably be a better option than a regex, as PinnyM pointed out.

Here is a tutorial on the Nokogiri page that describes how to search an HTML/XML document.

This Stack Overflow question demonstrates something similar to what you want to accomplish, using CSS selectors. Perhaps something like that would work for you.

Community
Zajn