1

This is my regex:

/<strong>.*ingredients.*<\/ul>/im

Assuming the source code:

<strong>Contest closes on Thursday May 10th 2012 at 9pm PST</strong></div>
<br />
<br />
<br />
* I am not affiliated with Blue Marble Brands or Ines Rosales Tortas in any way.&nbsp; I am not sponsored by them and did not receive any compensation to write this post...I just simply think the&nbsp;Tortas&nbsp;are wonderful!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-35J5vNrXkqE/T6htXTafrmI/AAAAAAAAA5E/g2mtiuSpSmw/s1600/food+003.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" mea="true" src="http://1.bp.blogspot.com/-35J5vNrXkqE/T6htXTafrmI/AAAAAAAAA5E/g2mtiuSpSmw/s640/food+003.JPG" width="640" /></a></div>
<br />
<strong><span style="font-size: large;">Ingredients:</span></strong><br />
<ul>
<li>Ines Rosales Rosemary and Thyme Tortas</li>
<li>Pizza Sauce (ready made in a jar)</li>
<li>Roma Tomatoes</li>
<li>Roasted Red Peppers </li>
<li>Marinated Artichoke Hearts</li>
<li>Olives (I used Pitted Spanish Manzanilla Olives)</li>
<li>Daiya Vegan Mozzarella Cheese</li>
</ul>
<span style="font-size: large;"><strong>Directions:</strong></span><br />
<br />
Spread small amount of pizza sauce over Torta. 

The regex is greedy and grabs everything from <strong>Contest...</ul>, but the shortest match should yield <strong><span style="font-size: large;">Ingredients...</ul>

This is my gist: https://gist.github.com/3660370

::EDIT:: Please allow flexibility in between the strong tag and ingredients, and between ingredients and ul.
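
The behaviour described above can be reproduced on a cut-down input. This is only a sketch: the `html` string is a shortened, hypothetical stand-in for the page source, and the lazy variant is included purely for comparison.

```ruby
# Shortened stand-in for the page source (an assumption for illustration).
html = <<~HTML
  <strong>Contest closes on Thursday</strong></div>
  <br />
  <strong><span style="font-size: large;">Ingredients:</span></strong><br />
  <ul>
  <li>Roma Tomatoes</li>
  </ul>
  <strong>Directions:</strong>
HTML

# The original pattern, and the same pattern with lazy quantifiers.
greedy_match = html[/<strong>.*ingredients.*<\/ul>/im]
lazy_match   = html[/<strong>.*?ingredients.*?<\/ul>/im]

# The engine tries the leftmost starting position first, so both matches
# begin at the first <strong>, not at the one nearest to "Ingredients".
puts greedy_match.start_with?("<strong>Contest")
puts lazy_match.start_with?("<strong>Contest")
```

Lazy quantifiers only change where a match ends, not where it starts, which is why switching to `.*?` alone does not give the shortest match here.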

Mr. Demetrius Michael
  • Note that with Ruby you can use `%r{..}` to denote your regex literals, so that you don't have to escape forward slashes, e.g. `%r{<strong>.*?ingredients.*?</ul>}im` – Phrogz Sep 06 '12 at 21:24
  • @KarolyHorvath - using the nongreedy `?` will not work here with `.*?` because he needs the first `<strong>` to be a late match. – Kash Sep 06 '12 at 21:55
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Reactormonk Sep 06 '12 at 22:09

3 Answers

0

Try this:

/<strong><span.*ingredients.*<\/ul>/im

Please refrain from regex-ing HTML. Use Nokogiri or a similar library instead.

Daniel Szmulewicz
  • +1 for recommending Nokogiri, but -1 for still using the greedy Kleene star instead of the non-greedy `.*?`. – Phrogz Sep 06 '12 at 21:25
  • Nokogiri isn't applicable in this particular case... I use it a lot, but I'm parsing different websites with strong, ingredients, and ul, and random stuff in between. I need to keep it as high level as possible. Sometimes the XML isn't formatted well, but the parsing engine should be powerful enough to handle that. – Mr. Demetrius Michael Sep 06 '12 at 21:35
  • It was meant as a general guideline. Good you're aware of it. Please accept answer if this gives you the shortest match. – Daniel Szmulewicz Sep 06 '12 at 21:51
  • @DanielSzmulewicz - your regex is applicable to the specific example that OP has given as an assumption. This will still not solve the original problem of a shortest match. – Kash Sep 06 '12 at 21:57
  • @Kash. Correct. My apologies, I didn't realize what I was getting into. – Daniel Szmulewicz Sep 06 '12 at 22:54
0

This should work:

/(?!<strong>.*<strong>.*<\/ul>)<strong>.*?ingredients.*?<\/ul>/im

Basically, the regex uses a negative lookahead to avoid multiple <strong> tags before </ul>, like this: (?!<strong>.*<strong>.*<\/ul>)
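
A rough check of the lookahead on a cut-down input, as a sketch only: the `html` and `more` strings are hypothetical stand-ins, not the real page. The second half adds another list after the directions heading to show a case where the lookahead rejects every candidate start.

```ruby
# Shortened stand-in for the page source (an assumption for illustration).
html = <<~HTML
  <strong>Contest closes on Thursday</strong></div>
  <br />
  <strong><span style="font-size: large;">Ingredients:</span></strong><br />
  <ul>
  <li>Roma Tomatoes</li>
  </ul>
  <strong>Directions:</strong>
HTML

lookahead = /(?!<strong>.*<strong>.*<\/ul>)<strong>.*?ingredients.*?<\/ul>/im
match = html[lookahead]

# The first <strong> is rejected because another <strong> and a </ul>
# follow it, so the match starts at the Ingredients heading.
puts match.start_with?("<strong><span")

# Hypothetical variation: a second list after the Directions heading.
more = html + "<ul>\n<li>Spread sauce</li>\n</ul>\n"

# Now the Ingredients <strong> is also followed by a <strong> and a </ul>,
# so the lookahead rejects it as well and nothing matches.
puts more[lookahead].inspect
```

The second case shows that the pattern depends on how many `<strong>` and `</ul>` pairs follow the one you want.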

Kash
  • Very nice solution. I think you meant negative lookbehind though. – Daniel Szmulewicz Sep 06 '12 at 23:00
  • It is still a negative lookahead. Lookbehinds are denoted by `(?<!....)`. Unfortunately, this solution will not work for 3 instances of `<strong>` and `</ul>`. Need to tweak this a bit. – Kash Sep 07 '12 at 05:23
0

I think this is what you're looking for:

/<strong>(?:(?!<strong>).)*ingredients.*?<\/ul>/im

Replacing the first .* with (?:(?!<strong>).)* allows it to match anything except another <strong> tag before it finds ingredients. After that, the non-greedy .*? causes it to stop matching at the first instance of </ul> it sees. (Your sample only contains the one <UL> element, but I'm assuming the real data could have more.)
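
As a quick illustration of the two parts working together, here is a sketch on a cut-down input. The `html` string is a hypothetical stand-in with two lists, not the real page.

```ruby
# Shortened stand-in with two <ul> elements (an assumption for illustration).
html = <<~HTML
  <strong>Contest closes on Thursday</strong></div>
  <strong><span style="font-size: large;">Ingredients:</span></strong><br />
  <ul>
  <li>Roma Tomatoes</li>
  </ul>
  <strong>Directions:</strong>
  <ul>
  <li>Spread sauce</li>
  </ul>
HTML

# (?:(?!<strong>).)* consumes one character at a time, but only while the
# next characters do not begin another <strong> tag.
tempered = /<strong>(?:(?!<strong>).)*ingredients.*?<\/ul>/im
match = html[tempered]

puts match.start_with?("<strong><span") # starts at the Ingredients heading
puts match.scan("</ul>").length         # number of </ul> tags in the match
```

Starting from the first `<strong>` fails because the guarded repetition cannot cross the second `<strong>` to reach "ingredients", so the engine moves on to the next candidate start.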

The usual warnings apply: there are many ways this regex can be fooled even in perfectly valid HTML, to say nothing of the dreck we usually see out there.

Alan Moore