
I have the following HTML string:

<span class='together'>line one,<br><span class='indent'>line two.</span><br>Line three,<br><span class='indent'>line four,<br>line five,<br>line six,<br>line seven;<br>line eight.<br>Line nine;<br>line ten,<br>line eleven,<br>line twelve.</span><br>Line thriteen,<br><span class='indent'>line fourteen,<br>line fifteen,<br>line sixteen,<br>line seventeen,<br>line eighteen.</span></span>

I am trying to find a regex that will match all the <br>'s that are between the <span class='indent'> and its closing </span>. The <span class='together'> encapsulates the whole string and should just be ignored.

At the moment the best I can do is: <span class='indent'>.*?(<br>).*?<\/span> which doesn't work at all. The first <br> this grabs is outside of the <span> and then it skips over a bunch of other <br>'s that I want (See here).

Is this possible? Should I instead use <span class='indent'>(.*?)\<\/span> and then parse the captured group later?

As you can tell my regex knowledge is pretty limited.

nhahtdh
Ampers

1 Answer


In the comments on another answer you wrote:

The content between the spans will only have a <br> tag in it and no other HTML...

If there are only <br> tags and no other tags between <span class='indent'> and the <br> being matched, try a lookbehind. Only finite repetition is allowed inside the lookbehind, so you need to set an upper limit on how long the content inside the span can be.

(?s)(?<=<span class='indent'>(?:(?!</?span).){0,9999}?)<br>

I just picked 9999; you might need a higher value depending on your input. Demo at regexplanet (click Java). The negative lookahead (?!</?span) keeps the lookbehind from crossing an opening or closing span tag while looking back.
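To make the lookbehind concrete, here is a minimal Java sketch that applies the pattern above to a shortened version of the sample. The class name IndentedBreaks, the method markIndentedBreaks, the shortened input, and the "<br/>" marker are all made up for illustration; only the regex itself comes from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndentedBreaks {
    // The pattern from the answer: a <br> preceded by <span class='indent'>
    // with no opening or closing span tag in between (at most 9999 chars back).
    private static final Pattern INDENTED_BR = Pattern.compile(
        "(?s)(?<=<span class='indent'>(?:(?!</?span).){0,9999}?)<br>");

    // Hypothetical helper: replaces every matched <br> with `replacement`.
    static String markIndentedBreaks(String html, String replacement) {
        return INDENTED_BR.matcher(html)
                          .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Shortened stand-in for the string in the question (an assumption).
        String html = "<span class='together'>one,<br>"
            + "<span class='indent'>two,<br>three,<br>four.</span><br>"
            + "Five,<br><span class='indent'>six.</span></span>";
        // Only the <br>s inside an indent span should be turned into <br/>.
        System.out.println(markIndentedBreaks(html, "<br/>"));
    }
}
```

Because the lookbehind is re-evaluated at every <br>, long inputs with a large upper bound may get slow; that is a trade-off of this approach rather than something measured here.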

This only works for data like your sample, not with any nested spans. In that case, use a parser.
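For comparison, the two-step route raised in the question (capture the contents of each indent span, then deal with the <br>'s inside the captured text) could look roughly like the sketch below. It is not a parser and shares the same limitation with nested spans. The names TwoStepBreaks and replaceIndentedBreaks, the shortened input, and the "<br/>" marker are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepBreaks {
    // Step 1: capture the opening tag, the contents, and the closing tag
    // of each (non-nested) indent span.
    private static final Pattern INDENT_SPAN = Pattern.compile(
        "(<span class='indent'>)(.*?)(</span>)", Pattern.DOTALL);

    // Hypothetical helper: rewrites <br> only inside the captured contents.
    static String replaceIndentedBreaks(String html, String replacement) {
        Matcher m = INDENT_SPAN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Step 2: plain string replacement on the captured group.
            String inner = m.group(2).replace("<br>", replacement);
            String rebuilt = m.group(1) + inner + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(rebuilt));
        }
        m.appendTail(out); // copy whatever follows the last span
        return out.toString();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the string in the question (an assumption).
        String html = "<span class='together'>one,<br>"
            + "<span class='indent'>two,<br>three,<br>four.</span><br>"
            + "Five,<br><span class='indent'>six.</span></span>";
        System.out.println(replaceIndentedBreaks(html, "<br/>"));
    }
}
```

This version needs no upper bound on the span length, at the cost of rebuilding the string instead of matching each <br> directly.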

bobble bubble
  • Thanks for your work, bobble bubble. I'm marking this as the correct answer as it does do what I asked. However, a parser might be the "correct" way to solve my issue. In fact, I ended up using regex to find the contents of my indent span and then did some simple finding and replacing to deal with the <br>'s. – Ampers Nov 09 '15 at 02:02
  • You're welcome @Ampers, thank you! Sounds like you found the optimal way to deal with it. As for parser versus regex, I think it depends on the problem and on whether you are parsing arbitrary HTML or your own. – bobble bubble Nov 09 '15 at 03:09