I have a probably simple question, but I cannot get my head around it. I have this multi-line text:

<p class='testing1_class'><span>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</span></p>
<p class='testing2_class'><span>Lorem Ipsum SomePhrase2 Lorem Lorem Lorem</span></p>
<p class='testing1_class'><span>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</span></p>

What I'd like to do is find every occurrence of "SomePhrase1" inside a single <p> element, without a match spilling over into a neighboring <p>.

This is the pattern I have so far, and its matches overlap into the next <p>:

<p.*?_class'><span.*?(SomePhrase1).*?<\/p>\n

Flags: /isg

Could anybody please help me? Thanks a lot!

OnlineCop
chrney
  • Are you using JavaScript to parse this, or another regex flavor? The answer below by @davidrac assumes non-JavaScript because it uses `(?<=...)` look-behinds, so updating your question with details (or adding specific tags to this post) would clarify which language you are using. – OnlineCop Oct 19 '15 at 17:22
  • Sorry, I should have said: PHP with the preg_* functions is what I'm aiming for. – chrney Oct 20 '15 at 04:18

2 Answers

As always with this kind of question, your best option for XML/HTML is to use an XML or HTML parser.
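
As a rough illustration of the parser route, here is a minimal sketch using Python's standard `html.parser` module (Python is only a stand-in here, since the question targets PHP). The class name `PhraseFinder` and its attributes are made up for this example, and the sketch assumes `<p>` elements are not nested, which real-world HTML may violate.

```python
from html.parser import HTMLParser

HTML = """<p class='testing1_class'><span>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</span></p>
<p class='testing2_class'><span>Lorem Ipsum SomePhrase2 Lorem Lorem Lorem</span></p>
<p class='testing1_class'><span>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</span></p>
"""


class PhraseFinder(HTMLParser):
    """Hypothetical helper: record every <p> whose text contains a phrase."""

    def __init__(self, phrase):
        super().__init__()
        self.phrase = phrase
        self.current_class = None  # class of the <p> we are inside, else None
        self.buffer = []           # text fragments seen inside the current <p>
        self.hits = []             # (class, text) for each matching <p>

    def handle_starttag(self, tag, attrs):
        # Assumption: <p> elements are not nested inside each other.
        if tag == "p":
            self.current_class = dict(attrs).get("class", "")
            self.buffer = []

    def handle_data(self, data):
        # Only collect text while we are inside a <p>.
        if self.current_class is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.current_class is not None:
            text = "".join(self.buffer)
            if self.phrase in text:
                self.hits.append((self.current_class, text))
            self.current_class = None


finder = PhraseFinder("SomePhrase1")
finder.feed(HTML)
finder.close()
for css_class, text in finder.hits:
    print(css_class, "->", text)
```

Because the text is gathered per `<p>`, a hit can never span two paragraphs, which is the overlap problem described in the question.
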

If you insist on using a regex, this should work as long as your input is similar to the example:

(?<=<span>Lorem Ipsum ).*?(?= Lorem Lorem Lorem<\/span>)

If you need to restrict it further you can use this regex:

(?<=<p class='testing\d_class'><span>Lorem Ipsum ).*?(?= Lorem Lorem Lorem<\/span>)

If you're using a regex flavor without lookaround support, replace the lookarounds with capture groups and use the second group:

(<span>Lorem Ipsum )(.*?)( Lorem Lorem Lorem<\/span>)

or

(<p class='testing\d_class'><span>Lorem Ipsum )(.*?)( Lorem Lorem Lorem<\/span>)
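
To compare the two styles side by side, here is a small sketch in Python's `re` module (chosen only for illustration; the patterns themselves are the ones above, and they should carry over to PCRE-based functions such as PHP's `preg_match_all`). The variable names are invented for this example.

```python
import re

TEXT = """<p class='testing1_class'><span>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</span></p>
<p class='testing2_class'><span>Lorem Ipsum SomePhrase2 Lorem Lorem Lorem</span></p>
<p class='testing1_class'><span>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</span></p>
"""

# Lookaround version: the surrounding text is asserted but not consumed,
# so the whole match is just the middle part.
lookaround = re.compile(r"(?<=<span>Lorem Ipsum ).*?(?= Lorem Lorem Lorem</span>)")

# Capture-group version: the surrounding text is consumed,
# so the wanted part has to be read from group 2.
grouped = re.compile(r"(<span>Lorem Ipsum )(.*?)( Lorem Lorem Lorem</span>)")

via_lookaround = lookaround.findall(TEXT)
via_groups = [m.group(2) for m in grouped.finditer(TEXT)]

print(via_lookaround)  # ['SomePhrase1', 'SomePhrase2', 'SomePhrase1']
print(via_groups)      # ['SomePhrase1', 'SomePhrase2', 'SomePhrase1']
```

Both variants return every phrase in that position, not only SomePhrase1; filtering for a specific phrase would be an extra step.
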
davidrac
  • I am sorry, but I cannot get your example(s) to work here: https://regex101.com/r/rO6sI3/1. Could you please help me out? Thanks! – chrney Oct 19 '15 at 18:35
  • That's because you have an extra newline at the end of the regex there: https://regex101.com/r/rO6sI3/2 – davidrac Oct 19 '15 at 18:42

A language such as PHP (or any other that uses PCRE) has a \K token, which means "reset the start of the match." That lets you spell out exactly which text must occur before the part you want, then discard it with \K, so the reported match begins just after that portion.

You can see this example here, where the <p> element is matched first and, once any other <...> elements have been matched, \K resets the match. The match is highlighted only where SomePhrase1 is present.
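
Here is a sketch of the same idea in Python, with two caveats: Python's built-in `re` module has no \K, so a capture group stands in for it, and the pattern below is my own reconstruction, not the one from the linked demo. It first shows why the pattern from the question overlaps, then keeps each match inside one `<p>` by using `[^<]*?`, which cannot cross a tag boundary.

```python
import re

TEXT = """<p class='testing1_class'><span>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</span></p>
<p class='testing2_class'><span>Lorem Ipsum SomePhrase2 Lorem Lorem Lorem</span></p>
<p class='testing1_class'><span>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</span></p>
"""

# The pattern from the question, with the i and s flags. Under the s flag,
# ".*?" may run past </p>, so a match that starts in the second <p> keeps
# going until it finds SomePhrase1 in the third one.
overlapping = re.compile(r"<p.*?_class'><span.*?(SomePhrase1).*?</p>\n", re.I | re.S)
bad = [m.group(0) for m in overlapping.finditer(TEXT)]

# Reconstructed alternative (an assumption, not the demo's exact regex):
#   <p\b[^>]*>          the opening <p ...> tag
#   (?:<[^/>][^>]*>)*   any further opening tags, e.g. <span>
#   [^<]*?              text only; stops at the next tag, so no spill-over
#   (SomePhrase1)       group 1 plays the role of "everything after \K"
confined = re.compile(r"<p\b[^>]*>(?:<[^/>][^>]*>)*[^<]*?(SomePhrase1)")
matches = list(confined.finditer(TEXT))
spans = [m.span(1) for m in matches]

print(len(bad), "SomePhrase2" in bad[1])
print([TEXT[start:end] for start, end in spans])
```

In PCRE, placing \K right before `SomePhrase1` and dropping the group should give the phrase as the whole match; that variant is untested here.
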

EDIT:

There are many edge cases that you may have to account for, where regexes applied to XML/HTML just utterly fail:

<p class='testing1_class'><span>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</span></p>
<p class='testing2_class'><span>Lorem Ipsum SomePhrase2 Lorem Lorem Lorem</span></p>
<p class='testing1_class'><span>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</span></p>
<span><p class="testing2_class"><p>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</p></p></span>
Lorem Ipsum SomePhrase1 Lorem Lorem Lorem
<span class='testing1_class'>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</span>
<p>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</p>
<p style='color: black;' class='foo bar testing1_class baf' id='#magic'>Lorem Ipsum SomePhrase1 Lorem Lorem Lorem</p>
<p class='testing1_class'>Lorem Ipsum <span>SomePhrase1</span> Lorem Lorem Lorem</p>
<p class='testing1_class'>Lorem Ipsum Lorem Lorem Lorem</p>
<p class='testing1_class'>Lorem <p>Ipsum SomePhrase1 Lorem</p> Lorem Lorem</p>
<p class='testing1_class'>SomePhraseX</p><p class='testing1_class'>WrongPhrase</p><p class='testing1_class'>Another Wrong Phrase</p>

A regex that handles all of these cases is very fragile and quickly becomes very complicated. jQuery would let you access the text MUCH more simply, however: JSFIDDLE

OnlineCop
  • Thanks! PHP using PCRE, correct. Could you help me with an example where the <p>'s class attribute is taken into consideration as well? I tried that, but it was a no-go for me. – chrney Oct 19 '15 at 18:36
  • You can, such as [this DEMO](https://regex101.com/r/rA8wT5/2), although you are going to have to deal with a *lot* of edge cases. – OnlineCop Oct 19 '15 at 19:21