I'm having some trouble defining my question, sorry for the bad title.

I know regex is a bad way to parse HTML, but I don't have another option. I'll just tell you what I've got.

I have the following string:

<span id=pink>some short text</span>
  more text that can be a few lines
 <span id=pink>again a short text</span> 
 More text that's abiding the same logic
<span id=pink>Repeat</span>...(more of these)

And this repeats itself multiple times.

Now I want to extract the text between each <span id=pink> element and the next one. For the above example, I'd like to return:

  • more text that can be a few lines
  • More text that's abiding the same logic

Now I've tried the following regex:

preg_match_all('/<span id=pink>.*?<\/span>(.*?)<span id=pink>.*?<\/span>/s', $data, $content);

This partially works. The problem is that after finding the first match, it doesn't treat the <span id=pink> that closed the previous group as the opener of the next group. With the above example it only finds the first group, and with more "rows" in the string it skips every other group.

EDIT:

  • There are no newlines in the real string; they are here only for readability.
  • I know HTML parsing is better using a parser instead of regex, but sadly I need to solve this using regex.

How can I solve this? It feels like I'm missing some simple solution. Is there a modifier, perhaps, that achieves this?

Thanks, Eric

eric.itzhak
  • Don't use regex for matching HTML – anubhava Apr 22 '17 at 18:15
  • @anubhava The original string is not valid HTML, so I had problems using a DOM parser – eric.itzhak Apr 22 '17 at 18:16
  • I doubt that what you show is your real string since an id is supposed to be unique. Show your real string. – Casimir et Hippolyte Apr 22 '17 at 18:16
  • @CasimiretHippolyte Like I mentioned above, the HTML is invalid and sadly, yes, there are multiple ids named pink. I simplified the string because it's long and in a different language. – eric.itzhak Apr 22 '17 at 18:17
  • @WiktorStribiżew Thanks for the link but I don't feel like it's really related to my issue. Basically what I want is the "closing" pattern to be reused as the "opening" pattern – eric.itzhak Apr 22 '17 at 18:25
  • @CasimiretHippolyte This is not a duplicate dude, I think you didn't understand what I asked but the answer I marked as correct is the exact answer I was looking for, and not any of the answers in the question you marked. – eric.itzhak Apr 24 '17 at 07:13

1 Answer

Basically what I want is the "closing" pattern to be reused as the "opening" pattern

If that is the case then you can use a lookahead:

<span id=pink>.*?<\/span>(.*?)(?=<span|$)

Live Demo

You of course still need to use the single-line flag (`s`), so that `.` also matches newlines.
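As a minimal sketch of how this could be wired into `preg_match_all`, assuming input shaped like the sample in the question (the contents of `$data` and the `$between` variable are illustrative, not taken from the real data):

```php
<?php
// Sample input modeled on the question; the real string is assumed to look similar.
$data = '<span id=pink>some short text</span> more text that can be a few lines '
      . '<span id=pink>again a short text</span> More text that\'s abiding the same logic '
      . '<span id=pink>Repeat</span>';

// (?=<span|$) only checks that a <span (or the end of the string) follows.
// It does not consume it, so the next match can start at that same <span.
preg_match_all('/<span id=pink>.*?<\/span>(.*?)(?=<span|$)/s', $data, $content);

// $content[1] holds the captured text after each span. The last span is
// followed by nothing, so its capture is empty; trim and drop empty entries.
$between = array_values(array_filter(array_map('trim', $content[1]), 'strlen'));

print_r($between);
```

With this input, `$between` should contain the two text runs between the spans. The filtering step is optional and only there because the `|$` alternative also lets the final span match with an empty capture.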

vallentin
  • People have been throwing downvotes here if they don't understand the question/answer... Doesn't this require the span to be followed by newlines? Asking because of the span|$. The problem is that the next one won't necessarily be on a new line – eric.itzhak Apr 22 '17 at 18:31
  • You of course still need the single line flag. Take a look at the live demo. – vallentin Apr 22 '17 at 18:32
  • Yes, I just tried that and it seems to work! Thank you, man! What was the magic here though? The 2nd group or the ?= – eric.itzhak Apr 22 '17 at 18:34
  • Yes, the magic is the `?=` (positive lookahead). Put simply, it checks whether something comes next but doesn't include it in the match, so the next match can still start at that part. – vallentin Apr 22 '17 at 18:36
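
To make the difference between consuming and looking ahead concrete, here is a small comparison sketch. The four-span input and the texts `one`, `two`, `three` are hypothetical placeholders, chosen so that the "skips every other group" behaviour from the question shows up:

```php
<?php
// Hypothetical four-span sample; the texts one/two/three are placeholders.
$data = '<span id=pink>A</span>one<span id=pink>B</span>two'
      . '<span id=pink>C</span>three<span id=pink>D</span>';

// Consuming version: the closing span is part of the match, so the search
// resumes after it and that span cannot serve as the next opener.
preg_match_all(
    '/<span id=pink>.*?<\/span>(.*?)<span id=pink>.*?<\/span>/s',
    $data,
    $consuming
);

// Lookahead version: the following <span is only checked, not consumed,
// so the next match can begin exactly there.
preg_match_all(
    '/<span id=pink>.*?<\/span>(.*?)(?=<span|$)/s',
    $data,
    $lookahead
);

print_r($consuming[1]); // one, three ("two" is skipped)
print_r($lookahead[1]); // one, two, three, plus an empty capture after the last span
```

The trailing empty capture comes from the `|$` alternative, which lets the last span match even though no text follows it.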