
I have the following text:

<!--:en-->&nbsp;

<!--:-->

I want to build a pattern to extract it from a string in PHP. I tried:

<!--:[a-z]{2}-->(&nbsp;\r\n\s)<!--:-->

But it does not work. Does anybody know why, or can anyone help me?

Federico Zancan
José Carlos

3 Answers


You probably don't want to use regex to parse XML/HTML, for a lot of reasons.

Usually you would prefer to parse it with tools made for this specific task.


Anyway, what you need here is something more like:

(&nbsp;|\s)*
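
To illustrate why the original attempt fails and what this group changes, here is a minimal sketch. It assumes the sample text uses `\n` line endings and uses `~` as the pattern delimiter; the variable names are only illustrative. The original group `(&nbsp;\r\n\s)` demands exactly one `&nbsp;`, then a CR, an LF, and a single further whitespace character, in that order. The alternation accepts any mix of `&nbsp;` entities and whitespace instead.

```php
<?php
// Sample input from the question, assuming "\n" line endings.
$text = "<!--:en-->&nbsp;\n\n<!--:-->";

// Original attempt: requires the exact sequence &nbsp; CR LF whitespace.
$original = '~<!--:[a-z]{2}-->(&nbsp;\r\n\s)<!--:-->~';

// Suggested group: any run of &nbsp; entities and whitespace characters.
$relaxed = '~<!--:[a-z]{2}-->((?:&nbsp;|\s)*)<!--:-->~';

$originalMatches = preg_match($original, $text);
$relaxedMatches  = preg_match($relaxed, $text, $m);

var_dump($originalMatches); // 0: the input has no CR after &nbsp;
var_dump($relaxedMatches);  // 1
var_dump($m[1]);            // the captured run: &nbsp; plus two newlines
```

With `\r\n` line endings the original pattern should still fail, because the blank line adds a second CR LF pair and the lone `\s` can consume only one character of it.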
Colin Hebert

You need to escape special characters, such as the hyphen. Try this:

/<\!\-{2}\:[a-z]{2}\-\->((&nbsp;|\s)*)<\!\-{2}\:\-{2}>/
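
A quick check of this pattern against the sample text, as a sketch that assumes `\n` line endings. In PCRE the hyphen, `!` and `:` are ordinary characters outside a character class, so the backslashes are harmless but appear to be optional; the unescaped variant below is my own rewrite for comparison, not part of the answer.

```php
<?php
// Sample input from the question, assuming "\n" line endings.
$text = "<!--:en-->&nbsp;\n\n<!--:-->";

// Pattern from this answer, with the punctuation characters escaped.
$escaped = '/<\!\-{2}\:[a-z]{2}\-\->((&nbsp;|\s)*)<\!\-{2}\:\-{2}>/';

// The same pattern with the backslashes removed (illustrative rewrite).
$plain = '/<!-{2}:[a-z]{2}-->((&nbsp;|\s)*)<!-{2}:-{2}>/';

$escapedResult = preg_match($escaped, $text, $a);
$plainResult   = preg_match($plain, $text, $b);

var_dump($escapedResult === 1 && $plainResult === 1); // both patterns match
var_dump($a === $b);                                  // and yield the same groups
```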
Leonard
  • Be careful here, you capture strings such as "ssspppsppps" – Colin Hebert Apr 11 '12 at 14:43
  • Thank you. I've emended my answer now to enforce &nbsp; – Leonard Apr 11 '12 at 14:46
  • Now you capture the &nbsp; but only one can be detected. – Colin Hebert Apr 11 '12 at 14:49
  • Phew, encapsulated the whole thing in another group, which ought to include &nbsp; and white spaces. At this rate, however, I'm inclined to agree with you that regular expressions might not be the right course of action for this kind of thing - in PHP you can easily implement this functionality using strpos and an incrementing cursor. – Leonard Apr 11 '12 at 15:02
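
The `strpos` idea from the last comment could look roughly like the sketch below. The function name `extractBlocks` and the returned array shape are hypothetical, and the code assumes markers of the exact form `<!--:xx-->` (two lowercase letters) and `<!--:-->`.

```php
<?php
/**
 * Hypothetical helper: collect the text between each "<!--:xx-->" opener
 * and the next "<!--:-->" closer using strpos and a moving cursor.
 */
function extractBlocks(string $text): array
{
    $open = '<!--:';
    $close = '<!--:-->';
    $blocks = [];
    $cursor = 0;

    while (($start = strpos($text, $open, $cursor)) !== false) {
        // Expect a two-letter language code followed by "-->".
        $lang = substr($text, $start + strlen($open), 2);
        $bodyStart = $start + strlen($open) + 2 + 3;
        $isOpener = strlen((string) $lang) === 2
            && ctype_lower($lang)
            && substr($text, $bodyStart - 3, 3) === '-->';
        if (!$isOpener) {
            // Not an opener (for example, a stray closer); move past it.
            $cursor = $start + strlen($open);
            continue;
        }
        $end = strpos($text, $close, $bodyStart);
        if ($end === false) {
            break; // unterminated block
        }
        $blocks[] = [
            'lang' => $lang,
            'body' => substr($text, $bodyStart, $end - $bodyStart),
        ];
        $cursor = $end + strlen($close);
    }
    return $blocks;
}

$blocks = extractBlocks("<!--:en-->&nbsp;\n\n<!--:-->");
var_dump($blocks);
```

This version does not check that the body contains only `&nbsp;` and whitespace; that check would have to be added separately if it matters.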

If I correctly understood your question, you have to match the entire text, comments included.

So, strictly for your specific problem, I would use something like this:

$s = "<!--:en-->&nbsp; 

<!--:-->";

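// $a receives the full match at index 0 (the pattern has no capture groups).
$a = array();
preg_match('/<!--:[a-z]{2}-->&nbsp;\\s+<!--:-->/', $s, $a);

// htmlentities() keeps the comment markers visible when the dump is viewed in a browser.
foreach ($a as $match) {
  var_dump(htmlentities($match));
}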

I won't debate here whether you should parse HTML with regular expressions in general. Note, though, that Colin is right when he says that realistically parsing HTML with regular expressions can be extremely hard (read "nearly impossible"), as the posts he pointed to state.

Federico Zancan