A simple Regular expression that bothers me

Question

I have the following text:

<!--:en-->&nbsp;

<!--:-->

I want to construct a pattern to extract it from a string (PHP). I tried:

<!--:[a-z]{2}-->(&nbsp;\r\n\s)<!--:-->

But it does not work. Does anybody know why, or could someone help me?

score 3 · Accepted Answer · edited May 23 '17 at 11:49


You probably don't want to use regex to parse XML/HTML, for a lot of reasons.

You would usually be better off parsing it with tools made for that specific task.

Anyway, what you need here is something more like:

(&nbsp;|\s)*
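To make the difference concrete, here is a minimal sketch comparing the pattern from the question with the suggested alternation. The sample string is an assumption about the exact whitespace in the input (a space and two line feeds after the entity), and a non-capturing group is used inside the capture purely for tidiness:

```php
<?php
// Assumed sample input: "&nbsp;", a space, a blank line, then the closing marker.
$s = "<!--:en-->&nbsp; \n\n<!--:-->";

// Pattern from the question: requires exactly "&nbsp;", CR, LF and then
// one whitespace character, in that order, right before the closing marker.
$strict = '/<!--:[a-z]{2}-->(&nbsp;\r\n\s)<!--:-->/';

// Suggested fix: any mix of "&nbsp;" entities and whitespace, in any order.
$loose = '/<!--:[a-z]{2}-->((?:&nbsp;|\s)*)<!--:-->/';

var_dump(preg_match($strict, $s));     // int(0)
var_dump(preg_match($loose, $s, $m));  // int(1)
var_dump($m[1]);                       // the entity plus the trailing whitespace
```

The strict pattern would also fail on an input with Windows line endings ("&nbsp;\r\n\r\n"), because after the first CR LF only a single whitespace character is allowed before the closing marker.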


answered Apr 11 '12 at 14:40

Colin Hebert


Leonard · Answer 2 · 2012-04-11T15:00:50.647


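You can escape special characters such as the hyphen (this is harmless, although not strictly required outside a character class), and the part between the two markers should accept any mix of &nbsp; and whitespace. Try this: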

/<\!\-{2}\:[a-z]{2}\-\->((&nbsp;|\s)*)<\!\-{2}\:\-{2}>/
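As a quick check, the following sketch runs the escaped pattern next to an unescaped equivalent on the same input. The sample string is an assumption about the exact whitespace in the original text:

```php
<?php
// Assumed sample input: "&nbsp;", a space, a blank line, then the closing marker.
$s = "<!--:en-->&nbsp; \n\n<!--:-->";

// Pattern from this answer, with "!", "-" and ":" escaped.
$escaped = '/<\!\-{2}\:[a-z]{2}\-\->((&nbsp;|\s)*)<\!\-{2}\:\-{2}>/';

// Same pattern without the escapes; in PCRE these characters are
// literal outside a character class.
$unescaped = '/<!--:[a-z]{2}-->((&nbsp;|\s)*)<!--:-->/';

$matchedEscaped = preg_match($escaped, $s, $a);
$matchedUnescaped = preg_match($unescaped, $s, $b);

// Group 1 holds everything between the two markers.
var_dump($matchedEscaped, $matchedUnescaped);
var_dump($a[1] === $b[1]);
```

Both patterns should capture the same text in group 1, so the escaping is a matter of taste here; the alternation in the middle is what makes the match succeed.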

edited Apr 11 '12 at 15:00

answered Apr 11 '12 at 14:42


Be careful: here you capture strings such as "ssspppsppps". – Colin Hebert Apr 11 '12 at 14:43
Thank you. I've amended my answer now to enforce &nbsp;. – Leonard Apr 11 '12 at 14:46

Now you capture the &nbsp;, but only one can be detected. – Colin Hebert Apr 11 '12 at 14:49
Phew, I've wrapped the whole thing in another group, which ought to include &nbsp; and white spaces. At this rate, however, I'm inclined to agree with you that regular expressions might not be the right course of action for this kind of thing. In PHP you can easily implement this functionality using strpos and an incrementing cursor. – Leonard Apr 11 '12 at 15:02
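For illustration, the strpos-and-cursor idea from the last comment could look roughly like the sketch below. The function name extractBlocks, the returned array shape and the second "fr" block in the sample input are all hypothetical; none of them come from the thread:

```php
<?php
// Hypothetical helper: collects the text between each "<!--:xx-->" opening
// marker and the next "<!--:-->" closing marker, without any regex.
function extractBlocks(string $text): array
{
    $open = '<!--:';
    $close = '<!--:-->';
    $blocks = [];
    $cursor = 0;

    while (($start = strpos($text, $open, $cursor)) !== false) {
        // The opening marker ends at the first "-->" after "<!--:".
        $markerEnd = strpos($text, '-->', $start + strlen($open));
        if ($markerEnd === false) {
            break;
        }
        $lang = substr($text, $start + strlen($open), $markerEnd - $start - strlen($open));
        $contentStart = $markerEnd + 3;

        // An empty language code means a stray closing marker; skip past it.
        if ($lang === '') {
            $cursor = $contentStart;
            continue;
        }

        $end = strpos($text, $close, $contentStart);
        if ($end === false) {
            break;
        }
        $blocks[] = [
            'lang' => $lang,
            'content' => substr($text, $contentStart, $end - $contentStart),
        ];
        // Advance the cursor past the closing marker and look for the next block.
        $cursor = $end + strlen($close);
    }

    return $blocks;
}

$blocks = extractBlocks("<!--:en-->&nbsp; \n\n<!--:--><!--:fr-->Bonjour<!--:-->");
var_dump($blocks);
```

Unlike the regex versions, this sketch does not check that the content consists only of &nbsp; and whitespace, nor that the language code is two lowercase letters; those checks would have to be added separately if they matter.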

score 1 · Answer 3 · answered Apr 11 '12 at 16:20

If I understood your question correctly, you want to match the entire text, comments included.

So, strictly for your specific problem, I would use something like this:

$s = "<!--:en-->&nbsp; 

<!--:-->";

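// $a receives the full match at index 0 (the pattern has no capture groups).
$a = array();
preg_match('/<!--:[a-z]{2}-->&nbsp;\\s+<!--:-->/', $s, $a);

// Escape each result so the comment markers stay visible in a browser.
foreach ($a as $match) {
  var_dump(htmlentities($match));
}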

In general, I won't argue about whether you should parse HTML with regular expressions. Note, though, that Colin is right when he says that realistically parsing HTML with regular expressions can be outstandingly hard (read "nearly impossible"), as the posts he indicated state.
