I expected to find this on SO already, but haven't so far.

I'm talking about a regex that scans an HTML-encoded string, e.g. something like:

blip ♦ trout’s mouth

Have I covered all the bases with &\w+; and &#[0-9]+;?

$encoded_string = htmlspecialchars($_GET["searchterms"]);
echo "<b>Search results for submitted string: \"$encoded_string\"</b><br><br>";
$html_special_chars_pattern = "!(&\\w+;|&#[0-9]+;)!";
$non_html_tokens = preg_split( $html_special_chars_pattern, $encoded_string, -1, PREG_SPLIT_DELIM_CAPTURE );
mike rodent

2 Answers

You are missing the &#xH; or &#XH; numeric character references.

5.3.1 Numeric character references

Numeric character references specify the code position of a character in the document character set. Numeric character references may take two forms:

  • The syntax "&#D;", where D is a decimal number, refers to the ISO 10646 decimal character number D.

  • The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the ISO 10646 hexadecimal character number H. Hexadecimal numbers in numeric character references are case-insensitive.

As a regular expression, that is &#[xX][a-fA-F0-9]+;.
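A minimal sketch of how the named, decimal, and hexadecimal forms could be combined into one pattern and used with preg_split as in the question. The variable names and the sample string are illustrative only, and the named-entity branch `[a-zA-Z][a-zA-Z0-9]*` is an assumption chosen to be stricter than `\w+`, not a pattern taken from the specification.

```php
<?php
// Illustrative pattern: named, decimal, and hexadecimal references.
$entity_pattern = '/(&[a-zA-Z][a-zA-Z0-9]*;|&#[0-9]+;|&#[xX][a-fA-F0-9]+;)/';

// Hypothetical sample input containing one reference of each kind.
$encoded = 'blip &diams; trout&#8217;s &#x2666; mouth';

// The capturing group plus PREG_SPLIT_DELIM_CAPTURE keeps the
// references themselves in the result, between the plain-text pieces.
$tokens = preg_split($entity_pattern, $encoded, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($tokens);
```

Here the plain text and the references alternate in `$tokens`, so the odd-indexed elements are the references.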

Alexander

I have reposted my earlier related answer here. If anyone comes up with a better solution, or can show where this one breaks, do let me know :)

preg_match_all('/&(?:[a-z]+|#\d+);/', $content, $matches);

To support hexadecimal entities as well:

preg_match_all('/&(?:[a-z]+|#\d+|#x[0-9a-f]+);/i', $content, $matches);

Btw, (?: ... ) is a non-capturing group: it groups the alternatives without storing what they matched. See also: What does `?` mean in this Perl regex?
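To make the difference concrete, here is a small sketch comparing a capturing and a non-capturing group. The sample string and variable names are made up for illustration. It also shows one consequence for preg_split: PREG_SPLIT_DELIM_CAPTURE only returns delimiters matched by a capturing group, so a purely non-capturing pattern drops the entities from the result.

```php
<?php
// Hypothetical sample input.
$content = 'fish &amp; chips &#38; peas';

// Capturing group: index 1 holds what the parentheses matched.
preg_match_all('/&([a-z]+|#\d+);/', $content, $capturing);

// Non-capturing group: only the full matches (index 0) are stored.
preg_match_all('/&(?:[a-z]+|#\d+);/', $content, $non_capturing);

echo count($capturing), "\n";      // 2: full matches plus group 1
echo count($non_capturing), "\n";  // 1: full matches only

// With preg_split, delimiters are kept only if a capturing group matched them.
$with_delims = preg_split('/(&(?:[a-z]+|#\d+);)/', $content, -1, PREG_SPLIT_DELIM_CAPTURE);
$without_delims = preg_split('/&(?:[a-z]+|#\d+);/', $content, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($with_delims);     // text pieces interleaved with the entities
print_r($without_delims);  // text pieces only
```

So for the tokenizing approach in the question, the outer capturing parentheses are worth keeping, while any inner grouping can be non-capturing.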

Ja͢ck
  • thanks... as you can see in the answer below, there appears also to be the question of the hexadecimal refs. Also I'm trying to understand what function the "?:" sequence has in your regex... – mike rodent Dec 16 '12 at 13:59
  • @mikerodent, `(?:)` is a non-capturing group – Alexander Dec 16 '12 at 14:01