3

I have a data file (an Apple plist, to be exact) that has Unicode codepoints like \U00e8 and \U2019. I need to turn these into valid hexadecimal HTML entities using PHP.

What I'm doing right now is a long string of:

 $fileContents = str_replace("\U00e8", "&#xe8;", $fileContents);
 $fileContents = str_replace("\U2019", "&#x2019;", $fileContents);

This is clearly dreadful. I could use a regular expression to convert the \U and any zeros that follow it to &#x, then append the closing ;, but that also seems heavy-handed.

Is there a clean, simple way to take a string and replace all the Unicode codepoints with HTML entities?

BalusC
Tina Marie
  • PCRE regular expressions are pretty fast and safe; I'd use them. (Other, official solutions will probably use a regex too. Or a lookup table, which is what you have now.) – MvanGeest Aug 13 '10 at 19:30
  • 2
    According to [this page](http://code.google.com/p/networkpx/wiki/PlistSpec), those escape sequences represent UTF-16 code units, not Unicode code points. This means you may have to combine two successive code units (if they form a surrogate pair) to form an HTML entity. – Artefacto Aug 13 '10 at 21:30

2 Answers

7

Here's a correct answer that accounts for the escapes being UTF-16 code units, not code points, and so can also decode supplementary characters.

function unenc_utf16_code_units($string) {
    /* go for possible surrogate pairs first */
    $string = preg_replace_callback(
        '/\\\\U(D[89ab][0-9a-f]{2})\\\\U(D[c-f][0-9a-f]{2})/i',
        function ($matches) {
            $hi_surr = hexdec($matches[1]);
            $lo_surr = hexdec($matches[2]);
            // parenthesized explicitly: in PHP, + binds tighter than |
            $scalar = 0x10000 + ((($hi_surr & 0x3FF) << 10) |
                ($lo_surr & 0x3FF));
            return "&#x" . dechex($scalar) . ";";
        }, $string);
    /* now the rest */
    $string = preg_replace_callback('/\\\\U([0-9a-f]{4})/i',
        function ($matches) {
            //just to remove leading zeros
            return "&#x" . dechex(hexdec($matches[1])) . ";";
        }, $string);
    return $string;
}
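As a worked example of the surrogate-pair arithmetic used in the first callback, here is a minimal standalone sketch. The helper name surrogate_pair_to_scalar is hypothetical (it is not part of the answer), and the sample pair \UD83D\UDE00 is just an illustrative input.

```php
<?php
// Hypothetical helper (not from the answer): combine a UTF-16 surrogate
// pair into a single Unicode scalar value.
function surrogate_pair_to_scalar(int $hi, int $lo): int
{
    // Each surrogate carries 10 payload bits in its low bits.
    $high_bits = ($hi & 0x3FF) << 10;
    $low_bits  = $lo & 0x3FF;
    // Supplementary characters start at U+10000.
    return 0x10000 + ($high_bits | $low_bits);
}

// Illustrative input: the pair \UD83D\UDE00 encodes U+1F600.
$scalar = surrogate_pair_to_scalar(0xD83D, 0xDE00);
echo '&#x' . dechex($scalar) . ';', "\n"; // prints &#x1f600;
```

The high surrogate contributes the upper 10 bits and the low surrogate the lower 10 bits, which is why the pairs have to be handled before the single four-digit escapes.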
Artefacto
4

You can use preg_replace:

$fileContents = preg_replace('/\\\\U0*([0-9a-fA-F]{1,5})/', '&#x\1;', $fileContents);

Testing the RE:

PS> 'some \U00e8 string with \U2019 embedded Unicode' -replace '\\U0*([0-9a-f]{1,5})','&#x$1;'
some &#xe8; string with &#x2019; embedded Unicode
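
The same check can be run in PHP itself. This is a small sketch that wraps the pattern from this answer in a function; the name simple_escapes_to_entities is hypothetical, and the sample string is the one from the test above.

```php
<?php
// Hypothetical wrapper name; the pattern is the one given in the answer.
function simple_escapes_to_entities(string $text): string
{
    // In a single-quoted PHP string, \\\\ becomes \\ in the regex,
    // which matches one literal backslash.
    return preg_replace('/\\\\U0*([0-9a-fA-F]{1,5})/', '&#x$1;', $text);
}

echo simple_escapes_to_entities('some \U00e8 string with \U2019 embedded Unicode'), "\n";
// prints: some &#xe8; string with &#x2019; embedded Unicode
```

Because the group accepts up to five hex digits, a hex digit that directly follows an escape (for example the a in \U00e8a) would be absorbed into the entity, and surrogate pairs are not combined, so this form is best suited to input without those cases.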
Joey
  • Seems like a clear use case for regex. @Tina Marie, check out http://code.google.com/p/cfpropertylist/ if you need any more plist handling. – Brandon Horsley Aug 13 '10 at 19:37