I have stumbled upon an interesting piece of code written in Python:

from struct import pack

chars = [109, 0, 97, 0, 110, 0, 105, 0, 102, 0, 101, 0, 115, 0, 116, 0]
length = 16

data = ""
for i in range(0, length):
    ch = pack("=b", chars[i])
    data += unicode(ch, errors='ignore')

    if data[-2:] == "\x00\x00":
        break

end = data.find("\x00\x00")
if end != -1:
    data = data[:end]

print(len(data.decode("utf-16", "replace")))  # outputs 8; the string is 'manifest'
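
For comparison, here is a minimal Python 3 sketch of the same decoding. It assumes the bytes are little-endian UTF-16 without a BOM and that a NUL code unit terminates the string; the helper name `decode_utf16le` is only illustrative.

```python
# The byte values from the question: "manifest" as UTF-16, low byte first.
chars = [109, 0, 97, 0, 110, 0, 105, 0, 102, 0, 101, 0, 115, 0, 116, 0]


def decode_utf16le(values):
    """Decode a list of byte values as UTF-16LE, stopping at a NUL code unit."""
    raw = bytes(values)
    # Walk the data two bytes at a time so the terminator search stays
    # aligned to code-unit boundaries.
    for offset in range(0, len(raw) - 1, 2):
        if raw[offset:offset + 2] == b"\x00\x00":
            raw = raw[:offset]
            break
    return raw.decode("utf-16-le", "replace")


text = decode_utf16le(chars)
print(len(text), text)  # 8 manifest
```

Searching in two-byte steps avoids matching a `\x00\x00` pair that straddles two characters, which a plain substring search could do.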

As you can see, Python decodes the UTF-16 data correctly. However, when I port the code to PHP, I get the wrong result:

$chars = array(109, 0, 97, 0, 110, 0, 105, 0, 102, 0, 101, 0, 115, 0, 116, 0);
$length = 16;

$data = "";
for ($i = 0; $i < $length; $i++) {
    $data .= pack("c", $chars[$i]);

    if (substr($data, -2) == "\x00\x00") {
        break;
    }
}

$end = strpos($data, "\x00\x00");
if ($end !== false) {
    $data = substr($data, 0, $end);
}

// mb_convert_encoding() doesn't seem to work
echo strlen($data); // outputs 16

The only solution I see is to just give up on the UTF magic and change the loop to:

for ($i = 0; $i < $length; $i+=2)

Is there a proper way to decode this in PHP, or should I just use the modified for loop?

Thank you.

Vanity
  • "Decode UTF-16" *to what* exactly? – deceze Jul 04 '14 at 10:42
  • Your primary problem is that `utf8_encode` is nowhere near what `unicode` does in Python. – deceze Jul 04 '14 at 10:44
  • @deceze: Yeah, I've noticed that at some point, but it seems to have slipped by; I'll remove it. – Vanity Jul 04 '14 at 10:46
  • To answer my own question three comments up: your question should be *"Interpreting an array of integers as UTF-16 encoded bytes and converting it to a UTF-8 encoded string"*... – deceze Jul 04 '14 at 11:48

1 Answer

First of all, take a look at "How can I convert array of bytes to a string in PHP?".

Using that solution, you can pack your byte array into a string and then convert it from UTF-16 to UTF-8:

$chars = array(109, 0, 97, 0, 110, 0, 105, 0, 102, 0, 101, 0, 115, 0, 116, 0);
$str = call_user_func_array("pack", array_merge(array("C*"), $chars));
$convertedStr = iconv('utf-16', 'utf-8', $str);

var_dump($str);
var_dump($convertedStr);

Executing this script outputs the following (the 16-byte string contains NUL bytes, which are not visible here):

string(16) "manifest"
string(8) "manifest"
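
One caveat: the data here has no byte order mark, and how a bare `utf-16` label is resolved in that case can depend on the iconv implementation, so naming the byte order explicitly (for example `UTF-16LE`) is probably safer. The effect of the byte order can be sketched in Python, assuming the same 16 bytes as above:

```python
# The same 16 bytes as in the question, with no byte order mark.
raw = bytes([109, 0, 97, 0, 110, 0, 105, 0, 102, 0, 101, 0, 115, 0, 116, 0])

# An explicit little-endian label gives the expected text.
little = raw.decode("utf-16-le")

# The big-endian reading is also valid UTF-16, but yields different characters.
big = raw.decode("utf-16-be")

# With a little-endian BOM (FF FE) in front, the generic label is unambiguous.
with_bom = (b"\xff\xfe" + raw).decode("utf-16")

print(little)         # manifest
print(little == big)  # False
print(with_bom)       # manifest
```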
ragol
  • Ah, `iconv` seems to do much better! By the way, is there any noticeable difference between 'C*' and 'c'? – Vanity Jul 04 '14 at 10:43
  • You need the asterisk because otherwise only the first value will be packed. Whether you use 'c' or 'C' makes no difference in this case, because all the values are at most 127 and thus fit in both a signed and an unsigned byte. – ragol Jul 04 '14 at 10:50
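
The signed versus unsigned point can be illustrated with Python's `struct` module, which the question's original code already uses. This is only a sketch: `fits` is a hypothetical helper, and PHP's `pack()` may handle out-of-range values differently from Python, which raises an error.

```python
from struct import pack, error

chars = [109, 0, 97, 0, 110, 0, 105, 0, 102, 0, 101, 0, 115, 0, 116, 0]

# For values in 0..127 the signed ("b") and unsigned ("B") formats
# produce the same single byte.
assert pack("b", 109) == pack("B", 109) == b"m"

# A repeat count packs every value in one call, much like "C*" in PHP.
packed = pack("<%dB" % len(chars), *chars)


def fits(fmt, value):
    """Return True if `value` can be packed with the one-byte format `fmt`."""
    try:
        pack(fmt, value)
        return True
    except error:
        return False


print(packed == bytes(chars))  # True
print(fits("b", 200))          # False: 200 is outside -128..127
print(fits("B", 200))          # True: 200 is inside 0..255
```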