PHP Multibyte UTF-8 Strings are Slowly Degrading

Question

I'm trying to convert the following javascript function to PHP:

function lzw_decode(s) {
    var dict = {};
    var data = (s + "").split("");
    var currChar = data[0];
    var oldPhrase = currChar;
    var out = [currChar];
    var code = 256;
    var phrase;
    debugger;
    for (var i=1; i<data.length; i++) {
        var currCode = data[i].charCodeAt(0);
        if (currCode < 256) {
            phrase = data[i];
        }
        else {
           phrase = dict[currCode] ? dict[currCode] : (oldPhrase + currChar);
        }
        out.push(phrase);
        currChar = phrase.charAt(0);
        dict[code] = oldPhrase + currChar;
        code++;
        oldPhrase = phrase;
    }
    return out.join("");
}

This is my converted code:

function uniord($c) {

    $a = unpack('N', mb_convert_encoding($c, 'UCS-4BE', 'UTF-8'));
    return($a[1]);

}

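# PHP 7.4+ ships its own mb_str_split(); only define this fallback on older versions.
if (!function_exists('mb_str_split')) {
    function mb_str_split( $string ) {
        # Split at every position that is not the start (^)
        # and not the end ($) of the string.
        return preg_split('/(?<!^)(?!$)/u', $string );
    }
}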

function decode($s){

    $dict = array();

    $data = mb_str_split($s);

    // print_r($data);


    $currChar = $data[0];
    // echo $currChar;
    // exit();
    $oldPhrase = $currChar;

    $out = array();
    $out[] = $currChar;
    $code = 256;
    $phrase = '';

    for ($i=1; $i < count($data); $i++) { 

        $currCode = uniord($data[$i]);

        if($currCode < 256){
            $phrase = $data[$i];
        }else{
            $phrase = isset($dict[$currCode]) ? $dict[$currCode] : ($oldPhrase . $currChar);
        }

        $out[] = $phrase;


        // Pass the encoding explicitly so this does not depend on mb_internal_encoding().
        $currChar = mb_substr($phrase, 0, 1, 'UTF-8');
        $dict[$code] = $oldPhrase . $currChar;

        $code++;
        $oldPhrase = $phrase;

    }

    return implode("", $out);

}

This function works for some LZW-encoded strings, but with a long enough string the output is not 100% accurate. My guess is that the problem lies in multibyte string handling and my own carelessness. Does anyone have any ideas?
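
To make the multibyte concern concrete, here is a minimal JavaScript sketch (the helper name `utf8ByteLength` is made up for illustration). It shows how many UTF-8 bytes a single character of LZW output occupies: literals in the range 128 to 255 already take two bytes, and dictionary codes take two or three. It does not pinpoint the bug; it only shows why any byte-oriented step on the PHP side would split characters.

```javascript
// Sketch: how many UTF-8 bytes one character of LZW output occupies.
const utf8 = new TextEncoder(); // TextEncoder always produces UTF-8

// Hypothetical helper, not part of the original code.
function utf8ByteLength(str) {
    return utf8.encode(str).length;
}

// 0x41 and 0xE9 are literal characters (< 256);
// 0x100 and 0x800 stand in for dictionary codes (>= 256).
const codePoints = [0x41, 0xE9, 0x100, 0x800];
const byteLengths = codePoints.map(
    (cp) => utf8ByteLength(String.fromCharCode(cp))
);

console.log(byteLengths); // [ 1, 2, 2, 3 ]
```

Each entry is still one character to `charCodeAt`, which is why the PHP port has to split and slice by character rather than by byte.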

"*if you use a long enough string you can see that it's not 100% accurate*" - an example string would be nice. — DCoder, Apr 29 '12 at 06:16
Sorry, here's an example where the output is not correct: http://pastebin.com/yEALY3zF — xd44, Apr 29 '12 at 06:20
I took the `lzw_encode` function from [here](http://stackoverflow.com/a/294421/1233508), encoded your correct output with it, ran it through your `decode` and got the right result back. The problem might be somewhere else - the input you're giving to `decode` might be in a charset other than UTF-8, or there might be a bug in your corresponding `encode`. — DCoder, Apr 29 '12 at 07:01
There already seems to be an LZW library for PHP. You can find it here: http://code.google.com/p/php-lzw/ — Attila Szeremi, Jul 13 '12 at 13:54
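
The round-trip check described in the comments can be sketched roughly as follows. This is an assumption-laden illustration, not the code from the linked answer: `lzwEncode` is a textbook LZW encoder written for this example, `lzwDecode` is the question's JavaScript decoder with the `debugger` statement dropped, and the sample string is made up. The scheme assumes every input character has a code below 256.

```javascript
// Hypothetical encoder: emits one UTF-16 code unit per LZW code.
function lzwEncode(s) {
    const dict = new Map();            // phrase -> code, codes start at 256
    const data = String(s).split("");
    if (data.length === 0) return "";
    const out = [];
    let phrase = data[0];
    let code = 256;
    for (let i = 1; i < data.length; i++) {
        const currChar = data[i];
        if (dict.has(phrase + currChar)) {
            phrase += currChar;        // keep extending a known phrase
        } else {
            out.push(phrase.length > 1 ? dict.get(phrase) : phrase.charCodeAt(0));
            dict.set(phrase + currChar, code++);
            phrase = currChar;
        }
    }
    out.push(phrase.length > 1 ? dict.get(phrase) : phrase.charCodeAt(0));
    return out.map((c) => String.fromCharCode(c)).join("");
}

// The decoder from the question, lightly renamed.
function lzwDecode(s) {
    const dict = {};                   // code -> phrase
    const data = String(s).split("");
    if (data.length === 0) return "";
    let currChar = data[0];
    let oldPhrase = currChar;
    const out = [currChar];
    let code = 256;
    for (let i = 1; i < data.length; i++) {
        const currCode = data[i].charCodeAt(0);
        let phrase;
        if (currCode < 256) {
            phrase = data[i];          // literal character
        } else {
            // Unknown code: the phrase being defined right now.
            phrase = dict[currCode] ? dict[currCode] : oldPhrase + currChar;
        }
        out.push(phrase);
        currChar = phrase.charAt(0);
        dict[code++] = oldPhrase + currChar;
        oldPhrase = phrase;
    }
    return out.join("");
}

// Made-up sample containing a non-ASCII character below 256.
const sample = "TOBEORNOTTOBEORTOBEORNOT caf\u00e9 caf\u00e9 caf\u00e9";
const encoded = lzwEncode(sample);
const decoded = lzwDecode(encoded);
console.log(decoded === sample);
```

If the same encoded string decodes correctly in JavaScript but not in the PHP `decode`, the difference most likely lies in how the string reaches PHP (its charset) or in the PHP string functions, rather than in the algorithm itself.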
