0

I'm trying to decode a string in a specific format (Hearthstone deck code), using PHP, like this:

AAEBAc2xAgjAAe0E7QX3DdYRh6wC8fsCoIADC8kDqwTLBPsMhRDH0wKW6AK0/ALNiQPXiQOfmwMA

or

AAEBAf0GBAD6DoyAA6CAAw37AZwCigbJB/gHlA+CEIUQrRDy0AL2/QKJgAPRgAMA

The specification (original description) is:

The datastring is a base64-encoded byte string.

Unless specified otherwise, every value that follows is an integer, encoded as an unsigned varint.

  1. Header block

    • Reserved byte 0x00
    • Version (1)
    • Format
  2. Data block
    The data block is split in four pairs of length + array in the following order:

    • Heroes
    • Single-copy cards
    • 2-copy cards
    • n-copy cards

Each pair has a leading varint specifying the number of items in the array. For the first three blocks, those are arrays of varints. For the last block, it is an array of pairs of varints. The goal of this structure is to make the datastring as compact as possible.

I have begun to put something together, but I'm a novice when it comes to handling raw bytes. My code is:

    // I found this to decode Variable-length quantity (varint)
    function vlq_decode(array $bytes) {
        $result = [];
        $integer = 0;
        foreach ($bytes as $byte) {
            if ($integer > 0xfffffff - 0x7f) {
                throw new OverflowException('The value exceeds the maximum allowed.');
            }
            $integer <<= 7;
            $integer |= 0x7f & $byte;

            if (($byte & 0x80) === 0) {
                $result[] = $integer;
                $integer = 0;
            }
        }
        if (($byte & 0x80) !== 0) {
            throw new InvalidArgumentException('Incomplete byte sequence.');
        }
        return $result;
    }

    $datastring = 'AAEBAc2xAgjAAe0E7QX3DdYRh6wC8fsCoIADC8kDqwTLBPsMhRDH0wKW6AK0/ALNiQPXiQOfmwMA';

    $raw = base64_decode($datastring);

    $byte_array = unpack('C*', $raw);

    $result = vlq_decode($byte_array);

    print_r($result);

The only thing I'm certain about is the base64_decode. I can't tell, if the unpack parameters are the correct ones, or whether the vlq_decode function works as intended, because I didn't write it myself.

On the original site there are reference implementations in Python and Javascript, but they are over my head and I couldn't use the code to make mine work.

Update:

The code does indeed produce an array that looks similar to what I'd expect, but many of the values don't seem correct. I'm thinking the conversion from varint is still somewhat off.

// this is the $result I get (wrong)
Array (
    [0] => 0 // this is always 0
    [1] => 1 // Version
    [2] => 1 // Format
    [3] => 1 // What follows is an array of length 1 (data block Heroes)
    [4] => 1267842
    [5] => 8 // What follows is an array of length 8 (data block single-copy cards)
    [6] => 8193
    [7] => 13956
    [8] => 13957
    [9] => 15245
    [10] => 11025
    [11] => 120322
    [12] => 1867138
    [13] => 524291
    [14] => 11 // What follows is an array of length 11 (data block 2-copy cards)
    [15] => 9347
    [16] => 5508
    [17] => 9604
    [18] => 15756
    [19] => 656
    [20] => 1173890
    [21] => 373762
    [22] => 867842
    [23] => 1262723
    [24] => 1426563
    [25] => 511363
    [26] => 0  // What follows is an array of length 0 (data block n-copy cards)
)

The Python implementation (Gist) produces different numbers, in a slightly different format, which match nicely to the database containing the data for the IDs (in the dbfId field)

// this is the expected (correct) $result
Array (
    [0] => 0
    [1] => 1
    [2] => 1
    [3] => 1
    [4] => 39117
    [5] => 8
    [6] => 192 
    [7] => 621 
    [8] => 749 
    [9] => 1783 
    [10] => 2262 
    [11] => 38407 
    [12] => 48625 
    [13] => 49184 
    [14] => 11
    [15] => 457 
    [16] => 555 
    [17] => 587 
    [18] => 1659 
    [19] => 2053 
    [20] => 43463 
    [21] => 46102 
    [22] => 48692 
    [23] => 50381 
    [24] => 50391 
    [25] => 52639
    [26] => 0
)

Any help is appreciated!

There already is a question about this topic, but it was badly written with no code samples, so I'm giving this another go.

sjakobi
  • 3,546
  • 1
  • 25
  • 43
oelna
  • 2,210
  • 3
  • 22
  • 40
  • Keep reading the spec. Your code works, the first two values are exactly what it says on the box, the third you look up from the referenced enum code [1 means "wild], the there's an array length [1] and then 1 item, an array length [8] and then 8 items, etc. – Sammitch Jun 10 '19 at 22:41
  • @Sammitch the result looks correct at first, but the numbers don't match the expected output. I've added both to my question, if you care to take another look. I believe the `varint` conversion still goes wrong at some point. – oelna Jun 11 '19 at 22:56
  • It could be an endian-ness problem – Sammitch Jun 12 '19 at 00:14

1 Answers1

2

It's an endian problem, aka you need to push the bits from each varint byte in reverse order. The clue is that the values below 128 make it through unmolested.

The below is an illustrative hack and should not be used in actual code:

str_split(decbin(1267842),7)

Yields:

array(3) {
  [0]=>
  string(7) "1001101"
  [1]=>
  string(7) "0110001"
  [2]=>
  string(7) "0000010"
}

Super convenient that it's already a multiple of 7 bits, but probably also a symptom of the endian problem.

Reverse, implode, convert back:

bindec(implode('', array_reverse(str_split(decbin(1267842),7))))

Yields:

int(39117)

I re-jiggered that function to be able to address this:

function vlq_decode(array $bytes, $swap_endian=false) {
    $result = [];
    $segments = [];
    foreach ($bytes as $byte) {
        if( $swap_endian ) {
            array_unshift($segments, 0x7f & $byte);
        } else {
            $segments[] = ( 0x7f & $byte );
        }

        if (($byte & 0x80) === 0) {
            $integer = 0;
            foreach($segments as $segment) {
                $integer <<= 7;
                $integer |= ( 0x7f & $segment );
            }
            $result[] = $integer;
            $segments = [];
        }
    }
    if (($byte & 0x80) !== 0) {
        throw new InvalidArgumentException('Incomplete byte sequence.');
    }
    return $result;
}

and then vlq_decode($byte_array, true); will get you what you want.

I pared out that bunk overflow code as it will never actually detect an actual one, and also hobbles you to 32-bit ints. If you do want to detect an overflow during decoding you need to count the bits you're unpacking, and that's just a pain in the ass :P

Sammitch
  • 30,782
  • 7
  • 50
  • 77