2

I have a problem understanding this code. What I know is that we have passed a code into a assembler that has converted code into "byte code". Now I have a Virtual machine that is supposed to read this code. This function is supposed to read the first byte code instruction. I don't understand what is happening in this code. I guess we are trying to read this byte code but don't understand how it is done.

static int32_t  bytecode_to_int32(const uint8_t *bytecode, size_t size)
{
    int32_t result;
    t_bool  sign;
    int     i;

    result = 0;
    sign = (t_bool)(bytecode[0] & 0x80);
    i = 0;
    while (size)
    {
        if (sign)
            result += ((bytecode[size - 1] ^ 0xFF) << (i++ * 8));
        else
            result += bytecode[size - 1] << (i++ * 8);
        size--;
    }
    if (sign)
        result = ~(result);
    return (result);
}
jps
  • 20,041
  • 15
  • 75
  • 79
  • 1
    To understand what the code tries to do you should look at the definition of the "Byte Code" and how it is stored in the memory/file after the Assembler is done. All in all this just copies Bytes into some 32 bit variable by applying some spcific rules. – David J Feb 06 '20 at 10:20
  • That's a really good point. It may not be correct in what it does or how it's implemented. – Owl Feb 06 '20 at 10:29
  • If the purpose of this code is to translate big-endian, network-byte-order values into host endianness, why not just use [the POSIX standard `ntohl()` function](https://pubs.opengroup.org/onlinepubs/9699919799/functions/htonl.html)? (Just be aware that you **can not** do something like `ntohl( *( (uint32_t * ) bytecode ) )` as that would be a strict aliasing violation and can result in failures such as `SIGSEGV` or `SIGBUS`.) – Andrew Henle Feb 06 '20 at 13:23

4 Answers4

4

This code is somewhat badly written, lots of operations on a single line and therefore containing various potential bugs. It looks brittle.

  • bytecode[0] & 0x80 Simply reads the MSB sign bit, assuming it's 2's complement or similar, then converts it to a boolean.
  • The loop iterates backwards from most significant byte to least significant.
  • If the sign was negative, the code will perform an XOR of the data byte with 0xFF. Basically inverting all bits in the data. The result of the XOR is an int.
  • The data byte (or the result of the above XOR) is then bit shifted i * 8 bits to the left. The data is always implicitly promoted to int, so in case i * 8 happens to give a result larger than INT_MAX, there's a fat undefined behavior bug here. It would be much safer practice to cast to uint32_t before the shift, carry out the shift, then convert to a signed type afterwards.
  • The resulting int is converted to int32_t - these could be the same type or different types depending on system.
  • i is incremented by 1, size is decremented by 1.
  • If sign was negative, the int32_t is inverted to some 2's complement negative number that's sign extended and all the data bits are inverted once more. Except all zeros that got shifted in with the left shift are also replaced by ones. If this is intentional or not, I cannot tell. So for example if you started with something like 0x0081 you now have something like 0xFFFF01FF. How that format makes sense, I have no idea.

My take is that the bytecode[size - 1] ^ 0xFF (which is equivalent to ~) was made to toggle the data bits, so that they would later toggle back to their original values when ~ is called later. A programmer has to document such tricks with comments, if they are anything close to competent.


Anyway, don't use this code. If the intention was merely to swap the byte order (endianess) of a 4 byte integer, then this code must be rewritten from scratch.

That's properly done as:

static int32_t big32_to_little32 (const uint8_t* bytes)
{
  uint32_t result = (uint32_t)bytes[0] << 24 | 
                    (uint32_t)bytes[1] << 16 | 
                    (uint32_t)bytes[2] <<  8 | 
                    (uint32_t)bytes[3] <<  0 ; 

  return (int32_t)result;
}

Anything more complicated than the above is highly questionable code. We need not worry about signs being a special case, the above code preserves the original signedness format.

Lundin
  • 195,001
  • 40
  • 254
  • 396
  • Which lines would you suggest breaking up into smaller statements? – Z4-tier Feb 06 '20 at 10:27
  • 2
    @Z4-tier Most of them. Mixing `++` with other operators is to play with fire, for example. I would have written something along the lines of: `n = i * 8;`, `result += (int32_t) ( (uint32_t)(bytecode[size - 1] ^ 0xFF) << n );`, `i++;`. And then we soon discover that the loop could probably be written as `for(int i=0; i – Lundin Feb 06 '20 at 10:29
  • The order is well defined, but I agree it's horrible practice. The increment doesn't happen until after the line. – Owl Feb 06 '20 at 10:33
  • 2
    @Owl Well-defined until patch 2.0, when the poor sod maintaining this was confused by the unreadable code, but went in to change the indices: `(bytecode[size-1-i] ^ 0xFF) << (i++ * 8));`. And oops, very subtle undefined behavior bug that will be hard to track down. This bug was created by the _original_ programmer, not the maintainer. – Lundin Feb 06 '20 at 10:36
  • @Lundin yes so true that! – Owl Feb 06 '20 at 10:37
  • Anyway, don't use this code. If the intention was merely to swap the byte order of a 4 byte integer, then the code must be rewritten from scratch. That's properly done as `uint32_t result = (uint32_t)byte[0] << 24 | (uint32_t)byte[1] << 16 | (uint32_t)byte[2] << 8 | (uint32_t)byte[3] << 0; ... return (int32_t)result;`. And that's it. – Lundin Feb 06 '20 at 11:56
  • @Lundin If you start with 0x00, 0x81 then you get 0xFFFF8100. Which is the stored value sign extended. – Goswin von Brederlow Feb 06 '20 at 12:08
  • 1
    I think the whole point of the complex bit manipulations is to sign extend the value to 32bit no matter what size is given. It can read 1, 2, 3 or 4 byte values. Everything larger gives garbage. Although size 4 has undefined behavior but will work anyway since it can cause signed integer overflows. – Goswin von Brederlow Feb 06 '20 at 12:10
  • @Lundin Thank you ! It works very good with your code.. Unfortunately I don't understand pretty good what you did.. – gregouz1995 Feb 06 '20 at 14:20
2

So the A^0xFF toggles the bits set in A, so if you have 10101100 xored with 11111111.. it will become 01010011. I am not sure why they didn't use ~ here. The ^ is a xor operator, so you are xoring with 0xFF.

The << is a bitshift "up" or left. In other words, A<<1 is equivalent to multiplying A by 2.

the >> moves down so is equivalent to bitshifting right, or dividing by 2.

The ~ inverts the bits in a byte.

Note it's better to initialise variables at declaration it costs no additional processing whatsoever to do it that way.

sign = (t_bool)(bytecode[0] & 0x80); the sign in the number is stored in the 8th bit (or position 7 counting from 0), which is where the 0x80 is coming from. So it's literally checking if the signed bit is set in the first byte of bytecode, and if so then it stores it in the sign variable.

Essentially if it's unsigned then it's copying the bytes from from bytecode into result one byte at a time.

If the data is signed then it flips the bits then copies the bytes, then when it's done copying, it flips the bits back.

Personally with this kind of thing i prefer to get the data, stick in htons() format (network byte order) and then memcpy it to an allocated array, store it in a endian agnostic way, then when i retrieve the data i use ntohs() to convert it back to the format used by the computer. htons() and ntohs() are standard C functions and are used in networking and platform agnostic data formatting / storage / communication all the time.

Owl
  • 1,446
  • 14
  • 20
  • It's a xor operator I believe, when you xor a number against 0xFF it should flip the bits. – Owl Feb 06 '20 at 10:17
1

This function is a very naive version of the function which converts form the big endian to little endian.

The parameter size is not needed as it works only with the 4 bytes data.

It can be much easier archived by the union punning (and it allows compilers to optimize it - in this case to the simple instruction):

#define SWAP(a,b,t)    do{t c = (a); (a) = (b); (b) = c;}while(0)

int32_t my_bytecode_to_int32(const uint8_t *bytecode)
{
    union 
    {
        int32_t i32;
        uint8_t b8[4];
    }i32;
    uint8_t b;

    i32.b8[3] = *bytecode++;
    i32.b8[2] = *bytecode++;
    i32.b8[1] = *bytecode++;
    i32.b8[0] = *bytecode++;

    return i32.i32;
}

int main()
{
    union {
        int32_t i32;
        uint8_t b8[4];
    }i32;
    uint8_t b;


    i32.i32 = -4567;
    SWAP(i32.b8[0], i32.b8[3], uint8_t);
    SWAP(i32.b8[1], i32.b8[2], uint8_t);

    printf("%d\n", bytecode_to_int32(i32.b8, 4));


    i32.i32 = -34;
    SWAP(i32.b8[0], i32.b8[3], uint8_t);
    SWAP(i32.b8[1], i32.b8[2], uint8_t);

    printf("%d\n", my_bytecode_to_int32(i32.b8));
}

https://godbolt.org/z/rb6Na5

0___________
  • 60,014
  • 4
  • 34
  • 74
  • 1
    If the purpose was to convert endianess, then the original code is plain horrible and invokes UB for signed numbers. – Lundin Feb 06 '20 at 11:34
  • 1
    @Lundin it is exactly what it does. "Arduino style" code and as most of it (the code usually is written by the hobbyists) is unfortunately horrible – 0___________ Feb 06 '20 at 11:39
  • 1
    Arduino C implementations lack [`ntohl()`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/htonl.html)? – Andrew Henle Feb 06 '20 at 13:21
0

If the purpose of the code is to sign-extend a 1-, 2-, 3-, or 4-byte sequence in network/big-endian byte order to a signed 32-bit int value, it's doing things the hard way and reimplementing the wheel along the way.

This can be broken down into a three-step process: convert the proper number of bytes to a 32-bit integer value, sign-extend bytes out to 32 bits, then convert that 32-bit value from big-endian to the host's byte order.

The "wheel" being reimplemented in this case is the the POSIX-standard ntohl() function that converts a 32-bit unsigned integer value in big-endian/network byte order to the local host's native byte order.

The first step I'd do is to convert 1, 2, 3, or 4 bytes into a uint32_t:

#include <stdint.h>
#include <limits.h>
#include <arpa/inet.h>
#include <errno.h>

// convert the `size` number of bytes starting at the `bytecode` address
// to a uint32_t value
static uint32_t bytecode_to_uint32( const uint8_t *bytecode, size_t size )
{
    uint32_t result = 0;

    switch ( size )
    {
    case 4:
        result = bytecode[ 0 ] << 24;
    case 3:
        result += bytecode[ 1 ] << 16;
    case 2:
        result += bytecode[ 2 ] << 8;
    case 1:
        result += bytecode[ 3 ];
        break;
    default:
        // error handling here
        break;
    }

    return( result );
}

Then, sign-extend it (borrowing from this answer):

static uint32_t sign_extend_uint32( uint32_t in, size_t size );
{
    if ( size == 4 )
    {
        return( in );
    }

    // being pedantic here - the existence of `[u]int32_t` pretty
    // much ensures 8 bits/byte
    size_t bits = size * CHAR_BIT;

    uint32_t m = 1U << ( bits - 1 );

    uint32_t result = ( in ^ m ) - m;
    return ( result );
}

Put it all together:

static int32_t  bytecode_to_int32( const uint8_t *bytecode, size_t size )
{
    uint32_t result = bytecode_to_uint32( bytecode, size );

    result = sign_extend_uint32( result, size );

    // set endianness from network/big-endian to
    // whatever this host's endianness is
    result = ntohl( result );

    // converting uint32_t here to signed int32_t
    // can be subject to implementation-defined
    // behavior
    return( result );
}

Note that the conversion from uint32_t to int32_t implicitly performed by the return statement in the above code can result in implemenation-defined behavior as there can be uint32_t values that can not be mapped to int32_t values. See this answer.

Any decent compiler should optimize that well into inline functions.

I personally think this also needs much better error handling/input validation.

Andrew Henle
  • 32,625
  • 3
  • 24
  • 56