3

I am working on translating a system from python to c++. I need to be able to perform actions in c++ that are generally performed by using Python's struct.unpack (interpreting binary strings as numerical values). For integer values, I am able to get this to (sort of) work, using the data types in stdint.h:

struct.unpack("i", str) ==> *(int32_t*) str; //str is a char* containing the data

This works properly for little-endian binary strings, but fails on big-endian binary strings. Basically, I need an equivalent to using the > tag in struct.unpack:

struct.unpack(">i", str) ==> ???

Please note, if there is a better way to do this, I am all ears. However, I cannot use c++11, nor any 3rd party libraries other than Boost. I will also need to be able to interpret floats and doubles, as in struct.unpack(">f", str) and struct.unpack(">d", str), but I'll get to that when I solve this.

NOTE I should point out that the endianness of my machine is irrelevant in this case. I know that the bitstream I receive in my code will ALWAYS be big-endian, and that's why I need a solution that will always cover the big-endian case. The article pointed out by BoBTFish in the comments seems to offer a solution.

ewok
  • An interesting read: http://commandcenter.blogspot.co.uk/2012/04/byte-order-fallacy.html – BoBTFish Dec 13 '12 at 16:08
  • @BoBTFish are you saying that my code is "wrong or misguided", or pointing out the solution offered in 4th paragraph? – ewok Dec 13 '12 at 16:13
  • Neither really. Well maybe the second one. Just pointing to an article that discusses this that I found interesting. I don't really feel qualified to offer a proper answer, but no one else was saying anything at all. – BoBTFish Dec 13 '12 at 16:17

5 Answers

7

For 32 and 16-bit values:

This is exactly the problem you have for network data, which is big-endian. You can use ntohl to turn a 32-bit value into host order (little-endian in your case), and ntohs for a 16-bit value.

The ntohl() function converts the unsigned integer netlong from network byte order to host byte order.

int res = ntohl(*((int32_t*) str));

This also takes care of the case where your host is big-endian: there, ntohl simply does nothing.
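A minimal sketch of the same idea that copies the bytes out with memcpy instead of dereferencing a cast pointer, so unaligned buffers are not a problem. The helper names unpack_be_i32 and unpack_be_i16 are made up for illustration, and the header <arpa/inet.h> assumes a POSIX system (on Windows the same functions come from <winsock2.h>).

```cpp
#include <arpa/inet.h>  // ntohl, ntohs (POSIX; assumption: not building on Windows)
#include <stdint.h>
#include <cassert>
#include <cstring>

// Hypothetical helper: rough equivalent of struct.unpack(">i", buf).
int32_t unpack_be_i32(const char* buf) {
    assert(buf != 0);
    uint32_t raw;
    std::memcpy(&raw, buf, sizeof raw);       // copy avoids unaligned or aliased reads
    // Conversion to signed assumes a two's complement host.
    return static_cast<int32_t>(ntohl(raw));
}

// Hypothetical helper: rough equivalent of struct.unpack(">h", buf).
int16_t unpack_be_i16(const char* buf) {
    assert(buf != 0);
    uint16_t raw;
    std::memcpy(&raw, buf, sizeof raw);
    return static_cast<int16_t>(ntohs(raw));
}
```

For the unsigned formats (">I", ">H") the same code should work with the final cast dropped.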

For 64-bit values

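On Linux/BSD there are non-standard functions for this: take a look at "64 bit ntohl() in C++?", which points to htobe64 (its counterpart for the big-endian-to-host direction, at least on glibc, is be64toh).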

These functions convert the byte encoding of integer values from the byte order that the current CPU (the "host") uses, to and from little-endian and big-endian byte order.

For Windows, try "How do I convert between big-endian and little-endian values in C++?"

That question points to _byteswap_uint64, as well as 16- and 32-bit solutions and the gcc-specific __builtin_bswap32 and __builtin_bswap64 calls.

Other Sizes

Most systems don't have values that aren't 16/32/64 bits long. At that point I might try to store it in a 64-bit value, shift it and then translate. I'd write some good tests. I suspect this is an uncommon situation and more details would help.
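As one way to read "store it in a wider value, then translate", here is a sketch for a 24-bit big-endian signed value. It pads into a 32-bit field rather than a 64-bit one to keep ntohl usable. The name unpack_be_i24 is hypothetical, and <arpa/inet.h> again assumes a POSIX system.

```cpp
#include <arpa/inet.h>  // ntohl (POSIX; assumption: not building on Windows)
#include <stdint.h>
#include <cassert>
#include <cstring>

// Hypothetical helper: reads a 3-byte big-endian signed integer.
int32_t unpack_be_i24(const char* buf) {
    assert(buf != 0);
    unsigned char padded[4] = {0, 0, 0, 0};
    std::memcpy(padded + 1, buf, 3);        // right-align the 3 bytes in a 4-byte big-endian field
    uint32_t raw;
    std::memcpy(&raw, padded, sizeof raw);
    uint32_t v = ntohl(raw);                // now 0x00XXXXXX in host order
    if (v & 0x800000u) v |= 0xFF000000u;    // sign-extend from bit 23
    // Conversion to signed assumes a two's complement host.
    return static_cast<int32_t>(v);
}
```

Other odd widths would follow the same pattern with a different padding offset and sign bit, and as the answer says, they deserve their own tests.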

Paul Rubel
  • 2 things: 1) note that you need to dereference the casted pointer for that to compile: `ntohl(*(int32_t*) str);`. 2) how will I handle values other than 16 and 32 bit ints? I need to be able to cover everything from 8 to 64 bit ints, both signed and unsigned. – ewok Dec 13 '12 at 19:41
  • Thanks for the fix, it's added. Also tried to address the other size issue. – Paul Rubel Dec 13 '12 at 22:37
4

Unpack the string one byte at a time.

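unsigned char *str;   // must point at the four big-endian bytes
unsigned int result;

// Cast before shifting so the shift happens on an unsigned value.
result =  (unsigned int)*str++ << 24;
result |= (unsigned int)*str++ << 16;
result |= (unsigned int)*str++ << 8;
result |= *str++;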
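The same byte-at-a-time idea can be written once for every width from 8 to 64 bits. This is only a sketch: read_be is a made-up name, it is meant for unsigned types, and signed results are obtained by casting afterwards (which assumes a two's complement host). It avoids C++11 and does not depend on the host's byte order.

```cpp
#include <stdint.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper: reads sizeof(T) big-endian bytes into an unsigned
// integer type T (uint8_t, uint16_t, uint32_t or uint64_t).
template <typename T>
T read_be(const unsigned char* p) {
    assert(p != 0);
    T v = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i)
        v = static_cast<T>((v << 8) | p[i]);  // shift in one byte, most significant first
    return v;
}
```

For example, struct.unpack(">i", str) would roughly become static_cast<int32_t>(read_be<uint32_t>(p)), where p is the buffer viewed as unsigned char*.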
Benjamin Loison
Robᵩ
  • This works for integers with 32 bits. Modify with more or fewer <<'s as needed for 16 and 64 bit values. – Wes Miller Dec 13 '12 at 16:24
  • It doesn't have a point. I suppose it would have a point if `str` were a `signed char*`. I've removed them. – Robᵩ Dec 13 '12 at 19:07
  • @Robᵩ if str is signed, how will the `& 0xff` help? I've realized that I am reading in the data as `char*`, but it doesn't convert properly unless I cast it to `unsigned char*`. Is there a problem just doing it that way? – ewok Dec 13 '12 at 19:43
  • @ewok `unsigned char *` is definitely the way to go for binary data. You may unfortunately then have to sprinkle casts around for calling library functions that take plain `char *`, but it's worth it. – zwol Dec 13 '12 at 22:47
2

First, the cast you're doing:

char *str = ...;
int32_t i = *(int32_t*)str;

results in undefined behavior due to the strict aliasing rule (unless str is initialized with something like int32_t x; char *str = (char*)&x;). In practical terms that cast can result in an unaligned read which causes a bus error (a crash) on some platforms and slow performance on others.

Instead you should be doing something like:

int32_t i;
std::memcpy(&i, str, sizeof(i));

There are a number of functions for swapping bytes between the host's native byte ordering and a host-independent (network, i.e. big-endian) ordering: ntoh*() and hton*(), where * is l or s for 32-bit and 16-bit values respectively. Since different hosts may have different byte orderings, this may be what you want to use if the data you're reading uses a consistent serialized form on all platforms.

i = ntohl(i);

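You can also manually move bytes around in str before copying it into the integer (this reverses the bytes unconditionally, so it is only correct on a little-endian host, and it modifies the buffer).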

std::swap(str[0],str[3]);
std::swap(str[1],str[2]);
std::memcpy(&i,str,sizeof(i));

Or you can manually manipulate the integer's value using shifts and bitwise operators.

std::memcpy(&i,str,sizeof(i));
i = (i&0xFFFF0000)>>16 | (i&0x0000FFFF)<<16;
i = (i&0xFF00FF00)>>8  | (i&0x00FF00FF)<<8;
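The question also asks about ">f" and ">d". A sketch of one way to handle them: assemble the big-endian bytes into an unsigned integer of the same width with shifts, then memcpy the bit pattern into the floating-point variable. The function names are hypothetical, and the code assumes float and double are 32-bit and 64-bit IEEE 754 types stored with the same byte order as the host's integers.

```cpp
#include <stdint.h>
#include <cassert>
#include <cstring>

// Hypothetical helper: rough equivalent of struct.unpack(">f", p).
float unpack_be_float(const unsigned char* p) {
    assert(sizeof(float) == sizeof(uint32_t));  // assumption: 32-bit float
    uint32_t bits = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                    (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
    float f;
    std::memcpy(&f, &bits, sizeof f);           // reinterpret the bit pattern, no aliasing cast
    return f;
}

// Hypothetical helper: rough equivalent of struct.unpack(">d", p).
double unpack_be_double(const unsigned char* p) {
    assert(sizeof(double) == sizeof(uint64_t)); // assumption: 64-bit double
    uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | p[i];              // most significant byte first
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

Because the integer is built with shifts, nothing here depends on whether the host is little- or big-endian.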
bames53
0

This falls in the realm of bit twiddling.

for (i=0;i<sizeof(struct foo);i++) dst[i] = src[i ^ mask]; 

where mask == sizeof(type) - 1 if the stored and native endianness differ, and 0 if they match.

With this technique one can convert a struct to bit masks:

 struct foo {
    byte a,b;       //  mask = 0,0
    short e;        //  mask = 1,1
    int g;          //  mask = 3,3,3,3,
    double i;       //  mask = 7,7,7,7,7,7,7,7
 } s; // note that all members must be aligned according to their native size

Again these masks can be encoded with two bits per symbol: (1<<n)-1, meaning that in 64-bit machines one can encode necessary masks of a 32 byte sized struct in a single constant (with 1,2,4 and 8 byte alignments).

unsigned int mask = 0xffffaa50;  // or zero if the endianness matches
for (i=0;i<16;i++) { 
     dst[i]=src[i ^ ((1<<(mask & 3))-1)]; mask>>=2;
}
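A compilable sketch of the loop above, split so the index computation can be checked on its own. The names source_index and swap_fields are made up here, and the constant 0xffffaa50 is the one given for struct foo (two bits per byte, lowest bits first).

```cpp
#include <stdint.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper: index of the source byte that ends up at position i.
// 'codes' packs two bits per byte; code n means the byte belongs to an
// aligned field of size 1 << n, so the XOR mask is (1 << n) - 1.
std::size_t source_index(std::size_t i, uint32_t codes) {
    assert(i < 16);  // 16 bytes * 2 bits fill one 32-bit constant
    unsigned n = (codes >> (2 * i)) & 3u;
    return i ^ ((std::size_t(1) << n) - 1);
}

// Hypothetical helper: copies len bytes, reversing the bytes of every field.
void swap_fields(unsigned char* dst, const unsigned char* src,
                 std::size_t len, uint32_t codes) {
    for (std::size_t i = 0; i < len; ++i)
        dst[i] = src[source_index(i, codes)];
}
```

With codes == 0 every mask is zero and swap_fields degenerates into a plain copy, which matches the "zero if the endianness matches" case.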
Aki Suihkonen
-1

If your values as received are truly strings (char* or std::string) and you know their format, then sscanf() and atoi() (well, really the whole ato*() family) will be your friends. They take well-formatted strings and convert them according to the formats you pass in (a kind of reverse printf).

Wes Miller