5

I have a string of 256*4 bytes of data. These 256*4 bytes need to be converted into 256 unsigned integers. The bytes arrive in little-endian order, i.e. the first four bytes in the string are the little-endian representation of the first integer, the next four bytes are the little-endian representation of the next integer, and so on.

What is the best way to parse through this data and merge these bytes into unsigned integers? I know I have to use bitshift operators but I don't know in what way.

Yu Hao
user0123
  • "but i don't know in what way" - you read up on how shifting operators work and hopefully you will instantly know how. –  Jul 29 '13 at 05:09
  • The string is just passed through via a redirected file. The first 256*4 bytes are the little-endian encodings of 256 unsigned integers. I need to convert each group of 4 bytes into an unsigned integer and store it in an array. What I don't know is how to merge each set of 4 bytes into an unsigned int. – user0123 Jul 29 '13 at 05:10
  • @user0123 `byte0 | (byte1 << CHAR_BIT) | (byte2 << 2 * CHAR_BIT) | (byte3 << 3 * CHAR_BIT)`... –  Jul 29 '13 at 05:10
  • @H2CO3 - I have read far and wide on google about how the bitshifting operators work, including the & and | operators. I am still extremely confused on how to merge 4 bytes into an unsigned int – user0123 Jul 29 '13 at 05:10
  • @user0123 Just like my comment above ^^ explains it. –  Jul 29 '13 at 05:11
  • @H2CO3 - Can you explain that code a little bit? I am pretty confused by it, what is CHAR_BIT? – user0123 Jul 29 '13 at 05:11
  • @user0123 Googled it? (Nah...) It's a macro from `<limits.h>` which expands to the number of bits in a byte on your platform. –  Jul 29 '13 at 05:12
  • @H2CO3 - sorry i am still kind of confused. Can you explain your code a little more? I really appreciate the input. How does 'or'ing all four bytes together give me an unsigned integer? – user0123 Jul 29 '13 at 05:14
  • @H2CO3 -- shifting by CHAR_BIT is incorrect. On some systems CHAR_BIT is 16. Here, the OP specifically said they were bytes, not chars. And he wants them in 4-byte multiples, that is 32 bits. The correct shift is 8. – Nitzan Shaked Jul 29 '13 at 05:16
  • @user0123 It doesn't OR all four bytes together. It ORs the first byte, the second byte shifted to the left 8 (or whatever) places, etc. Write it down on a piece of paper and you'll see why this works. –  Jul 29 '13 at 05:17
  • @NitzanShaked A `char` is always a byte. It's just that they need not be 8 bits long. You are confusing "byte" with "octet". –  Jul 29 '13 at 05:18
  • Ahh, I see, but when you shift the second, third, and fourth bytes by that much, why wouldn't the data just fall off? I thought shifting maintains the number of bits. – user0123 Jul 29 '13 at 05:18
  • @user0123 Ah, I see what you mean... Indeed, because of the "usual arithmetic conversions" (or whatever it's called in the Standard exactly), in an expression `unsigned char << int`, the char is promoted (implicitly converted) to `unsigned int` (or is it `int`? Somebody who speaks C++ better, please confirm this!), so you will be getting the expected result. –  Jul 29 '13 at 05:22
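
To make the last two comments concrete, here is a minimal sketch of why the shifted bytes do not "fall off". It assumes 8-bit bytes and uses `uint32_t` from `<stdint.h>` as the target type; the helper name `merge_le32` is hypothetical and not from the thread. In C, an `unsigned char` operand is promoted to `int` before a shift (when `int` can hold all its values), so the shift is already performed in a wider type. The explicit casts below go one step further and make the shift happen in an unsigned 32-bit type, which also keeps the `<< 24` away from the sign bit.

```c
#include <stdint.h>

/* Hypothetical helper: combine four little-endian bytes into one 32-bit value.
 * Each byte is widened to uint32_t BEFORE it is shifted, so the shift is done
 * in a 32-bit unsigned type and no bits are lost. */
uint32_t merge_le32(unsigned char b0, unsigned char b1,
                    unsigned char b2, unsigned char b3)
{
    return  (uint32_t)b0            /* bits  0..7  */
         | ((uint32_t)b1 << 8)      /* bits  8..15 */
         | ((uint32_t)b2 << 16)     /* bits 16..23 */
         | ((uint32_t)b3 << 24);    /* bits 24..31 */
}
```

For example, the bytes 0x78, 0x56, 0x34, 0x12 (in that order) combine to 0x12345678, because each byte lands in its own non-overlapping group of eight bits before the ORs merge them.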

3 Answers

6

Hope this helps you

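unsigned int arr[256];
unsigned char ch[256*4] = "your string";   /* unsigned char: no sign extension */
for (int i = 0, k = 0; i < 256*4; i += 4, k++)
{
    /* widen each byte to unsigned int before shifting it into place */
    arr[k] = (unsigned int)ch[i]
           | (unsigned int)ch[i+1] << 8
           | (unsigned int)ch[i+2] << 16
           | (unsigned int)ch[i+3] << 24;
}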
Saksham
  • what if host system is big endian? – fatihk Jul 29 '13 at 05:23
  • 2
    @thomas This approach is *endian-agnostic*. It does not matter what the host system's endianness is. – jamesdlin Jul 29 '13 at 05:27
  • @Saksham Don't believe the false positive ;) –  Jul 29 '13 at 05:28
  • @thomas If the host system is big-endian, then this will work just like it would work on a little-endian system. (And not doing what OP wanted, and invoking undefined behavior by shifting stuff into the sign bit of the resulting integer...) –  Jul 29 '13 at 05:28
  • This worked perfectly! and i believe it is hardware independent. Thanks so much! – user0123 Jul 29 '13 at 05:36
  • @Saksham, http://stackoverflow.com/questions/1001307/detecting-endianness-programmatically-in-a-c-program – fatihk Jul 29 '13 at 05:36
  • @user0123 No, it doesn't "work perfectly". You wanted unsigned integers, this gives you signed integers, and in addition, it invokes undefined behavior. –  Jul 29 '13 at 05:37
  • 1
    @Saksham This solution is almost correct, you should just change `int` to `unsigned int`. (Edit: done by OP, +1.) –  Jul 29 '13 at 05:39
  • @H2CO3, then do we need to assume that integer to char conversion is done in a similar bit shift operation? – fatihk Jul 29 '13 at 06:04
  • @thomas Sorry, I don't understand, what do you mean by that? Here we don't need to assume anything - as you can see in the code, the conversion **is** done using the required bitwise operations. –  Jul 29 '13 at 06:41
  • @H2CO3, on the sender side, we have an contiguous array of unsigned integers and during conversion to char pointers, if we just cast integer array starting location to char * and send this char array then bit shift conversion may not work on the receiver side because endianness information is not handled in such a case. – fatihk Jul 29 '13 at 06:47
  • @thomas Huh? I think you should read the question more carefully, or maybe I am miserably misinterpreting it. "I have a string of 256*4 bytes of data. These 256* 4 bytes need to be converted into 256 unsigned integers. **The order in which they come is little endian.**" (emphasis mine) –  Jul 29 '13 at 06:49
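
Pulling the points from this comment thread together, here is a hedged sketch of a buffer-wide conversion that accepts plain `char` input. The function name `decode_le32` and its parameters are assumptions for illustration. It reads the data through an `unsigned char` pointer so that bytes of 0x80 and above are not sign-extended on platforms where `char` is signed, and it shifts in an unsigned type so nothing is shifted into a sign bit. Because it works on values rather than on memory layout, the result does not depend on the host's endianness.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode `count` little-endian 32-bit values from `src`
 * into `dst`. `src` must hold at least 4 * count bytes. */
void decode_le32(const char *src, uint32_t *dst, size_t count)
{
    /* View the input as unsigned bytes to avoid sign extension. */
    const unsigned char *p = (const unsigned char *)src;

    for (size_t k = 0; k < count; k++, p += 4) {
        dst[k] =  (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 24);
    }
}
```

For the question's data this would be called with a count of 256 and a `uint32_t` array of 256 elements as the destination.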
4

Alternatively, we can use C/C++ casting to interpret a char buffer as an array of unsigned int. This avoids the shifting, but the result then depends on the host's endianness, and the cast is only safe if the buffer is suitably aligned for unsigned int.

#include <stdio.h>
int main()
{
    char buf[256*4] = "abcd";
    unsigned int *p_int = ( unsigned int * )buf;
    unsigned short idx = 0;
    unsigned int val = 0;
    for( idx = 0; idx < 256; idx++ )
    {
        val = *p_int++;
        printf( "idx = %d, val = %u \n", idx, val );  /* %u: val is unsigned */
    }
}

This would print out 256 values, the first one is idx = 0, val = 1684234849 (and all remaining numbers = 0).

As a side note, "abcd" converts to 1684234849 because the code was run on x86 (little-endian), where "abcd" is read as 0x64636261: 'a' is 0x61 and 'd' is 0x64, and on a little-endian machine the least significant byte sits at the lowest address. So 0x64636261 = 1684234849.
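
The host dependence described above can be checked with a small sketch. Both function names are hypothetical, and `memcpy` is used here in place of the pointer cast, since it reads the same bytes in host order without any alignment requirement on the buffer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);   /* inspect the byte at the lowest address */
    return first == 1;
}

/* Hypothetical helper: read four bytes in HOST byte order, which is what the
 * pointer cast does, but without alignment or aliasing concerns. */
uint32_t load_host_order(const char *bytes)
{
    uint32_t v;
    memcpy(&v, bytes, sizeof v);
    return v;
}
```

On a little-endian host, `load_host_order("abcd")` gives 0x64636261; on a big-endian host the same call gives 0x61626364, which is why this approach only matches the question's requirement on little-endian machines.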

Note also, if using C++, reinterpret_cast should be used in this case:

const char *p_buf = "abcd";
const unsigned int *p_int = reinterpret_cast< const unsigned int * >( p_buf );
artm
0

Read the data 4 bytes at a time, shift each byte into place, and OR the results into an unsigned int:

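unsigned char bytes[4] = "....";   /* valid C; C++ needs size 5 or a brace list */
unsigned int i = (unsigned int)bytes[0] | ((unsigned int)bytes[1] << 8) | ((unsigned int)bytes[2] << 16) | ((unsigned int)bytes[3] << 24);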

On a big-endian host the same expression works unchanged, because shifts operate on values rather than on memory layout; no byte reversal is needed. Reversing the indexes of bytes[] from 0-3 to 3-0 is only needed when the input data itself is big-endian.

But you don't even need to do that: if your host is little-endian and unsigned int is 4 bytes wide, just copy the whole char array into the int array.

#include <string.h>   /* memcpy */
#define LEN 256
char bytes[LEN*4] = "blahblahblah";
unsigned int uint[LEN];
memcpy(uint, bytes, sizeof bytes);
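
Since the copy relies on assumptions about the host, a guarded variant might look like the sketch below. The function name `copy_le_words` and its return convention are assumptions for illustration. It checks that unsigned int is four 8-bit bytes before copying; it still assumes a little-endian host, which it does not verify.

```c
#include <limits.h>
#include <string.h>

#define LEN 256

/* Hypothetical helper: copy LEN little-endian 32-bit words from src to dst.
 * Returns 0 on success, or -1 if unsigned int is not four 8-bit bytes, in
 * which case a plain memcpy would not line the words up correctly.
 * ASSUMPTION: the host is little-endian; this is not checked here. */
int copy_le_words(unsigned int dst[LEN], const char src[LEN * 4])
{
    if (sizeof(unsigned int) != 4 || CHAR_BIT != 8)
        return -1;

    memcpy(dst, src, LEN * 4);
    return 0;
}
```

When the check fails, or when the host's byte order is unknown, the shift-based conversion remains the portable fallback.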

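That said, you can avoid copying altogether by using the same storage for both types through a union. Reading a union member other than the one last written is allowed in C, but it is undefined behavior in standard C++, and the result still depends on the host's endianness.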

union
{
    char bytes[LEN*4];
    unsigned int uint[LEN];
} myArrays;

// copy data to myArrays.bytes[], do something with those bytes if necessary
// after populating myArrays.bytes[], get the ints by myArrays.uint[i]
phuclv