
I am trying to extract a character's value from its UTF-8 encoding. Suppose I have a two-byte sequence, and I extract 5 bits from the first byte => 10111 and 6 bits from the second byte => 010000

so, in binary:

ch1 = 10111;
ch2 = 010000;

How would I combine them to form 10111010000 and output it in hex as 0x5d0? Do I need to shift, or is there an easier way to do this? Checking the documentation, write appears to be able to output characters sequentially; is there a similar function for this? Also, it appears I would need a char buffer, since 10111010000 is 11 bits long and won't fit in a single char. Does anyone know how to go about this?
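In other words, roughly this (a sketch: the byte values 0xD7 0x90 are the UTF-8 encoding of U+05D0, which matches the bit patterns above, and 0x1F / 0x3F are the standard payload masks for a 2-byte sequence):

#include <iostream>

int main() {
    unsigned char b1 = 0xD7, b2 = 0x90;  // UTF-8 encoding of U+05D0
    unsigned int ch1 = b1 & 0x1F;        // 10111  (5 payload bits of the lead byte)
    unsigned int ch2 = b2 & 0x3F;        // 010000 (6 payload bits of the continuation byte)
    std::cout << std::hex << ch1 << ' ' << ch2 << '\n';  // prints: 17 10
    // goal: combine these into 10111010000 and print it as 0x5d0
}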


4 Answers


You need to use shifting, plus the | or |= operator.

unsigned int ch3 = (ch1 << 6) | ch2;
// ch3 = 0000010111010000

I'm assuming here that an unsigned int is 16 bits. Your mileage may vary.

Maxpm
  • I need up to 21 bits to read the largest UTF-8 sequence. How would I do that? – Mark Jul 01 '11 at 07:46
  • And then, to print in hex, `std::cout << std::showbase << std::hex;` – juanchopanza Jul 01 '11 at 07:52
  • @Mark I'd look into [`std::bitset`](http://www.cplusplus.com/reference/stl/bitset/). Alternatively, you can use an `unsigned long int`, which is guaranteed to be at least 32 bits. – Maxpm Jul 01 '11 at 07:53
  • @Mark, you can check the size of unsigned int with `sizeof`. On my platform, in C++, unsigned int is 4 bytes (strictly speaking, 4 chars), which is 32 bits, so it is OK for your purposes. Or do you need to combine up to 2*21 bits? – juanchopanza Jul 01 '11 at 07:56
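For the 21-bit case raised in these comments, a sketch (not from the answer above): a 4-byte UTF-8 sequence carries 3 payload bits in the lead byte (mask 0x07) and 6 in each continuation byte (mask 0x3F), 21 bits in total, which fits in a fixed-width uint32_t. The byte values here are the UTF-8 encoding of U+1F600:

#include <cstdint>
#include <iostream>

int main() {
    unsigned char b[4] = { 0xF0, 0x9F, 0x98, 0x80 };  // UTF-8 encoding of U+1F600
    std::uint32_t cp = (std::uint32_t(b[0] & 0x07) << 18)
                     | (std::uint32_t(b[1] & 0x3F) << 12)
                     | (std::uint32_t(b[2] & 0x3F) << 6)
                     |  std::uint32_t(b[3] & 0x3F);
    std::cout << std::showbase << std::hex << cp << '\n';  // prints: 0x1f600
}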

You will definitely need to use shift and OR.

First, declare an unsigned integer type of the right size. I like the C99 types defined in stdint.h, but your C++ compiler may not have them. If you don't have uint16_t then you can use unsigned short, which is at least 16 bits wide and can therefore hold your 11 bits.

Then you would figure out which bits go into the high bits. It looks like it should be:

unsigned short ch1 = 0x17;
unsigned short ch2 = 0x10;
unsigned short result = (ch1 << 6) | ch2;
Zan Lynx
  • The largest extraction takes up to 21 bits. Do I need a char buffer[]? – Mark Jul 01 '11 at 07:48
  • @Mark, no, see this thread: http://stackoverflow.com/questions/589575/c-size-of-int-long-etc. According to that, the standard requires `unsigned long` to be at least 32 bits. – juanchopanza Jul 01 '11 at 08:00
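A compilable version of this answer's snippet (a sketch, assuming <cstdint> is available; unsigned short would work the same way if it isn't):

#include <cstdint>
#include <iostream>

int main() {
    std::uint16_t ch1 = 0x17;  // the 5 bits from the lead byte:         10111
    std::uint16_t ch2 = 0x10;  // the 6 bits from the continuation byte: 010000
    std::uint16_t result = (ch1 << 6) | ch2;
    std::cout << std::showbase << std::hex << result << '\n';  // prints: 0x5d0
}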

1: For combining them:

char bytes[2] = { 0x17, 0x10 }; // for example

unsigned short result = 0;      // 00000000  00000000
result = bytes[0] << 6;         // 101 11000000
result |= bytes[1];             // 101 11010000

2: For printing it out as hex:

std::cout << std::showbase << std::hex << <what you want to print>;

in this case:

std::cout << std::showbase << std::hex << result;
// output: 0x5d0 (the stream prints the numeric value, so endianness doesn't affect it)
runo
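Put together, a complete version of the two steps above (a sketch; unsigned char is used for the buffer so values with the high bit set could not sign-extend):

#include <iostream>

int main() {
    unsigned char bytes[2] = { 0x17, 0x10 };  // the extracted payload bits

    unsigned short result = bytes[0] << 6;    // 101 11000000
    result |= bytes[1];                       // 101 11010000

    std::cout << std::showbase << std::hex << result << '\n';  // prints: 0x5d0
}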

First, from K&R: "Almost everything about bitfields is implementation dependent".

The following works on MS Visual Studio 2008:

#include <stdio.h>
#include <string.h>

struct bitbag {
    unsigned int ch2 : 6;
    unsigned int ch1 : 6;
};

int main ()
{
    struct bitbag bits;

    memset(&bits, 0, sizeof(bits));

    bits.ch1 = 0x17;    // 010111
    bits.ch2 = 0x10;    // 010000

    printf ("0x%06x 0x%06x\n", bits.ch1, bits.ch2);
    printf ("0x%0x\n", bits);

    return 0;
}

Produces the output:

0x000017 0x000010
0x5d0

However, I cannot guarantee that it will work the same way in all compilers. Note the memset, which initialises any padding to zero.

cdarke
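If you want to look at the combined value without passing the struct to printf, one well-defined route (a sketch, still subject to the implementation-defined bit layout) is to memcpy the struct's storage into an integer first:

#include <cstdio>
#include <cstring>

struct bitbag {
    unsigned int ch2 : 6;
    unsigned int ch1 : 6;
};

int main() {
    bitbag bits;
    std::memset(&bits, 0, sizeof bits);   // zero the fields and any padding
    bits.ch1 = 0x17;
    bits.ch2 = 0x10;

    unsigned int raw = 0;
    std::memcpy(&raw, &bits, sizeof raw < sizeof bits ? sizeof raw : sizeof bits);
    std::printf("%#x\n", raw);  /* 0x5d0 on compilers that put ch2 in the low bits */
}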