In C (using gcc), is it possible to use a 4-char [a-z] "literal" (e.g. "enus"/{'e','n','u','s'}/...) as a uint16_t?

I'd like to use a 4-char locale (like 'en-us', though 'enus' is fine as well, since the '-' is superfluous afaik) as an unsigned 16-bit integer directly in my code, without any runtime overhead.

E.g. 'en-us' could be mapped to (('e' - 96) << 12) | (('n' - 96) << 8) | (('u' - 96) << 4) | ('s' - 96). (This is only an example; I'm fine with any mapping/algorithm that at least leaves the value 0 unused, so it can mean "nothing set".)

A solution does not need to be portable (with respect to endianness etc.), but it should have no runtime overhead compared to actually using a uint16_t.

Thank you very much!

P.S.: Feel free to add more tags to the question, wasn't sure what to use other than "c". Thx.

griffin
  • Characters are normally 8 bits each; squeezing 32 bits into a 16-bit number doesn't work readily. Even if you only encode the alphabet, you need 5 bits per letter, so 4 letters require 20 bits, which won't fit into a 16-bit integer. Going with 32 bits for the number is reasonable; 16 bits is probably pushing just a little too hard. – Jonathan Leffler Sep 18 '13 at 14:54
  • Unless you use a custom character encoding, you can't encode the ASCII values of 4 characters into a `uint16_t` but rather into a `uint32_t`. Frankly, it's probably more run-time overhead than a string since you are sacrificing time for space in that case. The time is needed to pack and unpack the string from the `uint` in exchange for the 1 or 2 bytes you're saving not using a zero-terminated string (with or without a hyphen). – lurker Sep 18 '13 at 14:55
  • Ah damn, thanks for the comments, I totally overlooked that simple math there. I was originally gonna ask how to do the same with 2 characters - use ISO 639-1 - but then it occurred to me that en-us and en-gb have different spellings for some words, and other languages may as well ... Gonna edit the question so it's actually answerable. Thanks! – griffin Sep 18 '13 at 15:01
  • Thinking more about it editing the question would make it a different one, so I'm gonna leave it as is and accept the first of the provided answers for now. Thank you all for pointing out the obvious I missed! – griffin Sep 18 '13 at 15:05
  • http://stackoverflow.com/questions/4165131/c-c-switch-for-non-integers/18592322#18592322 – abelenky Sep 18 '13 at 15:07
  • @abelenky though this doesn't answer my original question, it actually helps a lot, so - thank you very much! – griffin Sep 18 '13 at 15:08
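
The arithmetic from the comments can be checked at compile time. This is only a sketch under a few assumptions: an ASCII-compatible character set, a C11 compiler, and made-up names (`LETTER5`, `PACK5`, `PACK2`, `locale5_en_us`) that do not come from any library. It uses the question's `c - 96` idea (a=1 ... z=26, so 0 stays free), but with 5 bits per letter, and shows that the result needs more than 16 bits while a 2-letter code fits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map 'a'..'z' to 1..26, keeping 0 for "nothing set". */
#define LETTER5(c) ((uint32_t)((c) - 'a' + 1))

/* Four letters at 5 bits each need 20 bits, so the result is a uint32_t. */
#define PACK5(a, b, c, d) \
    ((LETTER5(a) << 15) | (LETTER5(b) << 10) | (LETTER5(c) << 5) | LETTER5(d))

/* Two raw 8-bit characters (e.g. an ISO 639-1 code) do fit into 16 bits. */
#define PACK2(a, b) \
    ((uint16_t)(((unsigned char)(a) << 8) | (unsigned char)(b)))

/* Everything above is a constant expression, so these cost nothing at run time. */
static_assert(PACK5('e', 'n', 'u', 's') == 0x2BAB3u, "5-bit packing of enus");
static_assert(PACK5('e', 'n', 'u', 's') > UINT16_MAX, "enus does not fit in 16 bits");
static_assert(PACK2('e', 'n') == 0x656E, "two characters fit in 16 bits");

/* Returns the packed constant; the compiler folds it to a plain number. */
uint32_t locale5_en_us(void)
{
    return PACK5('e', 'n', 'u', 's');
}
```

Because the macros expand to constant expressions, they should also be usable as `case` labels or static initializers.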

3 Answers

2

Treatment of multi-character constants is specified in the GCC documentation. GCC evaluates a multi-character constant by shifting the previous value left by the number of bits per character and ORing the new character.

When the target uses eight-bit characters (which is most common today), four characters will not fit in a uint16_t. To use a uint16_t, you would need to define your own mapping from some literals to a uint16_t.
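
To illustrate the shift-and-OR rule described above, here is a small sketch that assumes GCC (or a compatible compiler such as Clang), 8-bit characters, a 32-bit `int`, ASCII, and C11. `PACK4` and `locale_en_us` are hypothetical names introduced only for this example; the macro spells out the same computation by hand and is compared against the multi-character constant at compile time.

```c
#include <assert.h>
#include <stdint.h>

/* GCC warns about multi-character constants by default; silence that here. */
#pragma GCC diagnostic ignored "-Wmultichar"

/* Hypothetical helper: shift the previous value left by 8 bits, OR in the next char. */
#define PACK4(a, b, c, d)                          \
    (((uint32_t)(unsigned char)(a) << 24) |        \
     ((uint32_t)(unsigned char)(b) << 16) |        \
     ((uint32_t)(unsigned char)(c) << 8)  |        \
      (uint32_t)(unsigned char)(d))

/* 'e' = 0x65, 'n' = 0x6E, 'u' = 0x75, 's' = 0x73 in ASCII. */
static_assert(PACK4('e', 'n', 'u', 's') == 0x656E7573u, "needs 32 bits, not 16");

#if defined(__GNUC__)
/* Implementation-defined: this equality is what GCC documents, not what ISO C guarantees. */
static_assert((uint32_t)'enus' == PACK4('e', 'n', 'u', 's'), "GCC multi-char layout");
#endif

/* Returns the packed locale; this is a compile-time constant. */
uint32_t locale_en_us(void)
{
    return PACK4('e', 'n', 'u', 's');
}
```

The value occupies all 32 bits, which is why a `uint32_t` works where a `uint16_t` cannot.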

Eric Postpischil

You can encode 4 values of 4 bits in a 16-bit integer. With 4 bits you can encode 16 different characters.

Sure, you can encode "enus" in a 16-bit integer if you get to choose how you encode each character, but you can't encode every 4-letter string. There are more than 16 letters in English so some letters just can't be represented.
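
A minimal sketch of such a custom 4-bit encoding, assuming C11 and a made-up 15-letter alphabet (the letter choice, `NIB`, `PACK16` and `uses_american_spelling` are all hypothetical and only for illustration). Codes 1 to 15 are assigned to letters so that 0 can still mean "nothing set"; any letter outside the alphabet also maps to 0, so it simply cannot be represented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical alphabet: e n u s g b d f r i t a p k z -> codes 1..15. */
#define NIB(c)                                                        \
    ((c) == 'e' ? 1  : (c) == 'n' ? 2  : (c) == 'u' ? 3  :            \
     (c) == 's' ? 4  : (c) == 'g' ? 5  : (c) == 'b' ? 6  :            \
     (c) == 'd' ? 7  : (c) == 'f' ? 8  : (c) == 'r' ? 9  :            \
     (c) == 'i' ? 10 : (c) == 't' ? 11 : (c) == 'a' ? 12 :            \
     (c) == 'p' ? 13 : (c) == 'k' ? 14 : (c) == 'z' ? 15 : 0)

/* Four 4-bit codes fill a uint16_t exactly. */
#define PACK16(a, b, c, d) \
    ((uint16_t)((NIB(a) << 12) | (NIB(b) << 8) | (NIB(c) << 4) | NIB(d)))

static_assert(PACK16('e', 'n', 'u', 's') == 0x1234, "enus");
static_assert(PACK16('e', 'n', 'g', 'b') == 0x1256, "engb");

/* The packed values are constant expressions, so they work as case labels.
 * Returns 1 for en-us, 0 for en-gb, -1 for "nothing set" or unknown. */
int uses_american_spelling(uint16_t locale)
{
    switch (locale) {
    case PACK16('e', 'n', 'u', 's'): return 1;
    case PACK16('e', 'n', 'g', 'b'): return 0;
    default:                         return -1;
    }
}
```

The trade-off is the one stated above: only the letters that were given a code can appear in a locale.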

Joni
0

The only thing I could come up with is simply using the uint16_t as an index into a table (of strings or whatever). If you are not memory constrained, you can do this easily and with little overhead.
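
A rough sketch of the table idea, assuming C11; the enum constants, the table contents and the two helper functions are hypothetical names chosen for this example. Index 0 is reserved for "nothing set", and the code itself only ever handles the small integer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical locale IDs; these are plain integers that fit a uint16_t. */
enum {
    LOCALE_NONE = 0,   /* "nothing set" */
    LOCALE_EN_US,
    LOCALE_EN_GB,
    LOCALE_DE_DE,
    LOCALE_COUNT
};

static_assert(LOCALE_NONE == 0, "0 must keep meaning nothing set");

/* The table maps each ID back to its string form. */
static const char *const locale_names[LOCALE_COUNT] = {
    "", "en-us", "en-gb", "de-de"
};

/* ID -> string; unknown IDs yield the empty string. */
const char *locale_name(uint16_t id)
{
    return id < LOCALE_COUNT ? locale_names[id] : "";
}

/* String -> ID; only needed at the edges (parsing config, user input). */
uint16_t locale_from_name(const char *name)
{
    for (uint16_t id = 1; id < LOCALE_COUNT; ++id) {
        if (strcmp(locale_names[id], name) == 0) {
            return id;
        }
    }
    return LOCALE_NONE;
}
```

Comparing or switching on the ID costs the same as any other `uint16_t`; the string lookup is only paid when converting to or from text.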

PeterK