
I'm learning about Unicode in C++ and I'm having a hard time getting it to work properly. I try to treat the individual characters as uint64_t. That works if all I need is to print the characters, but I also need to convert them to uppercase. I could store the uppercase letters in an array and use the same index as for the lowercase letters, but I'm looking for a more elegant solution. I found this similar question, but most of the answers used wide characters, which is not something I can use. Here is what I have attempted:

#include <iostream>
#include <locale>
#include <string>
#include <cstdint>
#include <algorithm>
#include <iterator>  // std::back_inserter

// hacky solution to store a multibyte character in a uint64_t
#define E(c) ((((uint64_t) 0 | (uint32_t) c[0]) << 32) | (uint32_t) c[1])

typedef std::string::value_type char_t;
char_t upcase(char_t ch) {
    return std::use_facet<std::ctype<char_t>>(std::locale()).toupper(ch);
}

std::string toupper(const std::string &src) {
    std::string result;
    std::transform(src.begin(), src.end(), std::back_inserter(result), upcase);
    return result;
}

const uint64_t VOWS_EXTRA[]
{
E("å")  , E("ä"), E("ö"), E("ij"), E("ø"), E("æ")
};

int main(void) {
    char name[5];
    std::locale::global(std::locale("sv_SE.UTF8"));
    name[0] = (VOWS_EXTRA[3] >> 32) & ~((uint32_t)0);
    name[1] = VOWS_EXTRA[3] & ~((uint32_t)0);
    name[2] = '\0';
    std::cout << toupper(name) << std::endl;
}

I expect this to print the character IJ, but it prints the character unchanged (ij).


(EDIT: OK, so I read more about the Unicode support in standard C++ here. It seems like my best bet is to use something like ICU or Boost.Locale for this task. C++ essentially treats std::string as a blob of binary data, so uppercasing Unicode letters properly doesn't seem to be an easy task. I think my hacky solution using uint64_t is no more useful than the C++ standard library, if not worse. I'd be grateful for an example of how to achieve the behaviour stated above using ICU.)
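To illustrate the "blob of binary data" point, here is a minimal sketch of what byte-wise uppercasing does to a UTF-8 string. The function name `bytewise_upper` is hypothetical, and the sketch assumes the classic "C" locale so that the result does not depend on which locales are installed. The character ij (U+0133) is encoded in UTF-8 as the two bytes 0xC4 0xB3, and neither byte is a lowercase letter on its own, so the per-byte conversion has nothing to map.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: uppercases a string one byte at a time, the way a
// std::ctype<char>-based transform does. Assumes the classic "C" locale.
std::string bytewise_upper(std::string s) {
    const std::locale& c_locale = std::locale::classic();
    for (char& ch : s) {
        // Each call sees a single byte, never a whole multibyte character.
        ch = std::toupper(ch, c_locale);
    }
    return s;
}
```

Calling `bytewise_upper("abc")` gives `"ABC"`, while `bytewise_upper("\xC4\xB3")` (the UTF-8 bytes of ij) returns its input unchanged, which matches the behaviour described in the question.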

Linus
  • Please do not try to pretend that Unicode is a fixed-width encoding. – Nicol Bolas Sep 18 '16 at 19:01
  • @NicolBolas sorry I am very inexperienced with unicode, I tried using regular strings but couldn't get it to work with single characters. – Linus Sep 18 '16 at 19:04
  • The `std::locale::global(std::locale("sv_SE.UTF8"))` is incompatible with Windows, unless you use a very special compiler. Microsoft's runtime does not support UTF-8 locales. See docs of `setlocale`. – Cheers and hth. - Alf Sep 19 '16 at 14:50
  • To include UTF-8 literals just use e.g. `u8"Oh, it's that easy?"`. The main problem is that the basic character type is still `char`. – Cheers and hth. - Alf Sep 19 '16 at 14:52
  • Uppercasing and lowercasing for full Unicode can't in general be done on a character-by-character basis. Sometimes a single character maps to two characters in the opposite case. And I think for Greek it depends on the position of the character within a word, or at the end (or was it start?). For the truly pedantic it can't even be done in a locale-independent manner (this is a special problem with Turkish), but I think that fine point is ignored by nearly all software. – Cheers and hth. - Alf Sep 19 '16 at 14:53
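The comments about Unicode not being fixed-width can be made concrete with a small sketch. It assumes the input is valid UTF-8, and the function name `count_code_points` is hypothetical. It counts code points by skipping continuation bytes (those of the form 10xxxxxx), which shows that the number of `char` elements in a std::string is not the number of characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Assumes the input is well-formed UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the UTF-8 encoding of ij, `"\xC4\xB3"`, the string's `size()` is 2 but `count_code_points` returns 1. Note that a code point is still not always what a user perceives as one character, so this only addresses the encoding-width part of the comments.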

2 Answers


Have a look at the ICU User Guide. For simple (single-character) case mapping, you can use u_toupper. For full case mapping, use u_strToUpper. Example code:

#include <unicode/uchar.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

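int main() {
    // Simple case mapping: one code point in, one code point out.
    UChar32 upper = u_toupper(U'ij');
    u_printf("%lC\n", upper);

    // Full case mapping: the result may be longer than the input,
    // e.g. 'ß' uppercases to the two-character string "SS".
    UChar src = u'ß';
    UChar dest[3];  // two UTF-16 code units plus the terminating NUL
    UErrorCode err = U_ZERO_ERROR;
    u_strToUpper(dest, 3, &src, 1, NULL, &err);
    if (U_FAILURE(err)) {
        return 1;  // e.g. the destination buffer was too small
    }
    u_printf("%S\n", dest);

    return 0;
}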
nwellnhof
  • Thanks, sorry for the late accepted answer. It took me several hours to get ICU working. I had a lot of problems with "undefined reference to function" errors. – Linus Sep 23 '16 at 17:56

Also, if anyone else is looking for it: std::towupper and std::towlower seemed to work fine for me. See https://en.cppreference.com/w/cpp/string/wide/towupper
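A minimal sketch of that approach, assuming the text is already available as a std::wstring (the helper name `upcase_wide` is hypothetical). Two caveats apply: std::towupper only performs one-to-one mappings, so ß cannot become "SS", and the result for non-ASCII characters depends on the C locale currently installed with std::setlocale.

```cpp
#include <algorithm>
#include <cwctype>
#include <string>

// Hypothetical helper: uppercases a wide string one wchar_t at a time.
// Uses the current C locale, so call std::setlocale beforehand if
// non-ASCII characters should be mapped.
std::wstring upcase_wide(std::wstring s) {
    std::transform(s.begin(), s.end(), s.begin(), [](wchar_t c) {
        return static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(c)));
    });
    return s;
}
```

For example, `upcase_wide(L"abc")` returns `L"ABC"` in any locale. On platforms where wchar_t is 16 bits wide, characters outside the Basic Multilingual Plane are stored as surrogate pairs and are not handled by this per-element approach.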

smitt