C / C++ UTF-8 upper/lower case conversions

Question

The Problem: There is a method with a corresponding test-case that works on one machine and fails on the other (details below). I assume there's something wrong with the code, causing it to work by chance on the one machine. Unfortunately I cannot find the problem.

Please note that the usage of std::string and utf-8 encoding are requirements I have no real influence on. Using C++ methods would be totally fine, but unfortunately I failed to find anything. Hence the use of C-functions.

The method:

std::string firstCharToUpperUtf8(const std::string& orig) {
  std::string retVal;
  retVal.reserve(orig.size());
  std::mbstate_t state = std::mbstate_t();
  char buf[MB_CUR_MAX + 1];
  size_t i = 0;
  if (orig.size() > 0) {
    if (orig[i] > 0) {
      retVal += toupper(orig[i]);
      ++i;
    } else {
      wchar_t wChar;
      int len = mbrtowc(&wChar, &orig[i], MB_CUR_MAX, &state);
      // If this assertion fails, there is an invalid multi-byte character.
      // However, this usually means that the locale is not utf8.
      // Note that the default locale is always C. Main classes need to set them
      // To utf8, even if the system's default is utf8 already.
      assert(len > 0 && len <= static_cast<int>(MB_CUR_MAX));
      i += len;
      int ret = wcrtomb(buf, towupper(wChar), &state);
      assert(ret > 0 && ret <= static_cast<int>(MB_CUR_MAX));
      buf[ret] = 0;
      retVal += buf;
    }
  }
  for (; i < orig.size(); ++i) {
    retVal += orig[i];
  }
  return retVal;
}

The test:

TEST(StringUtilsTest, firstCharToUpperUtf8) {
  setlocale(LC_CTYPE, "en_US.utf8");
  ASSERT_EQ("Foo", firstCharToUpperUtf8("foo"));
  ASSERT_EQ("Foo", firstCharToUpperUtf8("Foo"));
  ASSERT_EQ("#foo", firstCharToUpperUtf8("#foo"));
  ASSERT_EQ("ßfoo", firstCharToUpperUtf8("ßfoo"));
  ASSERT_EQ("Éfoo", firstCharToUpperUtf8("éfoo"));
  ASSERT_EQ("Éfoo", firstCharToUpperUtf8("Éfoo"));
}

The failed test (only happens on one of two machines):

Failure
Value of: firstCharToUpperUtf8("ßfoo")
  Actual: "\xE1\xBA\x9E" "foo"
Expected: "ßfoo"

Both machines have the locale en_US.utf8 installed. However, they use different versions of libc. The test passes on the machine with GLIBC_2.14, independent of where the binary was compiled. It fails on the other machine, where the binary has to be compiled locally because a binary built elsewhere lacks the proper libc version.

Either way, there is a machine that compiles this code and runs it while it fails. There has to be something wrong with the code and I wonder what. Pointing to C++ methods (STL in particular), would also be great. Boost and other libraries should be avoided due to other outside requirements.

In Unicode, if you are operating on single code points at a time you're doing it wrong. Conversion operations only make sense on ranges. — JoeG, Sep 19 '12 at 11:19
Lowercase sharp s: ß; uppercase sharp s: ẞ. Did you use the uppercase version in your assert? It seems that glibc 2.14 follows unwind's point of view (pre-Unicode 5.1, no uppercase version), while the libc on the other machine uses Unicode 5.1, where ẞ = U+1E9E ... — Kwariz, Sep 19 '12 at 11:37
@Joe Gauterin: I don't. I look at the first char of something that is possibly unicode and if it doesn't degrade to ASCII, I work on ranges, hence the use of len. — b.buchhold, Sep 19 '12 at 11:58
@Kwariz: Thanks a lot. I didn't know such a character existed. That actually solved the whole problem! Maybe you want to turn this comment into an answer. — b.buchhold, Sep 19 '12 at 12:00
the real problem: using the standard libraries with unicode. solution: use windows API on Windows, use ICU for unix. — David Haim, Jun 01 '16 at 14:35
While I really like this solution overall, one should replace `orig[i] > 0` with `(orig[i] & (1 << 7)) == 0` as the original test does not work on systems where `char` is unsigned (e.g. Linux on ARM) — Niklas Schnelle, Jul 11 '18 at 11:01
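
The signedness point from the last comment can be sketched as a small helper. This is only an illustration: the names `isAsciiByte` and `firstCharIsAscii` are made up here and are not part of the question's code. Converting to `unsigned char` before testing the high bit gives the same answer whether plain `char` is signed or unsigned.

```cpp
#include <string>

// Hypothetical helper: true if the byte is a single-byte (ASCII) UTF-8
// character. The cast to unsigned char makes the result independent of
// whether plain char is signed (x86 Linux) or unsigned (e.g. Linux on ARM).
inline bool isAsciiByte(char c) {
  return (static_cast<unsigned char>(c) & 0x80u) == 0;
}

// Hypothetical helper: true if the string starts with an ASCII character.
// An empty string has no first character, so this returns false for it.
inline bool firstCharIsAscii(const std::string& s) {
  return !s.empty() && isAsciiByte(s[0]);
}
```

Note that, unlike `orig[i] > 0`, this check also classifies a NUL byte as ASCII.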

Gelldur · Answer 1 · 2013-09-08T23:26:54.490

10

Maybe someone would use it (maybe for tests)

With this you could make simple converter :) No additional libs :)

http://pastebin.com/fuw4Uizk

1482 letters

Example

Ь <> ь
Э <> э
Ю <> ю
Я <> я
Ѡ <> ѡ
Ѣ <> ѣ
Ѥ <> ѥ
Ѧ <> ѧ
Ѩ <> ѩ
Ѫ <> ѫ
Ѭ <> ѭ
Ѯ <> ѯ
Ѱ <> ѱ
Ѳ <> ѳ
Ѵ <> ѵ
Ѷ <> ѷ
Ѹ <> ѹ
Ѻ <> ѻ
Ѽ <> ѽ
Ѿ <> ѿ
Ҁ <> ҁ
Ҋ <> ҋ
Ҍ <> ҍ
Ҏ <> ҏ
Ґ <> ґ
Ғ <> ғ
Ҕ <> ҕ
Җ <> җ
Ҙ <> ҙ
Қ <> қ
Ҝ <> ҝ
Ҟ <> ҟ
Ҡ <> ҡ
Ң <> ң

edited Sep 08 '13 at 23:26

answered Sep 08 '13 at 23:15

Gelldur

Do you remember how you made this list? I'm trying to make use of it, but a few sanity checks failed (not sorted, duplicates), so it may have been damaged on its way to pastebin, my browser, my clipboard or my IDE. Now I'm trying to get it as `char32_t`s in hex. – Cygon Nov 05 '19 at 20:16
@Cygon I think I found some list online and tidied it up manually. Another solution is to use Python and print such a list on your own, e.g. `print("ĄŻŹĆ".lower())` – Gelldur Nov 06 '19 at 13:25

Gerhard Wesp · Answer 2 · 2018-01-22T16:00:27.200

The following C++11 code works for me (disregarding for a moment the question of how the sharp s should be translated; it's left unchanged. It's slowly being phased out from German anyway).

Optimizations and uppercasing the first letter only are left as an exercise.

Edit: As pointed out, codecvt appears to have been deprecated. It should remain in the standard, however, until a suitable replacement is defined. See Deprecated header <codecvt> replacement

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

std::locale const utf8("en_US.UTF-8");

// Convert UTF-8 byte string to wstring
std::wstring to_wstring(std::string const& s) {
  std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
  return conv.from_bytes(s);
}

// Convert wstring to UTF-8 byte string
std::string to_string(std::wstring const& s) {
  std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
  return conv.to_bytes(s);
}

// Converts a UTF-8 encoded string to upper case
std::string tou(std::string const& s) {
  auto ss = to_wstring(s);
  for (auto& c : ss) {
    c = std::toupper(c, utf8);
  }
  return to_string(ss);
}

void test_utf8(std::ostream& os) {
  os << tou("foo" ) << std::endl;
  os << tou("#foo") << std::endl;
  os << tou("ßfoo") << std::endl;
  os << tou("Éfoo") << std::endl;
}

int main() {
  test_utf8(std::cout);
}

BTW, there are no German words with a sharp S at the beginning. — Gerhard Wesp, Oct 10 '15 at 21:24
Notice that codecvt has been [deprecated since C++17](http://en.cppreference.com/w/cpp/header/codecvt) — usernameiwantedwasalreadytaken, Jan 19 '18 at 11:41
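
Since `<codecvt>` is deprecated, one way to avoid it is to do the UTF-8 conversion by hand. The following is a minimal sketch, not a replacement endorsed by the standard: the function names `decodeUtf8` and `encodeUtf8` are made up for illustration, and the decoder assumes well-formed input (it does not reject overlong forms, surrogates, or stray continuation bytes).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into code points.
// Assumes well-formed input; a truncated trailing sequence stops decoding.
std::u32string decodeUtf8(const std::string& in) {
  std::u32string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char b = static_cast<unsigned char>(in[i]);
    std::size_t len = 1;
    char32_t cp = b;
    if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }       // 110xxxxx
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }  // 1110xxxx
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }  // 11110xxx
    if (i + len > in.size()) break;  // truncated sequence: stop
    for (std::size_t k = 1; k < len; ++k) {
      cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
    }
    out.push_back(cp);
    i += len;
  }
  return out;
}

// Hypothetical helper: encode code points as UTF-8 bytes.
std::string encodeUtf8(const std::u32string& in) {
  std::string out;
  for (char32_t cp : in) {
    if (cp < 0x80) {
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

With these, ß (U+00DF) round-trips as the bytes C3 9F and ẞ (U+1E9E) encodes as E1 BA 9E, which is the byte sequence shown in the failing test output of the question. The case mapping itself still has to come from somewhere else (the locale, a table, or a library).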

score 1 · Answer 3 · edited Sep 19 '12 at 11:22

What do you expect the upper-case version of the German ß character to be, for that test case?

In other words, your basic assumptions are wrong.

Note that the Wikipedia article linked in the comment states:

Sharp s is nearly unique among the letters of the Latin alphabet in that it has no traditional upper case form (one of the few other examples is kra, ĸ, which was used in Greenlandic). This is because it never occurs initially in German text, and traditional German printing (which used blackletter) never used all-caps. When using all-caps, the current spelling rules require the replacement of ß with SS.[1] However, in 2010 its use became mandatory in official documentation when writing geographical names in all-caps.[2]

So, the basic test case, with the sharp s occurring as an initial, violates the rules of German. I still think I have a point, in that the original poster's premise is wrong: strings cannot in general be freely converted between upper and lower case, for all languages.

edited Sep 19 '12 at 11:22

default

answered Sep 19 '12 at 11:09

unwind

@KerrekSB Thanks for the reference, I added some quoted text from it which I feel strengthen my argument ... – unwind Sep 19 '12 at 11:14
2

This is just a needlessly distracting example. It'd be much simpler to use Hebrew, Arabic, Chinese, or any Indic writing system as an example where capitalization doesn't make sense. – Kerrek SB Sep 19 '12 at 11:20
It should not change, exactly as the test expects and as in the case of #foo. This is according to the man pages for towupper. Unfortunately, many strings violate the rules of the language. Take band names, movies, or Wikipedia page titles if you want examples: http://en.wikipedia.org/wiki/%C3%9F Strings may start with this character (unlike German words) and should be converted by leaving the initial character as is. – b.buchhold Sep 19 '12 at 11:52
@KerrekSB: Thanks a lot. I didn't know this character existed, either. It seems to be the source of my problem... – b.buchhold Sep 19 '12 at 12:02
1

There are also dotless i and dotted I in Turkish. So, depending on the locale i<->I can be right or wrong just as I<->ı and i<->İ. – Alexey Frunze Sep 19 '12 at 12:03
@b.buchhold: Yes, indeed, it's `U+1E9E`, which is what you seem to be getting. Everybody wins. – Kerrek SB Sep 19 '12 at 12:05

score 1 · Accepted Answer · answered Sep 19 '12 at 12:10

1

Lowercase sharp s: ß; uppercase sharp s: ẞ. Did you use the uppercase version in your assert? It seems that glibc 2.14 implements the pre-Unicode 5.1 behavior (no uppercase version of sharp s), while the libc on the other machine uses Unicode 5.1, where ẞ = U+1E9E ...

Kwariz

7

This is wrong. Many code points have a 1-to-many mapping between cases. You have to casemap strings, not characters, or your results suck. The correct upcase of U+00DF is "SS". It is *not* U+1E9E!! See the UCD. – tchrist Sep 19 '12 at 13:05
@tchrist Wrong? Well, at least it depends on the designer's and the users' points of view. U+1E9E has the Unicode categories letter and uppercase, and refers to U+00DF as its lowercase version. Whether this reflects general usage in German, I really don't know, but after having read the comments on [this blog](http://www.fontblog.de/typografen-feiern-das-versal-eszett) I doubt it. But you are right: as it is not widely used, the correct uppercase version of a word beginning with a sharp s should be SS (or SZ if you ask a German typographer) ... – Kwariz Sep 19 '12 at 13:24
@tchrist, the "solution" is just one fitting the users point of view (maybe even the point of view of the end users). So what's wrong ? Not following UCD ? – Kwariz Sep 19 '12 at 13:26
1

I suspect in the years to come, most computer users will want/expect `ẞ` as the uppercase for `ß` and find the one-to-two mapping annoying and old-fashioned... – R.. GitHub STOP HELPING ICE Sep 19 '12 at 18:11
4

@tchrist: Upcasing U+00DF to "SS" is not correct for "in Maßen" (in small, modest [amounts]), because it will result in "IN MASSEN" (in massive, large [amounts]). Maßen and Massen are different words in German, in fact opposites; the same goes for Maße (measures) and Masse (mass). – Secure Sep 19 '12 at 19:41
It is even more complicated. When you have an ß in your name, then it is not allowed to change it in official documents. If you're called "Heinz Große", then the required uppercase version for anything official is "HEINZ GROßE". – Secure Sep 19 '12 at 19:56
As complicated and debatable as the upper/lower issue of 'ß' may be, my question revolved around the different behavior on different machines. The Unicode standard being subject to change, and glibc versions having to decide on either version, answers my particular question perfectly well. This is why I gladly accepted this answer. – b.buchhold Sep 09 '13 at 09:17
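
The 1-to-many mapping discussed in these comments can be illustrated with a deliberately tiny sketch. It is not a real case-mapping implementation: the function name `upperCaseWithSharpS` is made up, it only handles ASCII a-z and ß (bytes C3 9F), and it copies every other byte unchanged. The point is only that the output can be longer than the input, which a per-character `toupper`/`towupper` cannot express.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: uppercase ASCII letters and expand ß to "SS".
// All other bytes (including other non-ASCII characters) are copied as is.
std::string upperCaseWithSharpS(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    const unsigned char b = static_cast<unsigned char>(in[i]);
    if (b >= 'a' && b <= 'z') {
      out += static_cast<char>(b - 'a' + 'A');
    } else if (b == 0xC3 && i + 1 < in.size() &&
               static_cast<unsigned char>(in[i + 1]) == 0x9F) {
      out += "SS";  // one code point (two bytes) becomes two characters
      ++i;          // skip the second byte of ß
    } else {
      out += in[i];
    }
  }
  return out;
}
```

Applied to "in Maßen" this yields "IN MASSEN", which is exactly the ambiguity Secure describes above, so whether this mapping is wanted depends on the application.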

score 0 · Answer 5 · answered Jun 01 '16 at 14:20

The issue is that the locales on which the assertion does not fire are compliant, while the locales on which it does fire are non-compliant.

Technical Report N897 required in B.1.2[LC_CTYPE Rationale]:

As the LC_CTYPE character classes are based on the C Standard character-class definition, the category does not support multicharacter elements. For instance, the German character <sharp-s> is traditionally classified as a lowercase letter. There is no corresponding uppercase letter; in proper capitalization of German text the <sharp-s> will be replaced by SS; i.e., by two characters. This kind of conversion is outside the scope of the toupper and tolower keywords.

This Technical Report was published on Dec 25, 2001. But according to: https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

In 2010, the use of the capital ẞ became mandatory in official documentation in Germany when writing geographical names in all-caps

But the topic has not been revisited by the standard committee, so technically independent of what the German government says, the standardized behavior of toupper should be to make no changes to the ß character.

The reason this works inconsistently over machines is setlocale:

Installs the specified system locale or its portion as the new C locale

So it is the non-compliant system locale, en_US.utf8, that is instructing toupper to modify the ß character. Unfortunately, the specialization ctype<char>::classic_table is not available on ctype<wchar_t>, so you cannot modify the behavior. That leaves you with 2 options:

Create a const map<wchar_t, wchar_t> for conversion from every possible lowercase wchar_t to the corresponding uppercase wchar_t

Add a check for an L'ß' like this:

int ret = wcrtomb(buf, wChar == L'ß' ? L'ẞ' : towupper(wChar), &state);
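
The first option can be sketched roughly as follows. This swaps in a different technique from the one named above: instead of a `map<wchar_t, wchar_t>`, it keys the table on UTF-8 byte sequences so no locale or wide-character conversion is needed. The names `upperTable` and `firstCharToUpperWithTable` and the three table entries are purely illustrative assumptions; a real table would need every lowercase letter you care about. Here ß is mapped to itself, matching the expectation in the question's test.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical, deliberately tiny table from lowercase to uppercase,
// keyed on UTF-8 byte sequences. Extend it as needed.
const std::map<std::string, std::string>& upperTable() {
  static const std::map<std::string, std::string> table = {
      {"\xC3\xA9", "\xC3\x89"},  // é -> É
      {"\xC3\xA4", "\xC3\x84"},  // ä -> Ä
      {"\xC3\x9F", "\xC3\x9F"},  // ß stays ß (an explicit design choice)
  };
  return table;
}

// Uppercase only the first character of a UTF-8 string, independent of
// the C locale. Characters missing from the table are left unchanged.
std::string firstCharToUpperWithTable(const std::string& orig) {
  if (orig.empty()) return orig;
  const unsigned char b = static_cast<unsigned char>(orig[0]);
  if (b < 0x80) {  // ASCII: handle a-z directly
    std::string out = orig;
    if (b >= 'a' && b <= 'z') out[0] = static_cast<char>(b - 'a' + 'A');
    return out;
  }
  // Length of the first UTF-8 sequence, derived from the lead byte.
  std::size_t len = (b & 0xE0) == 0xC0 ? 2
                  : (b & 0xF0) == 0xE0 ? 3
                  : (b & 0xF8) == 0xF0 ? 4 : 1;
  if (len > orig.size()) len = orig.size();  // truncated input
  const auto it = upperTable().find(orig.substr(0, len));
  if (it == upperTable().end()) return orig;
  return it->second + orig.substr(len);
}
```

Because the result no longer depends on which Unicode version the installed libc implements, the behavior is the same on both machines, at the cost of maintaining the table yourself.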

Live Example
