As you've discovered, the classic conversion routines like the CRT's toupper and Win32's CharUpper are rather dumb. They generally hail from the time when all the world was assumed to be ASCII.
What you need is a linguistically sensitive conversion. This is not only computationally more expensive, but also very difficult to implement correctly. Languages are hard. So you want to offload the responsibility to a standard library if at all possible. Since you're using MFC, you're obviously targeting the Windows operating system, which means you're in luck. You can piggyback on the hard work of Microsoft's localization engineers, which gives you the additional benefit of consistency with the shell and other OS components.
The function you need to call is LCMapStringEx (or LCMapString if you are still targeting pre-Vista platforms). The complexity of this function's signature serves as strong testament to the complicated task of proper linguistically aware string handling.
- First, you need to choose a locale. You usually want the user's default locale, which you can specify with LOCALE_NAME_USER_DEFAULT, but you can use anything you want here.
- For the flags, you will want LCMAP_UPPERCASE | LCMAP_LINGUISTIC_CASING. To do the reverse operation, you'd use LCMAP_LOWERCASE | LCMAP_LINGUISTIC_CASING. There are lots of other interesting and useful options here to keep in mind, too.
- Then you have a pointer to the source string, and its length in characters (code units).
- And a pointer to a string buffer that receives the results, as well as its maximum length in characters (code units).
- The final three parameters can simply be set to NULL or 0.
Putting it all together:
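BOOL ConvertToUppercase(std::wstring& buffer)
{
    if (buffer.empty())
        return TRUE;  // nothing to convert; LCMapStringEx rejects a zero source length

    // LCMapStringEx takes int lengths and returns the number of characters
    // written, or 0 on failure.
    const int length = static_cast<int>(buffer.length());
    return LCMapStringEx(LOCALE_NAME_USER_DEFAULT /* or whatever locale you want */,
                         LCMAP_UPPERCASE | LCMAP_LINGUISTIC_CASING,
                         buffer.c_str(),
                         length,
                         &buffer[0],
                         length,
                         NULL,
                         NULL,
                         0) != 0;
}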
Note that I'm doing an in-place conversion of the buffer's contents here, and therefore assuming that the uppercased string is exactly the same length as the original input string. This is probably true, but it may not be a universally safe assumption, so you will want to add handling for such errors (ERROR_INSUFFICIENT_BUFFER), defensively add some extra padding to the buffer, or both.
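If you would rather not rely on the same-length assumption, one option is to call the function twice: once with a destination size of 0 to ask for the required length, and once more to do the conversion into a separate buffer. The following is only a sketch of that pattern. The helper name ToUpperLinguistic and its localeName parameter are made up for illustration, and returning an empty string on failure is an arbitrary choice.

```cpp
#ifdef _WIN32  // Windows-only: LCMapStringEx is declared in <windows.h>
#include <windows.h>
#include <string>

// Hypothetical helper (not part of any library): returns an uppercased copy
// of `input`, or an empty string if the conversion fails.
std::wstring ToUpperLinguistic(const std::wstring& input,
                               LPCWSTR localeName = LOCALE_NAME_USER_DEFAULT)
{
    if (input.empty())
        return std::wstring();  // LCMapStringEx rejects a source length of 0

    const DWORD flags = LCMAP_UPPERCASE | LCMAP_LINGUISTIC_CASING;
    const int sourceLength = static_cast<int>(input.length());

    // First call: a destination size of 0 asks for the required length.
    const int required = LCMapStringEx(localeName, flags,
                                       input.c_str(), sourceLength,
                                       NULL, 0, NULL, NULL, 0);
    if (required == 0)
        return std::wstring();  // call GetLastError() for the reason

    // Second call: convert into a buffer of exactly that size.
    std::wstring result(static_cast<size_t>(required), L'\0');
    const int written = LCMapStringEx(localeName, flags,
                                      input.c_str(), sourceLength,
                                      &result[0], required, NULL, NULL, 0);
    result.resize(static_cast<size_t>(written));  // 0 on failure, so empty
    return result;
}
#endif
```

Calling ToUpperLinguistic(L"istanbul", L"tr-TR") is where the linguistic casing flag should matter: under Turkish rules the lowercase i is expected to map to the dotted capital İ (U+0130) rather than to a plain I. The cost is an extra allocation and a second API call, which is usually negligible next to the conversion itself.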
If you'd prefer to use CRT functions like you're doing now, _totupper_l and its friends are wrappers around LCMapString/LCMapStringEx. Note the _l suffix, which indicates that these are the locale-aware conversion functions. They allow you to pass an explicit locale, which will be used in the conversion.
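As a rough illustration of the CRT route, the sketch below creates a locale object with _create_locale, passes it to _towupper_l (the wide-character form that _totupper_l maps to in Unicode builds) for each character, and releases it with _free_locale. The helper name UppercaseWithCrtLocale is hypothetical, and the sketch assumes a CRT recent enough to accept locale names such as "en-US".

```cpp
#ifdef _WIN32  // Windows-only: the _l functions are Microsoft CRT extensions
#include <locale.h>
#include <wchar.h>
#include <string>

// Hypothetical helper: uppercases `text` in place, one character at a time,
// using the CRT locale named by `localeName` (for example "en-US").
bool UppercaseWithCrtLocale(std::wstring& text, const char* localeName)
{
    // Only the character-classification category is needed for case mapping.
    _locale_t locale = _create_locale(LC_CTYPE, localeName);
    if (locale == NULL)
        return false;  // unknown or unsupported locale name

    for (wchar_t& ch : text)
        ch = static_cast<wchar_t>(_towupper_l(ch, locale));

    _free_locale(locale);
    return true;
}
#endif
```

Because this maps one character at a time, the result always has the same length as the input, so the buffer-size concern above does not arise. The flip side is that a per-character function cannot express any mapping that would change the string's length.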