I'm developing a multi-language piece of software (first time working on anything other than English).

I've made code that reads in multiple localization files, the user then selects their language, and that localization file is used.

This all works fine and dandy, but when I try to display symbols from foreign languages (like Korean) it does not show the correct symbols.

Is there something special I need to do to store Chinese, Korean, Japanese, etc. in strings? One of my Korean localization files looks like this:

[Labels]
Username=사용자 이름
Password=암호

So in my code I have a function that gets the designated string like this...

const std::string& UsernameLabel = GetLocalizationString("Korean", "Labels", "Username");
const std::string& PasswordLabel = GetLocalizationString("Korean", "Labels", "Password");
Rick
  • How do you display them? By printing into standard output? – HolyBlackCat Jan 31 '18 at 07:59
  • "Foreign language" seems not the real issue. The real issue is probably the handling of non-ASCII characters. Also, your foreign language is someone's native language. –  Jan 31 '18 at 08:00
  • May want to look into `std::wstring`, but I'm not sure what encoding you are working with. – user4581301 Jan 31 '18 at 08:07
  • How are these strings encoded? Maybe UTF-8 or ISO-2022-KR? And what is the encoding of the terminal? What is shown instead of the correct symbols? – Olaf Dietsche Jan 31 '18 at 08:08
  • UTF8 is probably what you are looking for. https://stackoverflow.com/questions/3011082/how-to-write-a-stdstring-to-a-utf-8-text-file – schorsch312 Jan 31 '18 at 08:09
  • Yes, UTF8 with std::string is the choice. Forget std::wstring and wchar_t, unless operating with the Windows API. – The Techel Jan 31 '18 at 08:10
  • @HolyBlackCat I will be displaying them via std::cout, but right now I am just checking their value in the debugger watch list (they turn into gibberish). – Rick Jan 31 '18 at 08:13
  • @TheTechel I've never had to change encodings before. Do I need to set my Visual Studio project to use a UTF8 encoding (somehow) or just the text file that stores all the localization? – Rick Jan 31 '18 at 08:15
  • On Windows the dominant encoding seems to be [UTF-16](https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows), see also https://stackoverflow.com/q/166503/1741542 – Olaf Dietsche Jan 31 '18 at 08:17

2 Answers

The root of the issue is std::string itself, as it deals with chars (each equal to 1 byte in most cases). As soon as you plan to develop multi-language software, you have to do one of the following:

  • Use std::wstring as it deals with "wide chars" (usually 2 bytes on Windows). Easy to do, covers most cases.
  • Step away from the standard string classes and use UTF-8 (or UTF-32, etc.) encoding to represent UI text. This means working with byte buffers rather than strings, because some symbols are encoded with multiple bytes and some byte sequences are not standalone symbols at all (like emoji modifiers for skin color, gender, etc.). This is the most correct approach, but it may be time-consuming.
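
To make the "multiple bytes per symbol" point concrete, here is a minimal sketch. It assumes the localization file is saved as UTF-8 and that a std::string is used purely as a byte container. The names CountUtf8CodePoints and CheckPasswordLabel are hypothetical helpers, not part of the asker's code, and the counter assumes well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 byte sequence
// by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8; it does no validation.
std::size_t CountUtf8CodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return count;
}

// Usage example: the Korean "Password" label from the question, written as
// escaped UTF-8 bytes so the result does not depend on the source file's
// encoding.
void CheckPasswordLabel() {
    const std::string label = "\xEC\x95\x94\xED\x98\xB8";
    assert(label.size() == 6);                // six bytes in the buffer
    assert(CountUtf8CodePoints(label) == 2);  // but only two Hangul syllables
}
```

So size() reports bytes, not symbols. Cutting such a buffer at an arbitrary byte index can split a symbol in half.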

Update: also, you may find this discussion useful: std::wstring VS std::string

Melebius
Yury Schkatula
  • Interesting side note, according to the C++ standard, "The sizeof operator yields the number of bytes" and "sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1" (Quoting N4700, [expr.sizeof]) so I'm pretty sure that means no matter how many bits are in a `char`, it is still one byte. – user4581301 Jan 31 '18 at 08:43
  • @user4581301 As far as the C++ standard is concerned, a byte and a `char` are pretty much the same thing. – Sebastian Redl Jan 31 '18 at 09:28
  • "Wide chars" have some of the same problems as UTF-8, i.e. some symbols are represented by more than one 16-bit "wide char"; so if you want to do it right you gain very little by doing that. – Hans Olsson Jan 31 '18 at 10:13
  • @HansOlsson, as long as you work with UTF-8 as a byte stream/buffer, "there is no char". Just let the system-level API render those bytes on the screen. – Yury Schkatula Jan 31 '18 at 11:07
  • @YurySchkatula agreed for UTF-8, my point is that this also holds for "wide chars", and you cannot just drop the right-most "wide character" if the string is too long or assume that 10 wide characters are twice as long as 5 wide characters (on the screen). – Hans Olsson Jan 31 '18 at 11:11
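
The point made in these comments about 16-bit wide characters can be sketched as follows. This is only an illustration: it uses std::u16string to stand in for 2-byte wide chars as found on Windows, the names CountUtf16CodePoints and CheckSurrogatePair are hypothetical, and the counter assumes well-formed UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 sequence by skipping
// low (trailing) surrogates, which only ever complete a surrogate pair.
// Assumes well-formed UTF-16; it does no validation.
std::size_t CountUtf16CodePoints(const std::u16string& utf16) {
    std::size_t count = 0;
    for (char16_t unit : utf16) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;  // BMP unit or high surrogate starts a new code point
        }
    }
    return count;
}

void CheckSurrogatePair() {
    const std::u16string hangul = u"\uC554\uD638";  // two Hangul syllables
    const std::u16string emoji = u"\U0001F600";     // one code point above U+FFFF
    assert(hangul.size() == 2 && CountUtf16CodePoints(hangul) == 2);
    assert(emoji.size() == 2 && CountUtf16CodePoints(emoji) == 1);
}
```

Korean text happens to fit in one 16-bit unit per syllable, but the emoji needs two units, so dropping the last unit of a wide string can still leave half a symbol behind.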

Wide characters, such as Unicode, should suit your situation. I am from one of the countries mentioned above; I hope this helps.

EricChiu