Best practices for localized texts in C++ cross-platform applications?

Question

In the current C++ standard (C++03), there are too few specifications about text localization and that makes the C++ developer's life harder than usual when working with localized texts (certainly the C++0x standard will help here later).

Assuming the following scenario (which is from real PC-Mac game development cases):

responsive (real time) application: the application has to minimize non-responsive times to "not noticeable", so speed of execution is important.
localized texts: displayed texts are localized in more than two languages, potentially more - don't expect a fixed number of languages, should be easily extensible.
language defined at runtime: the texts should not be compiled in the application (nor having one application per language), you get the chosen language information at application launch - which implies some kind of text loading.
cross-platform: the application is coded with cross-platform in mind (Windows - Linux/Ubuntu - Mac/OSX), so the localized text system has to be cross-platform too.
stand-alone application: the application provides all that is necessary to run it; it won't use any environment library or require the user to install anything other than the OS (like most games for example).

What are the best practices to manage localized texts in C++ in this kind of application?

I looked into this last year, and the only thing I'm sure of is that you should use std::wstring or std::basic_string<ABigEnoughType> to manipulate the texts in the application. I stopped my research because I was working more on the "text display" problem (in the case of real-time 3D), but I guess there are some best practices to manage localized texts in raw C++ beyond just that and "use Unicode".

So, all best-practices, suggestions and information (cross-platform makes it hard I think) are welcome!

Asked and answered: http://stackoverflow.com/questions/185291/best-way-to-design-for-localization-of-strings#185356 — Martin York, Dec 31 '08 at 18:56
Thanks for pointing out that question; it's pretty similar (if you forget the MFC usage). But the answer doesn't seem totally right to me... the current highest-voted answer requires compile-time localization... – Klaim, Dec 31 '08 at 19:02

Aaron · Accepted Answer · 2009-01-01T15:18:10.537

At a small Video Game Company, Black Lantern Studios, I was the Lead developer for a game called Lionel Trains DS. We localized into English, Spanish, French, and German. We knew all the languages up front, so including them at compile time was the only option. (They are burned to a ROM, you see)

I can give you information on some of the things we did. Our strings were loaded into an array at startup based on the language selection of the player. Each individual language went into a separate file with all the strings in the same order. String 1 was always the title of the game, string 2 always the first menu option, and so on. We keyed the arrays off of an enum, as integer indexing is very fast, and in games, speed is everything. (The solution linked in one of the other answers uses string lookups, which I would tend to avoid.) When displaying the strings, we used a printf() type function to replace markers with values, producing output such as "Train 3 is departing city 1."
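A minimal sketch of the enum-indexed table described above, assuming one string per line in each language file and that the line order matches the enum. The names StringId, StringTable and the STR_* entries are hypothetical and not taken from the original project, which used a custom binary format rather than plain text.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical string IDs. In the setup described above, this enum was
// generated by a tool so that it always matched the language files.
enum StringId { STR_TITLE, STR_MENU_START, STR_TRAIN_DEPARTING, STR_COUNT };

class StringTable {
public:
    // Reads one string per line, in enum order, from any input stream.
    // Returns false (and keeps the previous contents) if the stream does
    // not contain exactly STR_COUNT lines.
    bool load(std::istream& in) {
        std::vector<std::string> loaded;
        std::string line;
        while (std::getline(in, line)) {
            loaded.push_back(line);
        }
        if (loaded.size() != static_cast<std::size_t>(STR_COUNT)) {
            return false;
        }
        strings_.swap(loaded);
        return true;
    }

    // Plain integer indexing: no hashing or string comparison at lookup time.
    const std::string& get(StringId id) const {
        assert(static_cast<std::size_t>(id) < strings_.size());
        return strings_[id];
    }

private:
    std::vector<std::string> strings_;
};
```

At launch, the application would open the file for the selected language (for example through std::ifstream) and pass it to load(); switching language means loading a different file into the same table.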

Now for some of the pitfalls.

1) Between languages, phrase order is completely different. "Train 3 is departing city 1." translated to German and back ends up being "From City 1, Train 3 is departing". If you are using something like printf() and your string is "Train %d is departing city %d." the German will end up saying "From City 3, Train 1 is departing." which is completely wrong. We solved this by forcing the translation to retain the same word order, but we ended up with some pretty broken German. Were I to do it again, I would write a function that takes the string and a zero-based array of the values to put in it. Then I would use markers like %0 and %1, basically embedding the array index into the string. Update: @Jonathan Leffler pointed out that a POSIX-compliant printf() supports using %2$s type markers where the 2$ portion instructs the printf() to fill that marker with the second additional parameter. That would be quite handy, so long as it is fast enough. A custom solution may still be faster, so you'll want to make sure and test both.
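The %0 / %1 idea from this pitfall could look roughly like the sketch below. formatPositional is a hypothetical helper, limited to ten single-digit markers, and it takes already-formatted strings as arguments; it is not the function used in the original game.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Replaces %0..%9 in `format` with args[0]..args[9]; "%%" yields a literal '%'.
// A marker whose index has no matching argument is copied through unchanged,
// which makes a bad translation visible instead of crashing.
std::string formatPositional(const std::string& format,
                             const std::vector<std::string>& args) {
    std::string out;
    out.reserve(format.size());
    for (std::size_t i = 0; i < format.size(); ++i) {
        const char c = format[i];
        if (c == '%' && i + 1 < format.size()) {
            const char next = format[i + 1];
            if (next == '%') {
                out += '%';
                ++i;  // skip the second '%'
                continue;
            }
            if (next >= '0' && next <= '9') {
                const std::size_t index = static_cast<std::size_t>(next - '0');
                if (index < args.size()) {
                    out += args[index];
                    ++i;  // skip the digit
                    continue;
                }
            }
        }
        out += c;
    }
    return out;
}
```

Because each marker names its argument, a translator can write "From City %1, Train %0 is departing." and the values still land in the right place without any code change.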

2) Languages vary greatly in length. What was 30 characters in English came out sometimes to as much as 110 characters in German. This meant it often would not fit the screens we were putting it on. This is probably less of a concern for PC/Mac games, but if you are doing any work where the text must fit in a defined box, you will want to consider this. To solve this issue, we stripped as many adjectives from our text as possible for other languages. This shortened the sentence, but preserved the meaning, if losing a bit of the flavor. I later designed an application that we could use which would contain the font and the box size and allow the translators to make their own modifications to get the text to fit into the box. Not sure if they ever implemented it. You might also consider having scrolling areas of text, if you have this problem.

3) As far as cross platform goes, we wrote pretty much pure C++ for our Localization system. We wrote custom encoded binary files to load, and a custom program to convert from a CSV of language text into a .h with the enum and file to language map, and a .lang for each language. The most platform specific thing we used was the fonts and the printf() function, but you will have something suitable for wherever you are developing, or could write your own if needed.
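As a rough illustration of the CSV-to-header step, the sketch below reads string keys from the first CSV column and emits the text of an enum. The CSV layout (key first, then one column per language), the function names readKeys and generateEnumHeader, and the enum name StringId are all assumptions; the original converter and its binary .lang output are not shown in the answer.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Extracts the first CSV column (the string key) from each line.
// Assumes keys contain no commas or quotes; blank lines are skipped.
std::vector<std::string> readKeys(std::istream& csv) {
    std::vector<std::string> keys;
    std::string line;
    while (std::getline(csv, line)) {
        if (line.empty()) {
            continue;
        }
        // find() returns npos when there is no comma, so substr keeps the whole line.
        keys.push_back(line.substr(0, line.find(',')));
    }
    return keys;
}

// Produces the text of a header that declares one enumerator per key,
// followed by a STR_COUNT sentinel usable as the table size.
std::string generateEnumHeader(const std::vector<std::string>& keys) {
    std::ostringstream out;
    out << "enum StringId {\n";
    for (std::size_t i = 0; i < keys.size(); ++i) {
        out << "    " << keys[i] << ",\n";
    }
    out << "    STR_COUNT\n};\n";
    return out.str();
}
```

Generating the enum and the language files from the same source keeps the string order and the enum values in sync, which the array-based lookup depends on.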

Note that POSIX-compliant versions of the printf() family support the '%2$s' notation to say "this string format item comes from argument 2" (the '2$' part). This allows you to internationalize the order if you use different format strings for different locales. – Jonathan Leffler, Dec 31 '08 at 20:05
Oh? I was not aware of that. I'm pretty certain we didn't have that on the DS, but that would certainly be a good place to start on PC/Mac. With any solution, you would want to make sure it is fast enough for your uses. Various implementations of printf() may be too slow for your needs. YMMV. =D – Aaron, Dec 31 '08 at 20:24
This is almost exactly how we do it at Halfbrick Studios. The only difference is that we have a few specialized tags (for things like inserting text or changing font colour) and we build a hash table for the "name" of each string, which allows them to be accessed via scripts. — Grant Peters, Feb 27 '09 at 19:10

Peter · Answer 2 · 2019-10-17T12:26:28.307

12

I strongly disagree with the accepted answer. First, the part about using static array lookups to speed up the text lookups is counterproductive premature optimization: calculating the layout for said text and rendering said text uses 2-4 orders of magnitude more time than a hash lookup. If anyone wanted to implement their own language library, it should never be based on static arrays, because doing so trades real benefits (translators don't need access to the code) for imaginary benefits (speed increase of ~0.01%).
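For comparison with the enum-indexed approach, a key-based lookup could be sketched as follows. NamedStringTable and the key names are hypothetical; falling back to the key itself when a translation is missing is an assumption of this sketch, chosen so that gaps show up on screen.

```cpp
#include <string>
#include <unordered_map>

// Maps stable text keys (chosen by developers or translators) to the
// translated text for the currently loaded language.
class NamedStringTable {
public:
    void add(const std::string& key, const std::string& text) {
        texts_[key] = text;
    }

    // Returns the translation, or the key itself when none is loaded.
    // Returned by value so the fallback never refers to a temporary.
    std::string get(const std::string& key) const {
        const auto it = texts_.find(key);
        return it != texts_.end() ? it->second : key;
    }

private:
    std::unordered_map<std::string, std::string> texts_;
};
```

With this layout, adding or reordering strings does not require regenerating a header or recompiling, at the cost of one hash lookup per text.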

Next, writing your own language library to use in your own game is even worse than premature optimization. There are some extremely good reasons to never write your own localization library:

Planning the time to use an existing localization library is much easier than planning the time to write a localization library. Localization libraries exist, they work, and many people have used them.
Localization is tricky, so you will get things wrong. Every language adds a new quirk, which means whenever you add a new language to your own homegrown localization library you will need to change code again to account for the quirks. Did you know that some languages have more than 2 plural forms, depending on the number of items in question? More than 2 genders (more than 10, even)? Also, number and date formats vary a lot between languages.
When your application becomes successful you will want to add support for more languages. Languages nobody on your team speaks fluently. Hiring someone to write a translation will be considerably cheaper if they already know the tools they are working with.
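To make the plural-form point concrete, here is a small sketch of per-language plural rules in the style used by gettext catalogs. The function names and the choosePlural helper are hypothetical; the English and Russian formulas are the commonly used ones for those languages.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// English: two forms. Index 0 for exactly one item, index 1 otherwise.
std::size_t pluralIndexEnglish(unsigned long n) {
    return n != 1 ? 1 : 0;
}

// Russian: three forms, selected by the last one or two digits of n.
std::size_t pluralIndexRussian(unsigned long n) {
    if (n % 10 == 1 && n % 100 != 11) {
        return 0;  // 1, 21, 31, ... but not 11
    }
    if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20)) {
        return 1;  // 2-4, 22-24, ... but not 12-14
    }
    return 2;      // everything else: 0, 5-20, 25-30, ...
}

typedef std::size_t (*PluralRule)(unsigned long);

// Picks the translator-provided form for n. If the translation supplies
// fewer forms than the rule expects, the last available form is used;
// an empty list yields an empty string.
std::string choosePlural(const std::vector<std::string>& forms,
                         PluralRule rule, unsigned long n) {
    if (forms.empty()) {
        return std::string();
    }
    std::size_t index = rule(n);
    if (index >= forms.size()) {
        index = forms.size() - 1;
    }
    return forms[index];
}
```

A homegrown system has to carry one such rule per supported language, which is the kind of per-language code change the point above warns about.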

A very well known and complete localization library is GNU Gettext, which uses the GPL, and should therefore be avoided for commercial work. You can instead use the boost library boost.locale which works with Gettext files, and is free to use and modify for commercial and non-commercial projects of any kind.

edited Oct 17 '19 at 12:26

answered Jul 19 '15 at 22:19


First, it is about text localisation only and without things like localized money. Second, I did/do use boost.locale and totally failed in the case of games. The dictionary control is too inappropriate and limited by what gettext allows you to do. For example, I can't load then unload a dictionary, nor control the associated memory. In the end, for game text, it seems that another solution is necessary, something more focused than gettext (which also doesn't work on consoles, for example). – Klaim Jul 20 '15 at 12:49
Finally, I disagree with the performance statement; it actually depends a lot on what kind of game it is. If it's text heavy and real-time, of course it is a performance concern. However, I do agree that there should be a localization library that matches the concerns specific to game texts. Unfortunately there is none at the moment and all the game companies I know have their own solution, sometimes game-specific instead of studio-specific. It's a sad state of things but it only means that there is an opportunity for another text localization library. – Klaim Jul 20 '15 at 12:50
The Complete Works of William Shakespeare are 2.3 MB, so I don't share your concerns about memory. In regards to CPU performance, I can't imagine a game that will need to do more than 20 text lookups per screen, which even combined will still take less time than converting a single floating point number like 3.1415926535897932384 to a string. I can imagine lots of games that have to convert dozens of floating point numbers to strings every single frame. We are now extremely deep in the territory of premature optimization. – Peter Jul 20 '15 at 13:39
What's the size of the complete works once translated into Japanese or Russian? I assume UTF-8 encoding. Well, I did work on games in contexts where we had less than 1 MB allowed for text in RAM and other constraints like that (cartridge costs vs memory, etc.). It was on older consoles, yes, but it still happens (less today, I agree). On the other points I agree that it might seem like premature optimization, until it's not. For example, loading text in contiguous memory is important for avoiding fragmentation in general, which impacts global speed (I got perf boosts by doing so). – Klaim Jul 20 '15 at 18:08
I don't know if floating point numbers are often used in games; it seems very rare to me (I mean outside developer tools). That's actually a good point but I think most games avoid them or use English notation because it's readable everywhere (but I might be wrong). Or do you have examples? boost.locale/gettext still don't solve the issues I pointed out with loading/unloading. Maybe another library which doesn't focus much on performance but just loads and unloads blocks of translated texts in memory would help. – Klaim Jul 20 '15 at 18:10
"Uses GPL and should therefore be avoided for commercial work" - there are two problems with this sentence: 1. It doesn't use GPL, it uses LGPL, so you can link against it without also releasing your code under GPL nor LGPL. 2. GPL libraries are great for commercial work, if you are not forbidden from using them; great in the sense that they increase the likelihood that your own code is kept publicly-accessible and not hidden behind some corporate wall. – einpoklum Jul 30 '21 at 18:57

score 7 · Answer 3 · answered Dec 31 '08 at 18:43

GNU Gettext does it all.

Milan Babuškov


It's not cross-platform, is it? – Klaim Dec 31 '08 at 18:55
Seems it is; there are sections for C#, Java and Objective-C as well as the usual Linux languages. – gbjbaanb Jan 01 '09 at 15:27
Yes, gettext is cross-platform. I use it at work on both Linux and Windows. – David Jan 08 '09 at 11:52

It is GNU GPL licensed, which significantly reduces its usage. – Alex Che May 05 '20 at 10:52

score 0 · Answer 4 · answered Dec 31 '08 at 19:04

There won't be any additional features in the C++0x standard, as far as I can tell. I suspect the Committee considers this a matter for third-party libraries.

David Thornley


Won't Unicode encoded characters help? http://en.wikipedia.org/wiki/C%2B%2B0x#New_string_literals – Klaim Dec 31 '08 at 19:07
Thank you, I had missed that change. Of course, there's a lot it doesn't cover. – David Thornley Dec 31 '08 at 20:35
