Internationalize C program

Question

I have a C program written for an embedded device, with all of its text in English. It contains code like:

SomeMethod("Please select menu");
OtherMethod("Choice 1");

Say I want to support other languages, but I don't know how much memory this device has. I don't want the strings to end up in a different memory area where there might be less space, which could crash the program. I want them stored in the same memory area and taking the same amount of space as now. So I thought of this:

SomeMethod(SELECT_MENU);
OtherMethod(CHOICE_1);

And a separate header file:

English.h

#define SELECT_MENU "Please select menu"
#define CHOICE_1 "Choice 1"

For other languages:

French.h

#define SELECT_MENU "Text in french"
#define CHOICE_1 "same here"

Now, depending on which language I want, I would include only that header file.

Does this satisfy the requirement that, if I select the English version, the strings of my internationalized program are stored in the same memory region and take the same amount of memory as before? (I know French might take more, but that is a separate issue: French letters can take more bytes.)

I thought that since I am using defines, the strings would be placed in the same place in memory as before.

If the different languages aren't the same number of letters, it will have to take up a different amount of memory... — Daniel, Jun 16 '15 at 18:55
@Daniel: Yes that I know. I meant even if I use English it should be stored in same memory segment (and it will probably take same space..). Maybe I asked question in confusing way? I know of course french will take more bytes because utf8 may encode letter in many bytes say whereas with ASCII that is not the case — , Jun 16 '15 at 18:58
String literals are always stored in the data segment (and are usually read-only) regardless of where / how you declare them. — Filipe Gonçalves, Jun 16 '15 at 18:59
Although many compilers will generate the _exact_ same code each time given the same compiler, options, and source code (aside from maybe `__TIME__`), I would not even count on that. So even with the same source code, I do not recommend assuming the same program/data layout occurred. — chux - Reinstate Monica, Jun 16 '15 at 19:00
@Filipe Gonçalves "String literals are always stored in the data segment" is certainly not defined by the C spec. Far too many compilers out there that use various memory models for the statement to be always true. — chux - Reinstate Monica, Jun 16 '15 at 19:02
@chux: So you say even if I use English text with the macro-enabled program, strings may be stored in a different memory area as compared to the non-macro program (with English text)? — , Jun 16 '15 at 19:03
What is the embedded application, what is the device (and its microcontroller), do you use some OS (e.g. some Embedded Linux)? — Basile Starynkevitch, Jun 16 '15 at 19:19
@user70012: you should **edit your question** to tell this and improve it. — Basile Starynkevitch, Jun 16 '15 at 19:28
@user 70012 Yes. Most compilers are deterministic, but I have come across optimizing compiles that use time-of-day to vary some approaches. The key point is: although it is likely your compiler will provide consistent results, a re-build should certainly provide consistent _functionality_ - not necessarily consistent exact binary code. If you need many strings in a certain place, adjust to place text sequentially, like in a structure and reference via the first element. Better yet, re-write to not need this consistent placement. — chux - Reinstate Monica, Jun 16 '15 at 22:50
@Stevoisiak - Your instinct is correct here; for lessons learned see: https://barrgroup.com/embedded-systems/how-to/firmware-internationalization. — Scott Prive, Sep 27 '22 at 18:50
@Stevoisiak - All the suggestions of "use gettext" are not considering your concern that you can't have every languages loaded at once (gettext will keep all languages in memory) — Scott Prive, Sep 27 '22 at 18:53

Basile Starynkevitch · Answer 1 · 2015-06-16T19:32:10.107

6

At least on Linux and many other POSIX systems, you may be interested in gettext(3) (and in positional arguments in printf(3), e.g. %3$d instead of %d in the format string).

Then you'll code

 printf(gettext("here x is %d and y is %d"), x, y);

and this is common enough that it is customary to

#define _(X) gettext(X)

and code later

printf(_("here x is %d and y is %d"), x, y);

You'll also want to process message catalogs with msgfmt(1)

You'll find several documents on internationalization (i18n) and localization, e.g. Debian Introduction to i18n. Read also locale(7). And you probably should always use UTF-8 today.

The advantage of such message catalogs (all this is already available by default on Linux systems!) is that the internationalization happens at runtime. There is no reason to restrict it to compile time. Message catalogs can be (and often are) translated by people other than the developers. You'll have directories in your file system (e.g. on some cheap flash memory, like an SD card) containing these.
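A minimal sketch of the runtime setup described above, assuming a libc that provides libintl (as glibc does). The domain name "demo", the directory "./locale", and the helpers init_i18n and format_xy are hypothetical names chosen for illustration only.

```c
#include <libintl.h>   /* gettext, bindtextdomain, textdomain */
#include <locale.h>    /* setlocale */
#include <stdio.h>     /* snprintf, size_t */

#define _(X) gettext(X)

/* Select the message catalog. "demo" and "./locale" are assumptions. */
static void init_i18n(void)
{
    setlocale(LC_ALL, "");               /* take the language from the environment */
    bindtextdomain("demo", "./locale");  /* directory that would hold compiled .mo files */
    textdomain("demo");                  /* default domain used by gettext() */
}

/* Format the message into buf. When no translation is found,
   gettext() returns its argument, so the English text is used. */
static int format_xy(char *buf, size_t size, int x, int y)
{
    return snprintf(buf, size, _("here x is %d and y is %d"), x, y);
}
```

Call init_i18n() once at startup, then format_xy(buf, sizeof buf, x, y) wherever the message is needed. With no catalog installed, the output is simply the English text.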

Notice that internationalization & localization is a difficult subject (read more documentation to understand how difficult it can be once you want to handle non-European languages), and the Linux infrastructure handles it quite well (probably better, and more efficiently, than what you are suggesting with your macros). GTK and Qt also have extensive support for internationalization (GTK builds on gettext; Qt ships its own translation tools).

edited Jun 16 '15 at 19:32

answered Jun 16 '15 at 19:11

Basile Starynkevitch


I am afraid if I or maybe others overcomplicate this -maybe I asked the question in confusing way. I am basically just interested that the macro version of the program (English text) be allocated on same memory segment and take same space as non macro version of the program (English text) – Jun 16 '15 at 19:22
If your OS is Linux, do it the conventional and standard way, so use `gettext` like I am suggesting. – Basile Starynkevitch Jun 16 '15 at 19:23
I don't want to overcomplicate this really.. if I need to dig too much about this I will just hardcode strings manually :) and be fine with it; ps installing stuff on this device even just copying some stuff is not trivial usually – Jun 16 '15 at 19:24
You'll actually *simplify* things by using `gettext` and follow Linux habits, since the system already have support for these. Your macro-based solution is non-standard, sub-optimal, and error-prone. – Basile Starynkevitch Jun 16 '15 at 19:26
"non-standard, sub-optimal, and error-prone." why? No like I said if I don't figure out things I will use non macro approach and hardcode strings manually – Jun 16 '15 at 19:28
Because you have an OS providing a good infrastructure for that, and you don't use it (and suggest something much less efficient). – Basile Starynkevitch Jun 16 '15 at 19:29
"You'll have directories in your file system containing these." like I said accessing this device like this I didn't even manage so far :) just know how to write compiled files in it ... etc. – Jun 16 '15 at 19:29
and your approach doesn't have memory issues like I outlined? Anyway like I said to "talk" to this device isn't easy to I usually avoid installing some libraries etc. or anything to it, I don't know anything beyond writing compiled binary to this device yet – Jun 16 '15 at 19:31
Will `gettext` work on other platforms? (Windows, Android, iOS, etc) – Stevoisiak Jan 02 '20 at 18:35
No; internationalization and localization are platform-specific. Qt & GTK has been ported to many platforms – Basile Starynkevitch Jan 02 '20 at 20:50

score 2 · Answer 2 · edited May 23 '17 at 12:22

Let me get this straight: You want to know that if preprocessor-defined variables (in your case, related to i18n) were swapped out before compile, that they would (a) take the same amount of memory (between the macro and non-macro version) and (b) be stored in the same program segment?

The short answer is (a) yes and (b) yes-ish.

For the first part, this is easy. Preprocessor-defined constants are whole-text replaced with their #define'd values by the preprocessor before being passed into the compiler. So, to the compiler,

#define SELECT_MENU "Please select menu"
// ...
SomeMethod(SELECT_MENU);

is read in as

SomeMethod("Please select menu");

and therefore will be identical for all intents and purposes except for how it appears to the programmer.
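As a rough check of that substitution, the sketch below compares the macro form with the hand-written literal. The helper names are hypothetical; only SELECT_MENU and its text come from the question.

```c
#include <string.h>

#define SELECT_MENU "Please select menu"

/* Returns the literal through the macro. */
static const char *menu_with_macro(void)    { return SELECT_MENU; }

/* Returns the same text written out by hand. */
static const char *menu_without_macro(void) { return "Please select menu"; }

/* 1 if both forms have the same size and the same bytes, else 0. */
static int same_text_and_size(void)
{
    return sizeof(SELECT_MENU) == sizeof("Please select menu")
        && strcmp(menu_with_macro(), menu_without_macro()) == 0;
}
```

You can also inspect the preprocessor output directly (for example with gcc -E) to confirm that the macro name no longer appears before compilation proper starts.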

For the second part, this is a bit more complex. A string literal in a C program has static storage duration, and compilers typically place it in a read-only data section of the binary. The exception is a literal used as the initial contents of a char array: there the array is the object, and it lives wherever it is declared (on the stack for a local array, in the data segment for a global or static one), as discussed in the answers to this question. Neither case involves the heap unless you allocate and copy yourself. So where the bytes end up depends on how the preprocessor-defined constant is used in the program.

Considering what I said in the first part, if you have char buffer[] = MY_CONSTANT; inside a function, the compiler has to fill a local array each time the function runs, either by copying from a stored copy of the literal or by emitting instructions that build it in place, which increases the code segment. If you have someFunction(MY_CONSTANT);, or char* c_str = MY_CONSTANT;, then it will likely be stored in the (read-only) data segment, and you will receive a pointer to that area at runtime. There are many ways this may manifest in your actual program; having the strings #define'd does not by itself determine how they will be stored in your compiled program, but if they are only used in certain ways, you can be reasonably certain where they will be stored.
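The two usages can be contrasted in a small sketch. MY_CONSTANT is given an example value here, and the function names are hypothetical.

```c
#include <string.h>

#define MY_CONSTANT "Choice 1"

/* Pointer form: no copy is made. The pointer refers to the literal,
   which has static storage duration. */
static const char *as_pointer(void)
{
    const char *c_str = MY_CONSTANT;
    return c_str;
}

/* Array form: buffer is a separate, writable copy that lives in this
   function's own storage (typically the stack). */
static size_t copy_and_edit(char *out, size_t out_size)
{
    char buffer[] = MY_CONSTANT;   /* sizeof buffer == strlen + 1 */
    buffer[0] = 'c';               /* allowed: changes only the copy */
    if (out_size < sizeof buffer)
        return 0;
    memcpy(out, buffer, sizeof buffer);
    return sizeof buffer;
}
```

Writing through the pointer form would be undefined behavior, which is why it is declared const char * here.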

EDIT Modified first half of answer to accurately address what is being asked, thanks to @esm's comment.

The asker is not asking whether the English vs. French version will take the same amount of memory but whether the English non-macro version will take the same amount of memory as the English macro-version. — missimer, Jun 16 '15 at 19:14
char buffer[] = MY_CONSTANT; - this I don't have I am talking about hardcoded literals — , Jun 16 '15 at 19:44
Then the answer is yes(ish). As long as you're only passing references to the string literal, then all instances of an identical string literal will resolve to a pointer to that literal within the data segment. — Shotgun Ninja, Jun 16 '15 at 19:46
The caveat is that if you do eventually include a line like the above, and the compiler resolves it first, then it will probably build that in code (since it's faster at runtime than copying it from the data segment into stack or heap space). This depends on the compiler and its settings as well; setting compilation options for a smaller program size will probably make array initialization copy from the data segment instead of an inline initializer (esp. if the string literal is in the data segment anyway). — Shotgun Ninja, Jun 16 '15 at 19:48

score 1 · Answer 3 · edited May 23 '17 at 12:08


To answer the question of whether the English macro version takes the same amount of memory, and places its strings in the same section of the program, as the English non-macro version: the answer is yes.

The C preprocessor (CPP) replaces every instance of a macro with the string for the chosen language, and after the CPP run it is as if the macros were never there. The strings are still placed in the read-only data section of the binary, assuming the target supports one, just as if you hadn't used macros.

So, to summarize, the English version with macros and the English version without macros are the same as far as the C compiler is concerned, see link

edited May 23 '17 at 12:08


answered Jun 16 '15 at 18:57

missimer


Yes I am mainly interested now that by storing strings in global vars say I don't store in some memory segment where they wont fit...From usability point of view this approach is also not very bad IMO. But if I put them in defines I thought *nothing* would change basically for english text – Jun 16 '15 at 19:01
For just simple cases this approach is probably suitable, again I am not an expert in internationalization (you don't often find it in kernels). And you are correct the program will be identical, as what is passed to the guts of the C compiler is the output of the CPP, which will be identical. – missimer Jun 16 '15 at 19:04

Barmak Shemirani · Answer 4 · 2015-06-16T19:42:01.963

The way you are doing that, if you compile the program as English, then French words will not be stored in the English version of the program.

The compiler will not even see the French words. The French words will not be in the final executable.
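A single-file sketch of that compile-time selection might look like the following. LANG_FRENCH is a hypothetical build flag (for example passed as -DLANG_FRENCH), the French strings are illustrative, and the accessor functions exist only to make the result observable.

```c
/* LANG_FRENCH is an assumed build flag, not something from the question. */
#if defined(LANG_FRENCH)
#define SELECT_MENU "Veuillez choisir un menu"
#define CHOICE_1    "Choix 1"
#else  /* default build: English */
#define SELECT_MENU "Please select menu"
#define CHOICE_1    "Choice 1"
#endif

/* Only the strings of the selected language reach the compiler. */
static const char *select_menu_text(void) { return SELECT_MENU; }
static const char *choice_1_text(void)    { return CHOICE_1; }
```

In a real project the two branches would live in English.h and French.h as in the question; they are folded into one file here only to keep the sketch self-contained.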

In some cases, the compiler may see some data, but it chooses to ignore that data if the data is not being used in the program.

For example, consider this function:

#include <stdio.h>

void foo(void) {
    printf("qwerty\n");
}

If you define this function but never call it, a toolchain that discards unused code (for example, when the function is static, or when the linker removes unused sections) can drop both foo and the string "qwerty" from the final executable. This is not guaranteed; it depends on your compiler and linker settings.

Using a macro doesn't make any difference. For example, foo and foo2 compile identically.

#define SOME_TEXT "qwerty\n"
void foo2(void) {
    printf(SOME_TEXT);
}

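The string data is stored with the program's static data (usually a read-only section), not on the heap or the stack. It takes stack space only if you copy it into a local array, and that matters only if the text is large compared with the stack available on your device, which can be quite small on embedded targets.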

So basically you don't have anything to worry about except the final size of the program.

I know what I am looking for now is if my macro based program will store english text in different memory area than the non macro program? (this is what I am actually interested in). Also if it will take same size — , Jun 16 '15 at 19:07

Weather Vane · Answer 5 · 2015-06-16T19:35:45.793

1

The pre-processor use here is simple substitution: there is no difference in the executable code between

SomeMethod("Please select menu");

and

#define SELECT_MENU "Please select menu"
...
SomeMethod(SELECT_MENU);

But the memory usage is unlikely to be exactly the same for each language.

In practice, messages are often more complicated than a simple translation. For example in the message

Input #4 is dangerous

Would you have

#define DANGER "Input #%d is dangerous"
...
printf(DANGER, inpnum);

Or would you do

#define DANGER "Dangerous input #"
...
printf(DANGER); 
printf("%d", inpnum);

I use these examples to show that you must consider language versions from the outset, not as an easy post-fix.
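To illustrate why the single-format version is usually easier to translate, here is a small sketch in which the position of the number is part of the translatable string. The second format string is a made-up word order, and format_danger is a hypothetical helper.

```c
#include <stdio.h>

/* Illustrative format strings. The second one is an invented word order
   standing in for a language that puts the number last. */
#define DANGER_NUMBER_FIRST "Input #%d is dangerous"
#define DANGER_NUMBER_LAST  "Dangerous input: #%d"

/* The calling code stays the same whichever format is selected;
   fmt must come from a trusted table and contain exactly one %d. */
static int format_danger(char *buf, size_t size, const char *fmt, int inpnum)
{
    return snprintf(buf, size, fmt, inpnum);
}
```

With the split version (two printf calls), the code itself would have to change for a language that orders the words differently.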

Since you mention "a device" and are concerned with memory usage, I guess you are working with embedded. My own preferred method is to provide language modules containing an array of words or phrases, with #define to reference the array element to use to piece together a message. That could also be done with enum.

For example (in practice the English language source file would be included separately):

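#include <stdio.h>

/* English phrase table; const lets the toolchain keep it in read-only memory */
static const char *const message[] = { "Input",
                                       "is dangerous" };

/* indices into message[] */
#define M_INPUT     0
#define M_DANGER    1

int main(void)
{
    int input = 4;
    printf("%s #%d %s\n", message[M_INPUT], input, message[M_DANGER]);
    return 0;
}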

Program output:

Input #4 is dangerous
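The enum variant mentioned above could be sketched as follows. The names message_id, M_COUNT, and get_message are hypothetical, and the designated initializers assume a C99 or later compiler.

```c
/* Message identifiers; M_COUNT is the number of entries in the table. */
enum message_id { M_INPUT, M_DANGER, M_COUNT };

/* One table per language module; this is the English one.
   const on both levels typically lets the table live in read-only memory. */
static const char *const message[M_COUNT] = {
    [M_INPUT]  = "Input",
    [M_DANGER] = "is dangerous",
};

/* Bounds-checked lookup: returns an empty string for an unknown id. */
static const char *get_message(enum message_id id)
{
    return ((unsigned)id < M_COUNT) ? message[id] : "";
}
```

Compared with separate #define indices, the enum keeps the identifiers and the table size in sync, since M_COUNT grows automatically when a new message is added.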

edited Jun 16 '15 at 19:35

answered Jun 16 '15 at 19:25

Weather Vane


"But the memory usage is unlikely to be exactly the same for each language.". I am ONLY interested if memory usage (where strings are stored and nr. of bytes) is same for macro based version of the program (for English text) and non macro version of the program (also Engish text). That's it – Jun 16 '15 at 19:35
Well since languages are different, you will have to contrive your library of messages so the total memory is the same. Nobody but you can do that. Example "thank you" and "merci". – Weather Vane Jun 16 '15 at 19:37
I explained at the start of my answer, there is **NO DIFFERENCE** between memory requirement of macro and non-macro. – Weather Vane Jun 16 '15 at 19:39

"Well since languages are different, you will have to contrive your library of messages so the total memory is the same. Nobody but you can do that. Example "thank you" and "merci". " --> This I am not interested in, like I said I realize french may take more bytes and I am ok with it. I am just interested if macro based vs non macro based (English text for both) takes: (1) same amount of space and (2) stores strings in same memory area. – Jun 16 '15 at 19:43
Please read the comments and answers that many people have taken time and trouble to compose. But most of all, please read up what the preprocessor does with `#define`, which is most simply stated as **text substitution before compilation** – Weather Vane Jun 16 '15 at 19:45
I have done that but the varying responses is actually what confuses me!. So ps you say in both (1) and (2) cases it doesn't change situation – Jun 16 '15 at 19:48
I would have this "#define DANGER "Input #%d is dangerous"" what is the issue with this? – Jun 16 '15 at 19:58
The issue is that the syntax of languages is different, as well as their vocabulary, so it is not always easy to put together a complex message including numerical or other information. – Weather Vane Jun 16 '15 at 20:02
I think in my case that is not issue. I guess I will stick with defines if I get no difference with memory(also why your approach is better than defines?). Global variables like you define I avoided indeed because I was not sure where they would be stored etc – Jun 16 '15 at 20:05
