Alter Mann's accepted answer is along the correct lines, except that you should not hardcode a custom function to count the bytes in a multibyte string that do not encode a visible character. Instead, localize the program with setlocale(LC_ALL, "") or similar, and use strlen(str) - mbstowcs(NULL, str, 0) to count the extra bytes in the string, that is, the bytes beyond one per character.
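setlocale() is standard C (C89, C99, C11), and is also defined in POSIX.1. mbstowcs() is standard C99 and C11, and is also defined in POSIX.1; calling it with a NULL destination to obtain only the character count is specified by POSIX and documented for the Microsoft C libraries. Both functions are implemented in Microsoft C libraries as well, so they work basically everywhere.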
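One detail worth handling explicitly: mbstowcs() returns (size_t)-1 when the string contains a byte sequence that is not valid in the current locale. The following is a minimal sketch of the width computation with that check. The helper name pad_field_width and the fall-back to counting bytes are illustrative choices of mine, not part of any library.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: the field width to give printf("%-*s") so that 'ms'
   is padded to 'columns' characters. printf() counts bytes, not characters,
   so the extra bytes of multibyte characters are added to the width. */
static int pad_field_width(const char *ms, int columns)
{
    size_t bytes, chars;

    if (!ms)
        return columns;

    bytes = strlen(ms);
    chars = mbstowcs(NULL, ms, 0);

    /* (size_t)-1 signals an invalid multibyte sequence. As a fall-back
       (an assumption of this sketch), treat every byte as one character. */
    if (chars == (size_t)-1)
        chars = bytes;

    return columns + (int)(bytes - chars);
}
```

After calling setlocale(LC_ALL, "") once at program start, it would be used as printf(">%-*s<\n", pad_field_width(s, 10), s).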
Consider the following example program, which prints the C strings specified on the command line:
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <stdio.h>

/* Counts the number of (visible) characters in a string.
   Falls back to the byte count if the string is not valid
   in the current locale (mbstowcs() returns (size_t)-1). */
static size_t ms_len(const char *const ms)
{
    size_t n;

    if (!ms)
        return 0;

    n = mbstowcs(NULL, ms, 0);
    return (n == (size_t)-1) ? strlen(ms) : n;
}

/* Number of bytes that do not generate a visible character in a string */
static size_t ms_extras(const char *const ms)
{
    if (ms)
        return strlen(ms) - ms_len(ms);
    else
        return 0;
}

int main(int argc, char *argv[])
{
    int arg;

    /* Use the locale specified by the user's environment */
    setlocale(LC_ALL, "");

    for (arg = 1; arg < argc; arg++)
        printf(">%-*s< (%zu bytes; %zu chars; %zu bytes extra in wide chars)\n",
               (int)(10 + ms_extras(argv[arg])), argv[arg],
               strlen(argv[arg]), ms_len(argv[arg]), ms_extras(argv[arg]));

    return EXIT_SUCCESS;
}
If you compile the above to example, and run it in a UTF-8 locale as
./example aaa aaä aää äää aa€ a€€ €€€ a ä € 😈
the program will output
>aaa       < (3 bytes; 3 chars; 0 bytes extra in wide chars)
>aaä       < (4 bytes; 3 chars; 1 bytes extra in wide chars)
>aää       < (5 bytes; 3 chars; 2 bytes extra in wide chars)
>äää       < (6 bytes; 3 chars; 3 bytes extra in wide chars)
>aa€       < (5 bytes; 3 chars; 2 bytes extra in wide chars)
>a€€       < (7 bytes; 3 chars; 4 bytes extra in wide chars)
>€€€       < (9 bytes; 3 chars; 6 bytes extra in wide chars)
>a         < (1 bytes; 1 chars; 0 bytes extra in wide chars)
>ä         < (2 bytes; 1 chars; 1 bytes extra in wide chars)
>€         < (3 bytes; 1 chars; 2 bytes extra in wide chars)
>😈         < (4 bytes; 1 chars; 3 bytes extra in wide chars)
If the last < does not line up with the others, it is because the font used is not accurately fixed-width: the emoticon 😈 is wider than normal characters like Ä, that's all. Blame the font.
The last character is U+1F608 SMILING FACE WITH HORNS, from the Emoticons Unicode block, in case your OS/browser/font cannot display it. In Linux, all the above > and < line up correctly in all terminals I have, including the console (the non-graphical system console), although the console font does not have a glyph for the emoticon and just shows it as a diamond.
Unlike Alter Mann's answer, this approach is portable, and makes no assumptions about what character set is actually used by the current user.
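A related caveat: mbstowcs() counts characters, not terminal columns. Some characters (many CJK characters, and on a number of systems emoji as well) are reported as two columns wide, and combining characters as zero. Where POSIX wcswidth() is available, padding can be based on columns instead. The sketch below is an addition under that assumption, not part of the approach above: wcswidth() is POSIX (XSI), not standard C, and as far as I know it is not provided by the Microsoft C libraries. The helper names ms_columns and ms_field_width are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* exposes wcswidth() in <wchar.h> */
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of terminal columns the multibyte string
   occupies in the current locale, or -1 if that cannot be determined
   (invalid sequence, nonprintable character, or out of memory). */
static int ms_columns(const char *ms)
{
    size_t   n;
    wchar_t *ws;
    int      cols;

    if (!ms)
        return 0;

    n = mbstowcs(NULL, ms, 0);
    if (n == (size_t)-1)
        return -1;

    ws = malloc((n + 1) * sizeof *ws);
    if (!ws)
        return -1;

    mbstowcs(ws, ms, n + 1);
    cols = wcswidth(ws, n);
    free(ws);
    return cols;
}

/* Hypothetical helper: field width for printf("%-*s") that pads 'ms' to
   'columns' terminal columns. printf() counts bytes, so the width is the
   target plus (bytes - columns used). Falls back to counting bytes if the
   column count is unavailable. */
static int ms_field_width(const char *ms, int columns)
{
    const int len  = ms ? (int)strlen(ms) : 0;
    int       cols = ms_columns(ms);

    if (cols < 0)
        cols = len;

    return columns + len - cols;
}
```

Usage mirrors the example program: call setlocale(LC_ALL, "") first, then printf(">%-*s<\n", ms_field_width(s, 10), s). Whether the result lines up still depends on the terminal and font agreeing with the C library about each character's width.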