The width specifier in printf does not work properly with accented characters

Question

I'm trying to format the output of some strings in C using the width specifier of printf. However, I'm having trouble getting the behaviour I want. It seems that every time printf encounters the character å, ä or ö, the width reserved for the string gets one position smaller.

A code-snippet to illustrate:

#include <stdio.h>

int main(void)
{
  printf(">%-10s<\n", "aoa");
  printf(">%-10s<\n", "aäoa");
  printf(">%-10s<\n", "aäoöa");
  printf(">%-10s<\n", "aäoöaå");

  return 0;
}

Output in my Ubuntu Linux bash shell:

>aoa       <
>aäoa     <
>aäoöa   <
>aäoöaå <

I'm looking for advice on how to deal with this. What I want is for all the strings in the snippet above to print within a space-padded, 10-character-wide field, like so:

>aoa       <
>aäoa      <
>aäoöa     <
>aäoöaå    <

I'd also appreciate any insight into why this is happening, or feedback on whether this is an issue with other setups.

Are you using UTF-8 encoding? Those characters require 2 bytes, and `printf` might not be UTF-8 aware. — user694733, Feb 16 '16 at 08:36
http://stackoverflow.com/questions/15528359/printing-utf-8-strings-with-printf-wide-vs-multibyte-string-literals — 123, Feb 16 '16 at 08:39

score 7 · Answer 1 · answered Feb 16 '16 at 08:44

Use wide character strings and wprintf:

#include <wchar.h>
#include <locale.h>

int main(void)
{
  // seems to be needed for the correct output encoding
  setlocale(LC_ALL, "");

  wprintf(L">%-10ls<\n", L"aoa");
  wprintf(L">%-10ls<\n", L"aäoa");
  wprintf(L">%-10ls<\n", L"aäoöa");
  wprintf(L">%-10ls<\n", L"aäoöaå");

  return 0;
}

Flopp

I used literal strings in the example to keep it brief. In my actual problem I'm getting the strings from a struct. I suppose I'd have to convert these strings to wide character-strings with [mbstowcs()](http://linux.die.net/man/3/mbstowcs) or something? I mean I obviously can't do `wprintf(L">%-10ls<\n", Lsome->member);` – Erik Göök Feb 16 '16 at 12:10
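The conversion asked about in the comment above could look roughly like this. It is a minimal sketch, not code from the thread: `to_wide` is a hypothetical helper name, and it relies on `mbstowcs(NULL, s, 0)` returning the required length, which POSIX specifies and common C libraries support. The input is interpreted in the current locale's encoding, so `setlocale(LC_ALL, "")` should be called before using it.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string (in the current locale's encoding) to a newly
 * allocated wide string. Returns NULL if the input is not valid in the
 * current locale or if allocation fails. The caller must free() the result.
 * Hypothetical helper for illustration. */
wchar_t *to_wide(const char *mbs)
{
    /* First pass: count the wide characters, excluding the terminator. */
    size_t n = mbstowcs(NULL, mbs, 0);
    if (n == (size_t)-1)
        return NULL;                /* invalid multibyte sequence */

    wchar_t *ws = malloc((n + 1) * sizeof *ws);
    if (ws == NULL)
        return NULL;

    /* Second pass: convert, including the terminating null. */
    mbstowcs(ws, mbs, n + 1);
    return ws;
}
```

With a struct member `some->member`, one would call `wchar_t *ws = to_wide(some->member);`, print it with `wprintf(L">%-10ls<\n", ws);` if it is not NULL, and then `free(ws)`. Note that a stream should not mix `printf` and `wprintf` calls, since the first call fixes the stream's orientation.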

David Ranieri · Accepted Answer · 2016-02-16T09:12:07.250

6

why this is happening?

Take a look at Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".

As an alternative to wide chars, and assuming UTF-8, you can use this function to count the UTF-8 continuation bytes (the extra bytes that each non-ASCII character occupies beyond its first), then add the result to the width specifier of printf:

#include <stdio.h>

int func(const char *str)
{
    int len = 0;

    while (*str != '\0') {
        if ((*str & 0xc0) == 0x80) {
            len++;
        }
        str++;
    }
    return len;
}

int main(void)
{
    printf(">%-*s<\n", 10 + func("aoa"), "aoa");
    printf(">%-*s<\n", 10 + func("aäoa"), "aäoa");
    printf(">%-*s<\n", 10 + func("aäoöa"), "aäoöa");
    printf(">%-*s<\n", 10 + func("aäoöaå"), "aäoöaå");
    return 0;
}

Output:

>aoa       <
>aäoa      <
>aäoöa     <
>aäoöaå    <
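The same adjustment can be wrapped in small helpers so that call sites do not repeat the arithmetic. This is only a sketch under the same assumptions as the answer: the input is valid UTF-8 and every character occupies one terminal column. The names `utf8_continuation_bytes`, `utf8_field_width` and `print_padded` are hypothetical.

```c
#include <stdio.h>

/* Count UTF-8 continuation bytes, i.e. bytes of the form 10xxxxxx.
 * Each multibyte character contributes (its byte length - 1) such bytes. */
int utf8_continuation_bytes(const char *str)
{
    int count = 0;

    for (; *str != '\0'; str++) {
        if (((unsigned char)*str & 0xC0) == 0x80)
            count++;
    }
    return count;
}

/* printf counts bytes, not characters, so the field width handed to '*'
 * must be the wanted column count plus the bytes that print nothing. */
int utf8_field_width(const char *str, int columns)
{
    return columns + utf8_continuation_bytes(str);
}

/* Print str left-aligned in a field of `columns` columns between > and <. */
void print_padded(const char *str, int columns)
{
    printf(">%-*s<\n", utf8_field_width(str, columns), str);
}
```

Each line of the original `main` then reduces to a call such as `print_padded("aäoa", 10);`.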

edited Feb 16 '16 at 09:12

answered Feb 16 '16 at 08:54

Even though I suppose using wprintf is more reasonable in the long run I ended up using your suggestion. The link was gold. Marking this as accepted. – Erik Göök Feb 16 '16 at 12:37
Note that this method only works with characters whose display width is 1. – Cyker Mar 16 '18 at 14:18
@Cyker, you are right, but this has more to do with the terminal's fonts; even using wide chars, with `wprintf(L">%-10ls<\n", L"aäoöa包");` the output is not aligned with the other lines. – David Ranieri Mar 17 '18 at 10:07
@KeineLust Almost all fonts display CJK chars in double-width. Beyond those chars, there are symbols that are displayed in either single- or double-width, depending on the font used. I actually had a similar problem and considered this method until I needed to handle CJK chars. But OP asked for accented chars. I'm no language expert but I think they are probably single-width in all major languages. – Cyker Mar 18 '18 at 04:36
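For the double-width case raised in these comments, POSIX systems provide `wcswidth()`, which reports how many terminal columns a wide string is expected to occupy. The sketch below is not from the thread: `field_width_for_columns` is a hypothetical name, the `_XOPEN_SOURCE` define is what glibc needs to expose `wcswidth()`, and the result depends on the current locale (call `setlocale(LC_ALL, "")` first) and on the terminal actually rendering characters at the widths the C library reports.

```c
#define _XOPEN_SOURCE 700   /* must precede all includes; exposes wcswidth() */
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Return the byte-based field width to pass to printf's '*' so that `mbs`
 * is padded to `columns` terminal columns. Falls back to `columns` when the
 * display width cannot be determined (invalid input, non-printable
 * characters, or allocation failure). Hypothetical helper for illustration. */
int field_width_for_columns(const char *mbs, int columns)
{
    size_t n = mbstowcs(NULL, mbs, 0);      /* wide-character count */
    if (n == (size_t)-1)
        return columns;

    wchar_t *ws = malloc((n + 1) * sizeof *ws);
    if (ws == NULL)
        return columns;
    mbstowcs(ws, mbs, n + 1);

    int cols = wcswidth(ws, n);             /* -1 if any char is non-printable */
    free(ws);
    if (cols < 0)
        return columns;

    /* printf pads by bytes: all bytes of the string plus the missing columns. */
    int padding = (columns > cols) ? columns - cols : 0;
    return (int)strlen(mbs) + padding;
}
```

It would be used as `printf(">%-*s<\n", field_width_for_columns(s, 10), s);`. Whether the closing `<` then lines up for CJK text or emoji still depends on the font and terminal, as the comments above point out.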

score 3 · Answer 3 · edited May 23 '17 at 12:10

Alter Mann's accepted answer is along the correct lines, except that one should not hardcode a custom function to count the bytes in a multibyte string that do not encode a visible character. Instead, localize the program with setlocale(LC_ALL, "") or similar, and use strlen(str) - mbstowcs(NULL, str, 0) to count the number of bytes in the string that do not encode a visible character.

setlocale() is standard C (C89, C99, C11), but also defined in POSIX.1. mbstowcs() is standard C99 and C11, and also defined in POSIX.1. Both are also implemented in Microsoft C libraries, so they do work basically everywhere.

Consider the following example program, that prints C strings specified on the command line:

#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <stdio.h>

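/* Counts the number of (visible) characters in a string.
   Returns 0 for NULL or for a string that is invalid in the current locale. */
static size_t ms_len(const char *const ms)
{
    size_t chars;

    if (!ms)
        return 0;
    chars = mbstowcs(NULL, ms, 0);
    return (chars == (size_t)-1) ? 0 : chars;
}

/* Number of bytes that do not generate a visible character in a string.
   Returns 0 for NULL or for a string that is invalid in the current locale. */
static size_t ms_extras(const char *const ms)
{
    size_t chars;

    if (!ms)
        return 0;
    chars = mbstowcs(NULL, ms, 0);
    return (chars == (size_t)-1) ? 0 : strlen(ms) - chars;
}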

int main(int argc, char *argv[])
{
    int arg;

    /* Default locale */
    setlocale(LC_ALL, "");

    for (arg = 1; arg < argc; arg++)
        printf(">%-*s< (%zu bytes; %zu chars; %zu bytes extra in wide chars)\n",
               (int)(10 + ms_extras(argv[arg])), argv[arg],
               strlen(argv[arg]), ms_len(argv[arg]), ms_extras(argv[arg]));

    return EXIT_SUCCESS;
}

If you compile the above to a program named example, and run

./example aaa aaä aää äää aa€ a€€ €€€ a ä € 😈

the program will output

>aaa       < (3 bytes; 3 chars; 0 bytes extra in wide chars)
>aaä       < (4 bytes; 3 chars; 1 bytes extra in wide chars)
>aää       < (5 bytes; 3 chars; 2 bytes extra in wide chars)
>äää       < (6 bytes; 3 chars; 3 bytes extra in wide chars)
>aa€       < (5 bytes; 3 chars; 2 bytes extra in wide chars)
>a€€       < (7 bytes; 3 chars; 4 bytes extra in wide chars)
>€€€       < (9 bytes; 3 chars; 6 bytes extra in wide chars)
>a         < (1 bytes; 1 chars; 0 bytes extra in wide chars)
>ä         < (2 bytes; 1 chars; 1 bytes extra in wide chars)
>€         < (3 bytes; 1 chars; 2 bytes extra in wide chars)
>😈         < (4 bytes; 1 chars; 3 bytes extra in wide chars)

If the last < does not line up with the others, it is because the font used is not accurately fixed-width: the emoticon is wider than normal characters like Ä, that's all. Blame the font.

The last character is U+1F608 SMILING FACE WITH HORNS, from the Emoticons Unicode block, in case your OS/browser/font cannot display it. In Linux, all the above > and < line up correctly in all terminals I have, including the console (the non-graphical system console), although the console font does not have a glyph for the emoticon and just shows it as a diamond.

Unlike Alter Mann's answer, this approach is portable, and makes no assumptions about what character set is actually used by the current user.

Nice answer, you are absolutely right about portability, my function assumes UTF8. — David Ranieri, Feb 16 '16 at 19:37
