%s minimum field width in the presence of unicode characters

Question

So, here's my problem:

If someone wants to output visually aligned strings using printf, they'll typically use %<n>s (where <n> is the minimum field width). This works just fine unless one of the strings contains non-ASCII Unicode characters encoded as UTF-8.

Take this very basic example:

#include <stdio.h>

int main(void)
{
    char* s1 = "\u03b1\u03b2\u03b3";
    char* s2 = "abc";

    printf("'%6s'\n", s1);
    printf("'%6s'\n", s2);

    return 0;
}

which will produce the following output:

'αβγ'
'   abc'

This isn't all that surprising: printf counts bytes, not glyphs, so it has no way of knowing that \u03b1 (which is encoded as two bytes in UTF-8) only produces a single glyph on the output device (assuming UTF-8 is supported).
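To make the byte/glyph mismatch concrete, here is a minimal sketch that counts code points instead of bytes. The helper name `utf8_codepoints` is made up for illustration, and it assumes the input is valid UTF-8; it simply skips continuation bytes (those of the form 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the question): counts the code points in a
 * UTF-8 string by counting every byte that is NOT a continuation byte
 * (bit pattern 10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For "αβγ" this reports 3 code points while strlen reports 6 bytes, and it is the 6 that printf compares against the field width. Note that code points are still only an approximation of display columns: wide or combining characters would need something like POSIX wcswidth.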

Now assume that I generate s1 and s2, but have no control over the format string used to output those variables. My current understanding is that nothing I could possibly do to s1 would fix this, because I'd have to somehow fool printf into thinking that s1 is shorter than it actually is. However, since I also control s2, my current solution is to add a non-printing character to s2 for each Unicode character in s1, which would look something like this:

#include <stdio.h>

int main(void)
{
    char* s1 = "\u03b1\u03b2\u03b3";
    char* s2 = "abc\x06\x06\x06";

    printf("'%6s'\n", s1);
    printf("'%6s'\n", s2);

    return 0;
}

This will produce the desired output (even though the actual width no longer corresponds to the specified field width, but I'm willing to accept that):

'αβγ'
'abc'

For context:

The example above only illustrates the Unicode problem. My actual code prints numbers with SI prefixes, only one of which (µ) is a Unicode character. I would therefore generate strings containing at most one prefix character, either ASCII or Unicode (which is why I can accept the resulting offset in the field width).

So, my questions are:

Is there a better solution for this?
Is \x06 (ACK) a sensible choice (i.e. a character without undesired side-effects)?
Can you think of any problems with this approach?

Really doubt there's a better solution to this. The character to choose and whether or not this will cause problems only depends on where the output is supposed to go. — Marco Bonelli, Apr 30 '20 at 10:49

score 0 · Answer 1 · answered Apr 30 '20 at 11:03


Since the only non-ASCII character is µ, I believe there is a solution. I've assumed µ is \u00b5 (the micro sign); replace it if your strings use a different code point.

I've written a small function, myPrint, which takes the string and the width n as input. You should be able to modify the code below to fit your needs.

The function searches for all occurrences of µ and increases the field width by one for each of them.

#include <stdio.h>

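/* Prints string right-aligned in a field of n columns. The field is widened
   by one byte for every µ (two bytes in UTF-8, one column on screen). */
void myPrint(const char* string, int n)
{
  const char* mu = "\u00b5";  /* UTF-8 encoding of the micro sign */
  for(int i=0;string[i]!='\0';i++)
  {
    /* both bytes of the encoded µ must match */
    if(string[i]==mu[0] && string[i+1]==mu[1])
      n++;
  }

  printf("%*s",n,string);
}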


int main(void)
{
    char* s1 = "ab\u00b5";
    char* s2 = "abc";

    myPrint(s1,6);
    printf("\n");
    myPrint(s2,6);
    printf("\n");

    return 0;
}
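The same idea can be extended beyond µ. The following is only a sketch under the assumption that the string is valid UTF-8 and that each code point occupies one column; `utf8_field_width` and `format_aligned` are hypothetical names, not part of the answer above. It widens the field by one for every continuation byte rather than looking for one specific character, and writes into a buffer so the result can be inspected.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns the width to pass to %*s so that s fills
 * `columns` code points. Every UTF-8 continuation byte (10xxxxxx) is a byte
 * that printf counts but that adds no glyph, so it widens the field by one. */
int utf8_field_width(const char *s, int columns)
{
    assert(s != NULL);
    int width = columns;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) == 0x80)
            width++;
    }
    return width;
}

/* Formats s right-aligned into buf; returns what snprintf returns,
 * i.e. the number of bytes the full result needs (excluding the NUL). */
int format_aligned(char *buf, size_t size, const char *s, int columns)
{
    assert(buf != NULL);
    return snprintf(buf, size, "%*s", utf8_field_width(s, columns), s);
}
```

With a width of 6, "abµ" comes out as 7 bytes (three spaces plus four bytes of text) and "abc" as 6 bytes, so both take six columns on a UTF-8 terminal. Like myPrint, this still requires control over the printf call.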

answered Apr 30 '20 at 11:03

Abhay Aravinda


Unfortunately, that violates my constraint of not being able to modify the format string. My actual use case looks something like this: `printf("x=%sm\n", si_print(x))` where `si_print()` is written by me, but the code calling `printf()` isn't. – Felix G Apr 30 '20 at 11:08
@FelixG Sorry. But I still feel it's problematic to use non-printable characters. For instance, if the output is redirected to a file, the file stores those characters, and a viewer may display odd-looking rectangular symbols where the non-printables are. Another problem I can see is if someone tries to modify the string. For instance, `printf("a\x06");` prints `a`. But if someone decides to append a character to the string, say 'a', `printf("a\x06a");` prints `aj` instead of `aa`, because `\x06a` is parsed as a single hex escape (0x6a, which is 'j'). Since you mentioned SI units being used, I feel this is likely to occur. – Abhay Aravinda Apr 30 '20 at 12:52
The output being redirected to a file is definitely a concern; I haven't tried that yet. As for the other issue, while `printf("a\x06a")` does output `aj`, `printf("a%sa", "\x06")` (which is closer to my use case) doesn't. I definitely agree, though, that using non-printable characters feels risky, which is why I posted this question. – Felix G Apr 30 '20 at 13:25
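For anyone puzzled by the `aj` result: in C, a `\x` escape consumes every hex digit that follows it, so `"\x06a"` is one byte with value 0x6a. A small sketch of the effect and of the usual workaround (splitting the literal); the names `greedy`, `split` and `check_hex_escapes` are only illustrative.

```c
#include <assert.h>
#include <string.h>

/* "\x06a" is ONE escape: \x takes all following hex digits, giving 0x6a ('j'). */
static const char greedy[] = "a\x06a";

/* Adjacent literals are concatenated only after escapes have been converted,
 * so this really is the three bytes 'a', 0x06, 'a'. */
static const char split[] = "a\x06" "a";

/* Illustrative check of both observations above. */
void check_hex_escapes(void)
{
    assert(strcmp(greedy, "aj") == 0);
    assert(strlen(split) == 3);
    assert(split[0] == 'a' && split[1] == '\x06' && split[2] == 'a');
}
```

Passing the control character as a separate `%s` argument, as in the comment above, avoids the problem for the same reason: the escape ends where its own literal ends.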
Okay, I've just tested what it would look like after writing to a file, and the results are unfortunately very messy (as you expected). – Felix G Apr 30 '20 at 13:42
@FelixG After taking everything into consideration, I feel the best thing to do is to use `\u00a0`, which is the non-breaking space. It is non-ASCII, so you can use it along with the normal space `' '` to adjust the size. Since it's a printable character, it shouldn't be much of an issue even though it prints a space. The only problems are that it prevents word wrap and that someone may mistake it for a normal space, causing problems. But looking at the other options, I feel it's the second-best thing to do (the best being to modify the printf call directly). Does your use case support using nbsp? – Abhay Aravinda Apr 30 '20 at 14:28
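One possible reading of this suggestion, sketched under the assumption that each generated string contains at most one two-byte character (such as µ): prefix strings that contain µ with a normal space (one byte, one column) and pure-ASCII strings with a non-breaking space (two bytes, one column). Every result then has exactly one more byte than it has columns, so `%<n>s` fields line up with a constant offset of one. `pad_si` and `NBSP` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NBSP "\xc2\xa0"  /* U+00A0 encoded as UTF-8: two bytes, one column */

/* Hypothetical helper: writes text into buf with one leading padding
 * character chosen so that (bytes - columns) is always 1.
 * Assumes text is valid UTF-8 with at most one two-byte character. */
int pad_si(char *buf, size_t size, const char *text)
{
    assert(buf != NULL && text != NULL);
    int has_multibyte = 0;
    for (const char *p = text; *p != '\0'; p++) {
        if (((unsigned char)*p & 0xC0) == 0x80)  /* continuation byte found */
            has_multibyte = 1;
    }
    /* µ already carries the extra byte, so it gets a plain space;
     * ASCII-only text gets its extra byte from the NBSP. */
    return snprintf(buf, size, "%s%s", has_multibyte ? " " : NBSP, text);
}
```

With this, "1.5µ" and "1.5k" both become 6 bytes and 5 columns. The padding is visible as a space, and the caveats from the comment (no word wrap at the NBSP, easy to mistake for a normal space) still apply.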
@FelixG I also found out about `\u200c` (the zero-width non-joiner). There's a whole list of space characters at https://stackoverflow.com/questions/8515365/are-there-other-whitespace-codes-like-nbsp-for-half-spaces-em-spaces-en-space – Abhay Aravinda Apr 30 '20 at 18:53
