
Given a sequence of Unicode characters, how can I obtain a string of whitespace characters that has the same width (at least in monospace fonts that display each character with single or double width of the characters from Basic Latin)?

Examples

For example, given the string \u0061\u0020\u0062\u0020\u0063, which has five characters and looks like this:

a b c

('a', space, 'b', space, 'c'), I would like to obtain a string consisting of just five spaces:

\u0020\u0020\u0020\u0020\u0020

and given \u6b22\u8fce\u5149\u4e34, which looks like

欢迎光临

I'd want to obtain a string containing four ideographic spaces: \u3000\u3000\u3000\u3000.

Background

Here is an example where this matters: error reporting in compilers for languages that support Unicode. Suppose that we have some hypothetical programming language PL (could be Python, Java, Scala, Ruby ...) that has string literals and parentheses. Suppose that this is an invalid snippet of PL-code, because it contains an unmatched parenthesis:

"stringLiteral")

If we tried to compile it, the compiler of PL could produce an error message that looks as follows:

:1: error: ';' expected but ')' found.
   "stringLiteral")
                  ^

Note the "phantom string" followed by '^' in the last line: it points exactly at the unmatched closing parenthesis.

If I try the same with CJK characters, here is what I get:

:1: error: ';' expected but ')' found.
   "欢迎光临欢迎光临欢迎光临欢迎光临欢迎光临欢迎")
                           ^

Note that the "phantom string" in the last line now consists of ordinary Latin spaces, so in the console the '^' looks as if it's somewhere in the middle of the string of CJK characters instead of at the parenthesis.
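Both outputs above are consistent with a pointer line that contains one ordinary space per UTF-16 code unit before the error position. The following is only a minimal sketch of that assumption; the object PointerDemo and the helper naivePointer are hypothetical names and are not taken from any real compiler:

```scala
object PointerDemo {
  // Hypothetical helper (an assumption, not real compiler code): pads with
  // one ordinary space (U+0020) per UTF-16 code unit before errorIndex.
  def naivePointer(errorIndex: Int): String = " " * errorIndex + "^"

  def main(args: Array[String]): Unit = {
    val latin = "   \"stringLiteral\")"
    val cjk   = "   \"\u6b22\u8fce\u5149\u4e34\")" // the four CJK characters from the example

    for (line <- Seq(latin, cjk)) {
      println(line)
      // indexOf counts code units, not display cells, so the pointer
      // falls short whenever a character is drawn wider than a space.
      println(naivePointer(line.indexOf(')')))
    }
  }
}
```

For the Latin line, the number of code units and the number of display cells coincide, so the pointer lands under the parenthesis. For the CJK line, each ideograph is counted once but typically drawn two cells wide.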

If I try the same with Croatian characters:

:1: error: ';' expected but ')' found.
   "DŽDždžLJLjljNJNjnj")
              ^

the '^' pointer also ends up at a seemingly completely wrong position, because those special Croatian characters are much wider than ordinary spaces.

All of these examples produce similar results in languages such as Python, Java, Scala, and Ruby: just copy-paste the snippet " y⃝e҈s҉ ") or the snippet "临欢迎光临欢迎") into the interactive shell and see where the ^ ends up.

Solution attempt

Here is a naïve attempt to generate "phantom"-strings in Scala. There is a method Character.isIdeographic. I can use it to define a phantom method by mapping every ideographic character to \u3000, and all other characters to ' ' (ordinary space).

def phantom(s: String) = 
  s.map(c => if (Character.isIdeographic(c)) '\u3000' else ' ')

In simple cases, it works. For example, if I define a string

val s = "foo欢迎光临欢迎bar光临欢baz"

and then print the string followed by a vertical bar |, a line break, and then phantom(s) followed by another vertical bar |,

println(s + "|\n" + phantom(s) + "|")

then I obtain:

foo欢迎光临欢迎bar光临欢baz|
                  |

and the vertical bars at the end of the strings line up perfectly, because phantom(s) is now

\u0020\u0020\u0020\u3000\u3000\u3000\u3000\u3000\u3000\u0020\u0020\u0020\u3000\u3000\u3000\u0020\u0020\u0020

that is:

  • three ordinary spaces corresponding to "foo"
  • six ideographic spaces corresponding to the "欢迎光临欢迎" piece
  • then again three spaces corresponding to "bar"
  • ...

and so on.

However, if I try the same with Croatian characters, I again get a mess:

DŽDždžLJLjljNJNjnj|
         |

(vertical bars don't line up).
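One partial refinement, offered only as a sketch: iterate over code points instead of UTF-16 code units, and emit nothing for combining marks, on the assumption that the console draws them on top of the preceding character. The names Phantom and phantomByCodePoint, and the choice to treat the general categories Mn (non-spacing mark) and Me (enclosing mark) as zero width, are assumptions of this sketch and not a width rule defined by Unicode. It does nothing for the Croatian digraphs, whose apparent width depends on the font.

```scala
object Phantom {
  // Sketch under stated assumptions: ideographs take the width of U+3000,
  // combining/enclosing marks take no width, everything else takes one space.
  def phantomByCodePoint(s: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i) // full code point, also for surrogate pairs
      val tpe = Character.getType(cp)
      val zeroWidth =
        tpe == Character.NON_SPACING_MARK || tpe == Character.ENCLOSING_MARK
      if (!zeroWidth)
        sb.append(if (Character.isIdeographic(cp)) '\u3000' else ' ')
      i += Character.charCount(cp) // advance by 1 or 2 code units
    }
    sb.toString
  }
}
```

With this variant, a base letter followed by a combining mark yields a single space, and an ideograph outside the Basic Multilingual Plane yields a single ideographic space instead of two ordinary ones.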

Question

Does Unicode define any properties that would allow me to generate robust "phantom" strings of the same width?

Andrey Tyukin
  • The word choice "phantom" was inspired by commands `\vphantom` and `\hphantom` in LaTeX. – Andrey Tyukin Sep 02 '18 at 14:53
  • Do you have to deal with combining characters? – curiousguy Sep 02 '18 at 15:04
  • @curiousguy I would say yes (or rather " y⃝e҈s҉" - a valid Java string literal, by the way). I'd actually prefer a solution that works uniformly for all kinds of characters. But I'd also be thankful for partial solutions that degrade somewhat gracefully - e.g. my `phantom` above at least tries to do something meaningful with some CJK characters; it's not optimal, but it's a little better than just duplicating a space `string.length` times. – Andrey Tyukin Sep 02 '18 at 15:16
  • ... and it's not a purely academic question: error output of a pretty recent Java compiler for an error that occurs at the line `String x = " y⃝e҈s҉ ";(` looks quite broken: the little `^`-symbol that is supposed to point at the mismatched parenthesis is at a completely wrong position: three spaces away from where it's supposed to be. – Andrey Tyukin Sep 02 '18 at 15:23
  • Somewhat related, but not exactly an answer: [1](https://stackoverflow.com/a/22226141/2707792) [2](https://unix.stackexchange.com/questions/245013/get-the-display-width-of-a-string-of-characters). Given that there are libraries like [this](https://github.com/janlelis/unicode-display_width) one, and the author claims that *"There is no single standard."* and talks about *"hand-picked adjustments"*, it might well be that there simply is no simple obvious solution that would work everywhere. Note that I don't care about the actual length: I just want to make sure that it's *the same* length. – Andrey Tyukin Sep 03 '18 at 00:04
