How can I determine the width of a Unicode character

Question

A friend and I are programming our own console in Java, but we have problems aligning the lines correctly, because the width of Unicode characters cannot be determined exactly. As a result, not only the line containing the Unicode characters but also the following lines are shifted.

Is there a way to determine the width of Unicode characters?

Screenshots of the problem can be found below.

This is how it should look: https://abload.de/img/richtigslkmg.jpeg

This is an example in Terminal: https://abload.de/img/terminal7dj5o.jpeg

This is an example in PowerShell: https://abload.de/img/powershelln7je0.jpeg

This is an example in Visual Studio Code: https://abload.de/img/visualstudiocode4xkuo.jpeg

This is an example in Putty: https://abload.de/img/putty0ujsk.png

EDIT:

I am sorry that the question was unclear.

It is about the display width. In the example I try to determine the display width of each string so that every line has the same length. The function real_length is supposed to calculate and return the display width.

Here is the example code:

public static void main(String[] args) {
    String[] tests = {
        "Peter",
        "ＳＨＧＡＭＩ",
        "Marcel №1",
        "",
        "‍❤️‍",
        "‍❤️‍‍",
        "‍‍"
    };
    for(String test : tests) test(test);
}

public static void test(String text) {
    int max = 20;
    for(int i = 0; i < max;i++) System.out.print("#");
    System.out.println();
    System.out.print(text);
    int length = real_length(text);
    for(int i = 0; i < max - length;i++) System.out.print("#");
    System.out.println();
}

public static int real_length(String text) {
    return text.length();
}

This Question is unclear as to what exactly you mean by "width". Your extra long line poking out is caused by too many `#` characters. As to how or why you got those, perhaps showing some code here would help. — Basil Bourque, Feb 22 '22 at 22:45
@BasilBourque I agree that the question is unclear, but my understanding is that the OP is asking about determining the width of rendered font characters (i.e. glyphs) rather than the "width" of their Unicode representation. — skomisa, Feb 23 '22 at 04:37
I've written code to compute display width of arbitrary Unicode codepoints in a few languages. Should look into porting it to Java... — Shawn, Feb 23 '22 at 09:21
Hmm, don't see a way to get the East Asian Width property of a codepoint in `Character`. Could do it using ICU4J, but not just the standard library. — Shawn, Feb 23 '22 at 09:28
Some fonts are fixed width; most are not, which breaks ASCII art and table displays. I like Courier New (on microsoft systems). — Andrew, Feb 23 '22 at 12:46
@skomisa Count the number of NUMBER SIGN characters in the fourth line of first two images. You’ll find five more in the second image. That leads me to believe the issue is *not* about the width of rendered glyphs. I suspect their problem is in using the legacy `String#length` method that fails with characters beyond the [BMP](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane). — Basil Bourque, Feb 23 '22 at 20:51
[1] It would be helpful if you could state the font being used in each of your screen shots, since the font is very relevant. [2] It's a general convention here to embed screen shots within your question so that the reader does not have to click a link to view them. That would be especially useful in this case, where we want to compare screen output. [3] I tried to embed your images, but for some reason I was unable to upload them to SO, even though I can view them. — skomisa, Feb 25 '22 at 21:00
If your question is about font display width, why does the number of pound signs differ between screenshots? I have asked twice, but you have not explained. This Question is a confusing mess. Voted to close as unclear. — Basil Bourque, Feb 25 '22 at 21:25
@skomisa Sorry for the screenshots, SO did not allow me to include the screenshots. Also, the whole thing seems to be more difficult than thought, there seems to be no easy way to do what we want to do, which should work font-independent. — SeitWertz, Feb 25 '22 at 23:23
@BasilBourque It changes between screenshots because each screenshot shows the console of a different program. The programs use different fonts, which seems to lead to different widths, which in turn leads to the different shifts in the screenshots. Anyway, it seems that there is no easy way to solve the problem. thanks for the help! — SeitWertz, Feb 25 '22 at 23:31
Unfortunately I don't see how you can have a font-independent solution, because the choice of font is what determines the relative widths of the rendered characters, which in turn determines the (mis)alignment seen in your screen shots. (Of course that only applies when using Java, and you are writing to a terminal. The problem becomes simpler if you could use CSS/HTML, or Swing or JavaFX instead.) — skomisa, Feb 25 '22 at 23:52
I found a solution that allows you to calculate widths for arbitrary strings, including emojis, for a given font without needing a GUI environment. I'm still not convinced that it is a desirable path to go down, and it is (necessarily) a font-dependent solution, but the code required to calculate the width is fairly simple. See my second answer below. — skomisa, Feb 26 '22 at 02:35

score 2 · Answer 1 · answered Feb 25 '22 at 20:48

Unfortunately there is no easy solution to your deceptively simple question, for several reasons:

The width of the characters being rendered on the console might (and probably will) vary, based on the font being used. So the code would need to determine, or assume, the target font in order to calculate widths.
System.out is just a PrintStream that does not know or care about fonts and character width, so any solution has to be independent of that.
Even if you could determine the font being used on the console, and you had a way to determine the width of each character you were trying to render in that specific font, how would that help you? Knowing the variation in widths might conceivably allow you to cleverly tweak the lines being rendered so that they were aligned, but it's just as likely that it wouldn't be practicable.
A potential solution is to leave your code as it stands, and use a monospaced font on the console that println() is writing to, but there are still some major problems with that approach. First, you need to identify a font that is monospaced, but will also support all of the characters you want to render. This can be problematic when including emojis. Second, even if you identify such a font, you may find that all the glyphs for that font are not monospaced! Such a font will ensure that (say) a lowercase i and an uppercase W have the same width, but you can't also make that assumption for emojis, and you can't even assume that the "monospaced" emojis will all have the same non-standard width! Third, the font you identify (if it exists at all) would have to be available in your target environments (your PowerShell, your friend's PuTTY shell, etc.). That is not a major obstacle, but it is one more thing to worry about.
You may find that the rendered text varies by operating system. Your output may look aligned in a Linux terminal window, but that same output, using the same font, might be misaligned in a PowerShell window.

Given all that, a better approach might be to use Swing or JavaFX, where you have finer control over the output being rendered. Even if you are unfamiliar with those technologies, it wouldn't take too long to get something working, just by tweaking some sample code obtained through a search. And even allowing for the learning curve, it would still take less time than coming up with a robust solution for aligning arbitrary characters written to an arbitrary console, because that is a hard problem to solve.

Notes:

Your real_length() method merely returns the number of char values (UTF-16 code units) in the supplied Java String, which is not even the number of code points once characters outside the BMP are involved. That relates to its internal representation, and has no direct correlation with the width of the rendered characters, which is determined by the font being used.
See Emoji exceed monospace character width, breaking column alignment #100730 where Microsoft have declined to address the issue for VS Code.
For SO question Java: how to align UTF Miscellaneous Symbols in plain text, see this answer which solved a similar but simpler problem, but only for the Command Prompt window on Windows.

Thanks for your reply, we are trying to find a solution, but we will probably take another way and just leave the unicodes out of the console. — SeitWertz, Feb 25 '22 at 23:34
@SeitWertz Understood. If you don't write any emojis or special characters (i.e. Just use keyboard characters in your `println()`) calls then the problem becomes simpler. In that case, just ensuring that your console/terminal is using a monospaced font may be enough. — skomisa, Feb 25 '22 at 23:41

score 1 · Answer 2 · answered Feb 26 '22 at 02:29

Note: This answer is distinct and qualitatively different from my earlier one (which I still stand by).

There is a simple way for a console Java application (i.e. one not using a graphical user interface) to obtain the width of a String being rendered in a given font with a given font size. It requires the use of some AWT classes, which can be used even when the application has no GUI. Here's a demo using the data provided in the question:

package fixedwidth;

import java.awt.Canvas;
import java.awt.Font;
import java.awt.FontMetrics;

public class FixedWidth {

    static String[] tests = {
        "Peter", "ＳＨＧＡＭＩ", "Marcel №1", "", "‍❤️‍", "‍❤️‍‍", "‍‍"
    };
    static Font smallFont = new Font("Monospaced", Font.PLAIN, 10);
    static Font bigFont = new Font("Monospaced", Font.BOLD, 24);

    /**
     * This code is based on an answer by SO user Lonzak. 
     * See SO Answer https://stackoverflow.com/a/18123024/2985643
     */
    public static void main(String[] args) {
        FontMetrics fm1 = new Canvas().getFontMetrics(FixedWidth.smallFont);
        FixedWidth.demo(tests, fm1);

        FontMetrics fm2 = new Canvas().getFontMetrics(FixedWidth.bigFont);
        FixedWidth.demo(tests, fm2);
    }

    static void demo(String[] tests, FontMetrics fm) {
        Font f = fm.getFont();
        System.out.println("\nFont name:" + f.getName() + ", font size:" + 
                f.getSize() + ", font style:" + f.getStyle());
        for (String test : tests) {
            int width = fm.stringWidth(test);
            System.out.println("width=" + width + ", data=" + test);
        }
    }
}

The code above is based on this old answer by user Lonzak to the question Java - FontMetrics without Graphics. Those AWT classes allow you to create a Font with defined characteristics (i.e. name, size, style), and then use a FontMetrics instance to obtain the width of an arbitrary String when using that font.

Here is the output from running the code shown above:

Font name:Monospaced, font size:10, font style:0
width=30, data=Peter
width=60, data=ＳＨＧＡＭＩ
width=59, data=Marcel №1
width=10, data=
width=30, data=‍❤️‍
width=40, data=‍❤️‍‍
width=30, data=‍‍

Font name:Monospaced, font size:24, font style:1
width=70, data=Peter
width=149, data=ＳＨＧＡＭＩ
width=140, data=Marcel №1
width=25, data=
width=73, data=‍❤️‍
width=98, data=‍❤️‍‍
width=74, data=‍‍

Notes:

The first set of results shows the widths of the sample data in the question when using plain Monospaced 10 point font. The second set of results shows the widths of those same strings when using bold Monospaced 24 point font.
The widths don't look correct for some of the emojis, but that is because when the source code and output results are pasted into SO some emoji representations are changed, presumably because of the different font being used in the browser. (I was using Monospaced for both the source and the output.) Here's a screen shot of the original output, showing that the widths at least look plausible:
Even though the widths are being calculated and rendered for a fixed width font (Monospaced), it's clear that the width of the emojis cannot be predicted from the widths of normal keyboard characters.

Thanks for your idea and efforts, however we want the program to work on different consoles, which unfortunately makes your idea impractical for us as the consoles all use different fonts. — SeitWertz, Feb 26 '22 at 16:56
Understood. Rendered character (glyph) width depends on the font being used, and if you don't know that font within your application I don't see any solution. You might be able to ascertain the active console font within your application in some specific scenarios fairly easily (e.g. by using JNA and calling `GetCurrentConsoleFont()` for the Command Prompt on Windows), and then use the approach in this answer. But determining the active font for some arbitrary terminal/console on any O/S is impractical. — skomisa, Feb 26 '22 at 19:23
@skomisa I think you may have misunderstood their question. As I understood it, they’re looking for the “column width” of any Unicode codepoint (which is defined in UTR#11, and constant per codepoint, regardless of how it is rendered), not how many pixels the rendered glyph for that codepoint takes up on a given display device (which is clearly not constant and is possibly even non-deterministic). — Peter, Jul 16 '22 at 03:53
This is the method I ended up implementing. You use a character like `a` as a base and get the width ratio of each character (using a monospaced font). Then you can calculate the total length of the string. Although it is not perfect, it's better than many other options. — lepe, Sep 22 '22 at 08:19

Peter · Answer 3 · 2022-07-16T03:51:29.903

Sounds like you're looking for a Java implementation of the POSIX wcwidth and wcswidth functions, which are commonly implemented along the lines of Unicode Technical Report #11 (now published as Unicode Standard Annex #11, "East Asian Width"), which focuses exclusively on display widths for Unicode codepoints when rendered to fixed width devices (terminals and the like). The only such Java implementation that I'm aware of is in the JLine3 library, which is a lot of code to bring in for just this one class, but that may be your best bet.

Note however that that code appears to be incomplete. Unicode codepoint 0x26AA (⚪️), for example, is reported as having a width of 1 by the JLine3 code, but on every platform I've tested on (including here in the StackOverflow editor, which is a fixed width "device") that codepoint is displayed over two columns.

Good luck - this stuff is a lot more complex than it looks. The JVM's unfortunate UCS-2 history (not Sun's fault - it was bad timing with respect to the Unicode standard) only makes matters worse, and as others have said here, avoid the char and Character data types like the plague - they do not work the way you expect, and the instant code that uses those types encounters data including codepoints from the Unicode supplementary planes, it is almost certain to function incorrectly (unless the author has been especially careful - do you feel lucky?).
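
To give a feel for what a wcwidth-style function does, here is a minimal sketch in plain Java. It is not the JLine3 code and not a complete implementation of the East Asian Width data: the class name, the method names and the hand-picked ranges (including the two emoji blocks) are assumptions made for illustration only, and real terminals will disagree with it for some code points.

```java
// A rough, deliberately incomplete approximation of wcwidth/wcswidth.
// All names and the range table below are illustrative assumptions.
class ColumnWidthSketch {

    // Estimated number of terminal columns for a single code point.
    static int columnWidth(int cp) {
        // Control characters: POSIX wcwidth would report -1; this sketch counts them as 0.
        if (cp < 0x20 || (cp >= 0x7F && cp < 0xA0)) {
            return 0;
        }
        // Combining marks, variation selectors and format characters
        // (e.g. ZERO WIDTH JOINER) take no column of their own.
        int type = Character.getType(cp);
        if (type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.FORMAT) {
            return 0;
        }
        return isWide(cp) ? 2 : 1;
    }

    // Hand-picked ranges that are usually rendered two columns wide.
    static boolean isWide(int cp) {
        return (cp >= 0x1100 && cp <= 0x115F)                    // Hangul Jamo initial consonants
                || (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F) // CJK radicals through Yi
                || (cp >= 0xAC00 && cp <= 0xD7A3)                 // Hangul syllables
                || (cp >= 0xF900 && cp <= 0xFAFF)                 // CJK compatibility ideographs
                || (cp >= 0xFE30 && cp <= 0xFE6F)                 // CJK compatibility forms
                || (cp >= 0xFF00 && cp <= 0xFF60)                 // fullwidth forms
                || (cp >= 0xFFE0 && cp <= 0xFFE6)                 // fullwidth signs
                || (cp >= 0x1F300 && cp <= 0x1F64F)               // assumption: emoji blocks
                || (cp >= 0x1F900 && cp <= 0x1F9FF)               // assumption: more emoji
                || (cp >= 0x20000 && cp <= 0x3FFFD);              // supplementary ideographs
    }

    // Estimated column width of a whole string (the wcswidth analogue).
    static int displayWidth(String text) {
        return text.codePoints().map(ColumnWidthSketch::columnWidth).sum();
    }

    public static void main(String[] args) {
        System.out.println(displayWidth("Peter"));        // 5
        System.out.println(displayWidth("ＳＨＧＡＭＩ"));  // 12
    }
}
```

A function like this could stand in for real_length() in the question, but the table would have to be generated from the Unicode data files to be trustworthy, and emoji sequences joined with ZERO WIDTH JOINER are still measured code point by code point here.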

And just to emphasise my comment about complexity, I whipped up a little C program that called `wcwidth` with codepoint 0x26AA, and it returned `-1` (non-printing), so even POSIX seems to get that one wrong, or UTR11 hasn't been updated for it yet, or whatever... — Peter, Jul 16 '22 at 00:55

Basil Bourque · Answer 4 · 2022-02-23T20:57:34.263

tl;dr

Use code points rather than char. Avoid calling String#length.

input + "#".repeat( targetLength - input.codePoints().toArray().length )

Details

Your Question neglected to show any code. So I can only guess what you are doing and what might be the problem.

Avoid `char`

I am guessing that your goal is to append a certain number of NUMBER SIGN characters as needed to make a fixed-length row of text.

I am guessing the problem is that you are using the legacy char type, or its wrapper class Character. The char type has been essentially broken since Java 2. As a 16-bit value, char is physically incapable of representing most characters.
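
A small sketch of what that means in practice, using U+1F637 (FACE WITH MEDICAL MASK) as a sample character outside the BMP. The class and method names are only for illustration:

```java
// Shows how char-based counting and code-point-based counting diverge.
class CharVersusCodePoint {

    // Number of Unicode code points in the string, as opposed to UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Build the string from its code point so the source file stays pure ASCII.
        String mask = new String(Character.toChars(0x1F637));

        System.out.println(mask.length());                          // 2 (UTF-16 code units)
        System.out.println(codePointLength(mask));                  // 1 (code points)
        System.out.println(Character.isSurrogate(mask.charAt(0)));  // true: half a character
    }
}
```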

Use code point numbers

Instead, use code point integer numbers when working with individual characters. A code point is the number permanently assigned to each of the over 140,000 characters defined in Unicode.

A variety of code point related methods have been added to various classes in Java 5+: String, StringBuilder, Character, etc.

Here we use String#codePoints to get an IntStream of code points, one element for each character in the source. And we use StringBuilder#appendCodePoint to collect the code points for our final result string.

final int targetLength = 10;
final int fillerCodePoint = "#".codePointAt( 0 ); // Annoying zero-based index counting.
String input = "😷🤠🤡";

int[] codePoints = input.codePoints().toArray();
StringBuilder stringBuilder = new StringBuilder();
for ( int index = 0 ; index < targetLength ; index++ )
{
    if ( index < codePoints.length )
    {
        stringBuilder.appendCodePoint( codePoints[ index ] );
    } else
    {
        stringBuilder.appendCodePoint( fillerCodePoint );
    }
}

Or, shorten that for loop with the use of a ternary operator.

for ( int index = 0 ; index < targetLength ; index++ )
{
    int codePoint = ( index < codePoints.length ) ? codePoints[ index ] : fillerCodePoint;
    stringBuilder.appendCodePoint( codePoint );
}

Report result.

System.out.println( Arrays.toString( codePoints ) );
String output = stringBuilder.toString();
System.out.println( "output = " + output );

[128567, 129312, 129313]

output = 😷🤠🤡#######

There is likely a clever way to write that code more briefly with streams and lambdas, but I cannot think of one at the moment.

And, one could cleverly use the String#repeat method in Java 11+.

String output = input + "#".repeat( targetLength - input.codePoints().toArray().length ) ;
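
One caveat with that one-liner: String#repeat throws an IllegalArgumentException when the count is negative, which happens as soon as the input is longer than the target length. A minimal sketch of a guarded version follows; the class name and the padRight helper are hypothetical, and like the code above it counts code points, not display columns.

```java
// Pads a string with '#' up to a target number of code points (Java 11+).
class PadToCodePoints {

    // Hypothetical helper: returns input unchanged if it is already long enough.
    static String padRight(String input, int targetLength) {
        int count = input.codePointCount(0, input.length());
        int missing = Math.max(0, targetLength - count); // repeat() rejects negative counts
        return input + "#".repeat(missing);
    }

    public static void main(String[] args) {
        System.out.println(padRight("Peter", 10));            // Peter#####
        System.out.println(padRight("longer than ten", 10));  // longer than ten
    }
}
```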

Thanks for your answer, but I need the display width of one character. Via the CodePoints I get the correct value, but the value tells me nothing about how wide it is displayed in the console. — SeitWertz, Feb 24 '22 at 19:50
@SeitWertz So why do your screenshots show a different number of pound signs on the 4th line? — Basil Bourque, Feb 24 '22 at 20:33
The Unicodes have different display widths, some are 2 characters wide or even wider, others have no character width at all. This leads to the fact that when outputting the characters the line is wider or narrower than it should be, so that more characters are ejected than there is space in a line, which in turn leads to shifts in the lines below. The problem is that we can't find a way to determine the display width of the Unicodes to adjust the line width perfectly. — SeitWertz, Feb 24 '22 at 21:05
@SeitWertz No, your second image shows five *additional* pound signs on the fourth line, compared to first image. What does that have to do with the width of a glyph? — Basil Bourque, Feb 24 '22 at 21:41
This is because the first image shows an example of how it should be; the others show how it is output by the example program in the various console software. — SeitWertz, Feb 25 '22 at 09:33
