In my terminal these are equally wide:

ヌー平行
parallel
æøåüäöûß

[Images: same width of "ヌー平行" and "parallel"; same width of "ヌ" and "p"]

I have managed to get Perl to report a length of 8 for the last two lines, but it reports the length of the first line as 4. Is there a way to determine that ヌ is twice as wide as ø?

AmigoJack
Ole Tange
  • The answer surely depends on what font you are using. – mob Mar 07 '23 at 17:24
  • Does this answer your question? [How to determine whether a unicode character is fullwidth or halfwidth in Perl](https://stackoverflow.com/questions/70834053/how-to-determine-whether-a-unicode-character-is-fullwidth-or-halfwidth-in-perl) – Shawn Mar 07 '23 at 17:25
  • @mob Does it? All the fixed-width fonts I have tried act the same way. – Ole Tange Mar 07 '23 at 17:26
  • @Ole It would be more correct to say that this depends on your terminal's font rendering engine, which often overrides the font's spacing to force fixed-width text. Reasonable terminals will display full-width CJK chars across two columns, but I'm not aware of any standard that would require this. – amon Mar 07 '23 at 17:27
  • Relevant standard: https://www.unicode.org/reports/tr11/tr11-40.html – Mark Tolonen Mar 07 '23 at 17:28
  • This has nothing to do with UTF-8 or Unicode in particular, but [halfwidth and fullwidth forms](https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms). – AmigoJack Mar 10 '23 at 11:00

1 Answer

You can use Text::CharWidth's mbswidth. It uses POSIX's wcwidth.

use v5.14;
use warnings;

use utf8;
use open ':std', ':encoding(UTF-8)';

use Encode             qw( encode_utf8 );
use Text::CharWidth    qw( mbswidth );
use Unicode::Normalize qw( NFC NFD );

my @tests = (
   [ "ASCII",     "parallel",      8 ],
   [ "NFC",       NFC("æøåüäöûß"), 8 ],
   [ "NFD",       NFD("æøåüäöûß"), 8 ],
   [ "EastAsian", "ヌー平行",      8 ],
);

for ( @tests ) {
   my ( $name, $s, $expect ) = @$_;
   my $length = length( $s );
   my $got = mbswidth( encode_utf8( $s ) );
   printf "%-9s length=%2d expect=%d got=%d\n", 
      $name, $length, $expect, $got;
}

Output:

ASCII     length= 8 expect=8 got=8
NFC       length= 8 expect=8 got=8
NFD       length=13 expect=8 got=8
EastAsian length= 4 expect=8 got=8

Note that mbswidth expects a string encoded using the locale's encoding, which I assumed was UTF-8 in two places in the above program.


If you want to know the number of columns a string should take according to Unicode, this is covered by Unicode Standard Annex #11. Note that the answer may depend on whether one is in an East Asian context or not. For example, U+03A6 GREEK CAPITAL LETTER PHI ("Φ") takes up two columns in an East Asian context, while it takes up only one otherwise.
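As a rough illustration of the Annex #11 approach, here is a minimal pure-Perl sketch built on the `East_Asian_Width` property that Perl's regex engine exposes. The name `columns_by_eaw` and its `east_asian` option are hypothetical, not part of any module. Counting nonspacing marks, enclosing marks, and format characters as zero columns is my own assumption rather than something the annex mandates, and control characters and emoji sequences are not handled. A terminal may well disagree with the result.

```perl
use v5.14;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';

use Unicode::Normalize qw( NFD );

# Hypothetical helper: estimate display columns from East_Asian_Width.
# Pass east_asian => 1 to count Ambiguous characters as two columns.
sub columns_by_eaw {
   my ( $s, %opt ) = @_;
   my $columns = 0;
   for my $ch ( split //, $s ) {
      # Assumption: combining marks and format characters take no column.
      next if $ch =~ /[\p{Mn}\p{Me}\p{Cf}]/;

      if ( $ch =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/ ) {
         $columns += 2;
      }
      elsif ( $ch =~ /\p{East_Asian_Width=Ambiguous}/ ) {
         $columns += $opt{east_asian} ? 2 : 1;
      }
      else {
         $columns += 1;
      }
   }
   return $columns;
}

say columns_by_eaw( "parallel" );              # 8
say columns_by_eaw( NFD("æøåüäöûß") );         # 8
say columns_by_eaw( "ヌー平行" );               # 8
say columns_by_eaw( "Φ" );                     # 1
say columns_by_eaw( "Φ", east_asian => 1 );    # 2
```

Several of the Latin letters in the test string (æ, ø, ü, ß) are themselves classified as Ambiguous, so passing `east_asian => 1` would widen them too. That is the context dependence described above.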

ikegami
  • `mbswidth("ヌ")` == 2, but `mbswidth("ヌ\t")` == 1. Why? – Ole Tange Mar 07 '23 at 22:44
  • `wcwidth` returns -1 for "errors" (non-printable characters), and Text::CharWidth's `mbswidth` doesn't treat that case specially, so you end up with 2 + -1 = 1. You could submit a ticket suggesting alternative behaviour, such as returning `undef`. – ikegami Mar 08 '23 at 06:13
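One possible workaround for the `2 + -1 = 1` case, sketched under assumptions: reject strings that contain control characters before asking for a width, and return `undef` instead. The wrapper name `width_or_undef` is hypothetical. It takes the width function as a coderef so that the sketch runs without Text::CharWidth installed; in real use you would pass something like `sub { mbswidth( encode_utf8( $_[0] ) ) }`. It only covers control characters such as tab, not every input for which `wcwidth` might return -1.

```perl
use v5.14;
use warnings;

# Hypothetical wrapper: returns undef when the string contains a control
# character (tab, newline, other C0/C1 controls); otherwise delegates to
# the supplied width function.
sub width_or_undef {
   my ( $s, $width_of ) = @_;
   return undef if $s =~ /\p{Cc}/;
   return $width_of->( $s );
}

# Stand-in width function for this demo only: one column per character.
my $stub = sub { length $_[0] };

say width_or_undef( "abc",   $stub ) // 'undef';   # 3
say width_or_undef( "abc\t", $stub ) // 'undef';   # undef
```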