Some examples:

These characters are too short or overlap the surrounding characters:

/b5/ີ/foo
/31/ั/foo
/39/᤹/foo
/a3/ᮣ/foo

These are too wide to fit into a monospace character slot:

/4b/ോ/foo
/23/ᠣ/foo
/61/ᡡ/foo
/86/ᢆ/foo
/ba/຺/foo

Blank, whitespace, and invisible characters also count as ones that don't fit well in a URL.

I am wondering whether there is a simple way to figure out which characters fall into these categories:

  1. Fits well in a URL (Latin characters, Chinese characters, etc.).
  2. Too large for monospace (Chinese characters, the examples above, etc.).
  3. Combining character, or overlaps the surrounding URL characters (examples above).

Maybe there is a way to tell this programmatically by checking some property of the Unicode character, so I don't need to go through each character individually and visually check which category it falls into.

Mainly I am looking for which characters either (a) need to be placed on another character (combining characters), or (b) need some extra padding, like the examples above, so you can see them in the URL.

Lokasa Mawati
    URLs do not allow non-ASCII characters. Non-ASCII and reserved characters MUST be encoded in a charset of the server's choosing (usually UTF-8, but can be anything) and then the encoded bytes must be encoded in the URL in `%HH` format. IRIs, which replace URLs but are not widespread yet, allow unencoded Unicode characters. – Remy Lebeau May 29 '19 at 20:22
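
To illustrate the encoding step the comment describes, here is a minimal sketch. It assumes Python (not used elsewhere on this page) and assumes UTF-8 as the server's charset; the path is the first example from the question.

```python
from urllib.parse import quote, unquote

# First example path from the question: U+0EB5 LAO VOWEL SIGN II between slashes.
path = "/b5/\u0eb5/foo"

# quote() encodes the text as UTF-8 by default and writes every byte that is
# not an unreserved ASCII character as %HH. "/" is left alone by default.
encoded = quote(path)
print(encoded)  # /b5/%E0%BA%B5/foo

# Decoding restores the original text.
assert unquote(encoded) == path
```

Encoded this way, the URL on the wire contains only ASCII. Whether a browser shows the decoded character in the address bar is a separate display question.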

1 Answer

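The problem is ill-defined. You claim that the latter five don't fit, but for me they render in one column, which is exactly how Unicode specifies them. Also see: https://stackoverflow.com/a/56216985/46395

The following Perl script prints the number of columns that Unicode::GCString computes for each of your characters, plus three reference characters; the output follows `__END__`, one line per character in the same order.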

use 5.030;                      # enables strict and the say feature
use Unicode::GCString qw();     # CPAN module that computes display columns

for (
    # reference characters: invisible, narrow, wide
    "\N{WORD JOINER}",                  # U+2060
    "\N{LATIN SMALL LETTER L}",         # U+006C
    "\N{CJK UNIFIED IDEOGRAPH-4E2D}",   # U+4E2D

    # the question's "too short or overlapping" group
    "\N{LAO VOWEL SIGN II}",                # U+0EB5
    "\N{THAI CHARACTER MAI HAN-AKAT}",      # U+0E31
    "\N{LIMBU SIGN MUKPHRENG}",             # U+1939
    "\N{SUNDANESE CONSONANT SIGN PANYIKU}", # U+1BA3

    # the question's "too long" group
    "\N{MALAYALAM VOWEL SIGN OO}",                  # U+0D4B
    "\N{MONGOLIAN LETTER O}",                       # U+1823
    "\N{MONGOLIAN LETTER SIBE U}",                  # U+1861
    "\N{MONGOLIAN LETTER ALI GALI THREE BALUDA}",   # U+1886
    "\N{LAO SIGN PALI VIRAMA}",                     # U+0EBA
) {
    # columns() returns how many monospace columns the string occupies
    say Unicode::GCString->new($_)->columns;
}
__END__
0
1
2
0
0
0
0
1
1
1
1
1
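
For a rough classification from Unicode properties alone, the sketch below uses Python's standard `unicodedata` module rather than Perl. The function name `classify` and the five labels are made up for illustration, and the grouping of general categories is my assumption about what the question means by "fits", not anything Unicode defines. Results also depend on the Unicode version bundled with the interpreter, so recently added or reclassified characters may come out differently from the column counts above.

```python
import unicodedata


def classify(ch: str) -> str:
    """Return a rough display class for a single character (hypothetical labels)."""
    cat = unicodedata.category(ch)  # two-letter General_Category, e.g. "Mn"

    # Nonspacing and enclosing marks are drawn on top of the preceding character.
    if cat in ("Mn", "Me"):
        return "combining"

    # Format characters, controls, and separators show no visible glyph.
    if cat in ("Cf", "Cc", "Zs", "Zl", "Zp"):
        return "invisible"

    # East_Asian_Width Wide or Fullwidth characters take two monospace columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return "wide"

    # Spacing combining marks have their own advance width but still expect a base.
    if cat == "Mc":
        return "spacing-mark"

    return "normal"


for ch in ("\u2060", "l", "\u4e2d", "\u0eb5", "\u0e31", "\u0d4b", "\u1823"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), classify(ch))
```

Under these assumptions, "combining" corresponds to the question's group 3 and "wide" to group 2. A "spacing-mark" such as U+0D4B is counted as one column but normally needs a base letter before it, which may be why it looks oversized when it stands alone between slashes.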
daxim