Transliteration on Unicode LATIN LETTERS "WITH STROKE"

Question

Feeding the rule "NFD; [:Nonspacing Mark:] Remove; NFC" into the ICU Transliterator demo, the character Ø (\u00d8 == LATIN CAPITAL LETTER O WITH STROKE) remains as-is (i.e. the STROKE is not stripped).

In the list of nonspacing marks (Category Mn), I cannot find anything named COMBINING DIAGONAL STROKE akin to the COMBINING SHORT STROKE OVERLAY (\u0335) or COMBINING LONG STROKE OVERLAY (\u0336).

However, I do find COMBINING SHORT SOLIDUS OVERLAY (\u0337) and COMBINING LONG SOLIDUS OVERLAY (\u0338). They appear similar, but render as much thicker lines in my browser when combined with o and O.

The Unicode data I accessed for \u00d8 does not provide a decomposition for that character.

At the same time, the ICU Collator Demo collates each of ø, o, Ø, O, o\u0337 and O\u0338 identically when using a Primary (Level = 1 = Base Letter) Collator.

Does this mean that the locale of Collator used in the Demo has been set up to identify the base character in a way where the Unicode spec is silent?

If so, do I need to write a custom Rule Based Transliterator if I want to strip the STROKE from LATIN [CAPITAL, SMALL] LETTER * characters on transliteration?

score 2 · Accepted Answer · edited May 23 '17 at 09:58

See the following. The Latin-ASCII transliterator went into ICU 4.6. As you noted, the collation demo uses UCA / CLDR tailorings, which have O versus slashed-O as base letter differences. This is not the same question as whether there is a decomposition: "w" doesn't decompose into "v + v" either. The decompositions have to do with whether there were existing encodings which represented characters in two different ways.

answered Jul 29 '11 at 00:09

Steven R. Loomis

Using the LATIN-ASCII transform is definitely better than writing my own Transliterator rules! Thanks, Steven. – Jacob Zwiers Aug 04 '11 at 13:09

score 1 · Answer 2 · answered Jul 28 '11 at 16:47

Yes. For some reason, the letter Ø does not have a decomposition, so you have to handle it manually.
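
As a rough illustration of "handling it manually", here is a minimal sketch that uses only the JDK's java.text.Normalizer, not ICU. It assumes a small hand-picked table of WITH STROKE letters; the class name StrokeStripper and the table itself are made up for this example and are far from complete.

```java
import java.text.Normalizer;
import java.util.Map;

final class StrokeStripper {
    // Hypothetical, incomplete table: these letters have no canonical
    // decomposition, so NFD leaves them untouched.
    private static final Map<Character, Character> STROKE_TO_BASE = Map.of(
        '\u00D8', 'O',  // LATIN CAPITAL LETTER O WITH STROKE
        '\u00F8', 'o',  // LATIN SMALL LETTER O WITH STROKE
        '\u0110', 'D',  // LATIN CAPITAL LETTER D WITH STROKE
        '\u0111', 'd',  // LATIN SMALL LETTER D WITH STROKE
        '\u0126', 'H',  // LATIN CAPITAL LETTER H WITH STROKE
        '\u0127', 'h',  // LATIN SMALL LETTER H WITH STROKE
        '\u0141', 'L',  // LATIN CAPITAL LETTER L WITH STROKE
        '\u0142', 'l',  // LATIN SMALL LETTER L WITH STROKE
        '\u0166', 'T',  // LATIN CAPITAL LETTER T WITH STROKE
        '\u0167', 't'   // LATIN SMALL LETTER T WITH STROKE
    );

    static String strip(String input) {
        // Step 1: decompose, then drop nonspacing marks (category Mn).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String unmarked = decomposed.replaceAll("\\p{Mn}+", "");

        // Step 2: map the stroke letters that step 1 could not touch.
        StringBuilder out = new StringBuilder(unmarked.length());
        for (int i = 0; i < unmarked.length(); i++) {
            char c = unmarked.charAt(i);
            char mapped = STROKE_TO_BASE.getOrDefault(c, c);
            out.append(mapped);
        }
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // NFD alone leaves O WITH STROKE as a single code point.
        String nfd = Normalizer.normalize("\u00D8", Normalizer.Form.NFD);
        System.out.println(nfd.length());
        System.out.println(strip("\u00D8rsted \u0141\u00F3d\u017A"));
    }
}
```

A lookup table like this only covers the letters you list, which is why a ready-made transform such as Latin-ASCII is usually the easier route when ICU is available.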

dan04

score 0 · Answer 3 · edited Mar 06 '14 at 16:53

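This transform, followed by replaceAll, also gets rid of Ø and any other characters the transform leaves untouched (they are deleted, not mapped to a base letter).

String id = "Accents-Any;NFD;[:Nonspacing Mark:] Remove; NFC";
// Assumption: "latin" is the transliterated text; "input" is a placeholder for the original string.
String latin = Transliterator.getInstance(id).transliterate(input);
System.out.println(latin.replaceAll("[^\\w]",""));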
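
To see what the replaceAll step does on its own, here is a small JDK-only sketch; the class and method names are made up for illustration. By default Java's \w matches only [a-zA-Z_0-9], so "[^\\w]" deletes Ø outright (along with spaces and punctuation) rather than turning it into O. The second method uses Pattern.UNICODE_CHARACTER_CLASS for contrast.

```java
import java.util.regex.Pattern;

final class WordClassDemo {
    // Default behaviour: \w is ASCII-only, so non-ASCII letters are removed.
    static String dropNonWord(String s) {
        return s.replaceAll("[^\\w]", "");
    }

    // With UNICODE_CHARACTER_CLASS, \w also covers non-ASCII letters,
    // so O WITH STROKE survives and only the space is removed here.
    static String dropNonWordUnicode(String s) {
        return Pattern.compile("[^\\w]", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "S\u00F8ren \u00D8rsted";
        System.out.println(dropNonWord(sample));
        System.out.println(dropNonWordUnicode(sample));
    }
}
```

If the goal is O rather than nothing, the character has to be mapped before this step; the regex only filters.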

Lucas Zamboulis

answered Mar 06 '14 at 16:30

user3389160
