Which Unicode characters are "composing" characters (whose sole purpose is to add an accent or tilde)?

Question

This is related to: What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?

This is how I plan to do this:

Use http://msdn.microsoft.com/en-us/library/dd374126%28v=vs.85%29.aspx to turn the string into normalization form KD (NFKD).

Basically, it turns most variants, such as superscript digits, into the normal digit. It also decomposes a character carrying a tilde or accent into two characters: the base letter and the mark.
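As a rough illustration of that decomposition step, here is a minimal sketch in Python (chosen only for brevity, it is not VB.NET). It uses the standard `unicodedata` module; `to_nfkd` is a made-up helper name, and the sample string is just an assumption for the demo.

```python
import unicodedata

def to_nfkd(text: str) -> str:
    """Return the compatibility decomposition (NFKD) of text."""
    return unicodedata.normalize("NFKD", text)

# U+00E9 is "e with acute", U+00B2 is "superscript two".
decomposed = to_nfkd("\u00e9\u00b2")

# List the resulting code points: base letter, combining acute, plain digit.
codepoints = [f"U+{ord(ch):04X}" for ch in decomposed]
print(codepoints)  # ['U+0065', 'U+0301', 'U+0032']
```

The accent survives as a separate code point (U+0301), which is what the removal step then has to find.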

The next step would be to remove all characters whose sole purpose is to add a tilde or accent to another character.

How do I know which characters are like that? Which characters are just "composing characters"?

How do I find such characters? After I find them, how do I get rid of them? Should I scan character by character and remove all such "combining characters"?

For example, characters from U+0300 to U+0362 can be gotten rid of.

Then what?

score 3 · Accepted Answer · answered Jul 05 '12 at 15:06

Combining characters are listed in UnicodeData.txt as having a nonzero Canonical_Combining_Class, and a General_Category of Mn (Mark, nonspacing).
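To make those two properties concrete, here is a small Python sketch that reads them from the copy of the Unicode database bundled with the interpreter (`unicodedata.combining` returns the Canonical_Combining_Class, `unicodedata.category` the General_Category). `describe` is a hypothetical helper, and the sample code points are only illustrative.

```python
import unicodedata
from typing import Tuple

def describe(ch: str) -> Tuple[int, str]:
    """Return (Canonical_Combining_Class, General_Category) for one character."""
    return unicodedata.combining(ch), unicodedata.category(ch)

print(describe("\u0301"))  # COMBINING ACUTE ACCENT
print(describe("\u0303"))  # COMBINING TILDE
print(describe("e"))       # ordinary base letter
print(describe("\u0903"))  # DEVANAGARI SIGN VISARGA, a spacing mark
print(describe("\u20dd"))  # COMBINING ENCLOSING CIRCLE, an enclosing mark
```

The accent and tilde report class 230 and category Mn, while the plain letter reports class 0 and category Ll. The last two samples are marks (Mc and Me) with class 0, so if those should be removed too, testing the category for any M* value is the broader check.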

answered Jul 05 '12 at 15:06

dan04

how do we know that in vb.net? – user4951 Jul 06 '12 at 01:57

bobince · Answer 2 · 2012-07-06T14:13:45.960

2

For each character in the string, call Char.GetUnicodeCategory and check whether the returned UnicodeCategory is NonSpacingMark, SpacingCombiningMark or EnclosingMark.

You may be able to do it more efficiently using a regex, e.g. Regex.Replace(str, "\p{M}", "").
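Putting the pieces together, the whole pipeline (decompose to NFKD, then drop every character whose category is a mark) can be sketched as follows. This is Python rather than VB.NET, and `strip_marks` is a made-up name. Python's built-in `re` module does not support `\p{M}`, so the sketch uses the per-character category check instead of the regex.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose text to NFKD, then drop all mark characters (Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        # General categories Mn, Mc and Me all start with "M".
        if not unicodedata.category(ch).startswith("M")
    )

# Accented letters lose their marks; the superscript two becomes a plain "2".
print(strip_marks("cr\u00e8me br\u00fbl\u00e9e, ma\u00f1ana, x\u00b2"))
# creme brulee, manana, x2
```

In .NET the decomposition step would presumably be String.Normalize(NormalizationForm.FormKD), followed by either the category loop or the Regex.Replace call described above.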

edited Jul 06 '12 at 14:13

answered Jul 06 '12 at 14:06
