
This is related to the question "What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?"

This is how I plan to do this:

Use http://msdn.microsoft.com/en-us/library/dd374126%28v=vs.85%29.aspx to convert the string to normalization form KD.

Basically, it turns most variants, such as superscript digits, into the normal digit. It also decomposes a character carrying a tilde or accent into two characters: the base character and the mark.

The next step would be to remove all characters whose sole purpose is to add a tilde or accent to another character.

How do I know which characters are like that? Which characters are just "combining characters"?

How do I find such characters? Once I find them, how do I get rid of them? Should I scan character by character and remove all such "combining characters"?

For example, the characters from 300 to 362 can be removed.

Then what?

user4951

2 Answers


Combining characters are listed in UnicodeData.txt as having a nonzero Canonical_Combining_Class, and a General_Category of Mn (Mark, nonspacing).

dan04

For each character in the string, call GetUnicodeCategory and check the UnicodeCategory for NonSpacingMark, SpacingCombiningMark or EnclosingMark.

You may be able to do it more efficiently using a regex, e.g. Regex.Replace(str, "\p{M}", "").
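Here is a minimal sketch of both approaches in VB.NET. It assumes the managed String.Normalize method is an acceptable stand-in for the Win32 normalization call linked in the question. The module and function names (MarkStripper, StripMarksByCategory, StripMarksByRegex) are made up for illustration.

```vb
Imports System.Globalization
Imports System.Text
Imports System.Text.RegularExpressions

' Hypothetical helper module; the names are illustrative only.
Module MarkStripper

    ' Decompose to form KD, then drop every character whose
    ' category is one of the three mark categories.
    Function StripMarksByCategory(source As String) As String
        Dim decomposed As String = source.Normalize(NormalizationForm.FormKD)
        Dim result As New StringBuilder(decomposed.Length)

        ' Note: this walks UTF-16 code units, so marks outside the
        ' Basic Multilingual Plane (stored as surrogate pairs) are not removed.
        For Each ch As Char In decomposed
            Select Case CharUnicodeInfo.GetUnicodeCategory(ch)
                Case UnicodeCategory.NonSpacingMark,
                     UnicodeCategory.SpacingCombiningMark,
                     UnicodeCategory.EnclosingMark
                    ' Combining mark: skip it.
                Case Else
                    result.Append(ch)
            End Select
        Next

        Return result.ToString()
    End Function

    ' Same idea with a regex: \p{M} matches any character in a mark category.
    Function StripMarksByRegex(source As String) As String
        Dim decomposed As String = source.Normalize(NormalizationForm.FormKD)
        Return Regex.Replace(decomposed, "\p{M}", "")
    End Function

End Module
```

With either function, "café" should come back as "cafe", because form KD splits "é" into "e" followed by U+0301, which is a nonspacing mark. Whether you want to recompose the result afterwards (for example with NormalizationForm.FormKC) depends on how the strings are compared later.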

bobince