What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?

Question

Also, what's the VB.NET function that will map all those different characters to their most standard form?

For example, `ToLower` would map A and a to the same character, right?

I need the same kind of function for these characters:

German

ß === s

Ü === u

Χιοσ == Χίος

Otherwise, sometimes I insert Χιοσ, and later, when I insert Χίος, MySQL complains that the ID already exists.

So I want to create a unique ID that maps all those strange characters into a more stable one.

Neither UTF8 nor Unicode are collations. Please be more precise. — SLaks, Mar 23 '12 at 05:15
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html — SLaks, Mar 23 '12 at 05:25
Also take a look at this [link](http://www.collation-charts.org/mysql60/) — Mosty Mostacho, Mar 23 '12 at 05:29
So I am dealing with "1.2 Canonical Equivalence". Where is the list of those canonical equivalences, and is there a VB.NET function that maps all canonically equivalent glyphs to their most standard form? — user4951, Mar 30 '12 at 05:33
Can anyone turn this into an answer so I can give you easy points? — user4951, May 18 '12 at 06:34
Same question as [Is there a function in vb.net that will tell us whether 2 string is equivalent under UTF8 unicode collation?](http://stackoverflow.com/questions/10713304/is-there-a-function-in-vb-net-that-will-tell-us-whether-2-string-is-equivalent-u) — SSS, Jun 14 '12 at 05:21
I forget. Similar but not the same. Either way, it's not answered. — user4951, Jun 14 '12 at 05:50
Maybe _MySQL complains that the ID already exists_ because IDs are not case sensitive? For the Greek letter sigma, `'σς'.ToUpper()` gives `ΣΣ`… — JosefZ, Jan 09 '21 at 18:10

score 1 · Answer 1 · edited May 23 '17 at 10:34

For the encoding aspect, look at `String.Normalize`. It also has an overload that takes the particular normal form you want to convert the string to, but the default normal form (C) will work just fine for nearly everyone who wants to "map all those different characters into their most standard form".
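As a minimal sketch of what `String.Normalize` does (the module name and the sample character "é" are illustrative choices, not taken from the question), the following compares a precomposed character with its decomposed spelling before and after normalization to form C:

```vb
Imports System
Imports System.Text

Module NormalizeSketch
    Sub Main()
        ' "é" written two ways: precomposed U+00E9, and "e" followed by
        ' the combining acute accent U+0301.
        Dim precomposed As String = ChrW(&HE9)
        Dim decomposed As String = "e" & ChrW(&H301)

        ' An ordinal comparison sees two different sequences of code units.
        Console.WriteLine(String.Equals(precomposed, decomposed, StringComparison.Ordinal)) ' False

        ' Normalize() without arguments uses form C; the overload takes a NormalizationForm.
        Dim a As String = precomposed.Normalize()
        Dim b As String = decomposed.Normalize(NormalizationForm.FormC)
        Console.WriteLine(String.Equals(a, b, StringComparison.Ordinal)) ' True

        ' IsNormalized reports whether a string is already in the given form.
        Console.WriteLine(decomposed.IsNormalized(NormalizationForm.FormC)) ' False
    End Sub
End Module
```

The comparisons use `StringComparison.Ordinal` on purpose: culture-sensitive comparisons may already treat the two spellings as equal, which would hide what normalization changes.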

However, things get more complicated once you move into the database and deal with collations.

Unicode normalization never changes character case. It covers only cases where the characters are essentially equivalent: they look the same¹ and mean the same thing. For example, even after normalization,

 Χιοσ != Χίος

The two sigma characters (σ and final ς) are considered non-equivalent, and the accent on the iota is never dropped: the accented iota (\u03AF) is equivalent only to a sequence of two characters, the plain iota (\u03B9) followed by the combining acute accent (\u0301).
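A small sketch of that distinction, assuming the accented iota in Χίος is U+03AF (iota with tonos), whose canonical decomposition is U+03B9 followed by U+0301; the module and variable names are made up for illustration:

```vb
Imports System
Imports System.Text

Module GreekSketch
    Sub Main()
        ' Chi, accented iota (U+03AF), omicron, final sigma (U+03C2).
        Dim accented As String = ChrW(&H3A7) & ChrW(&H3AF) & ChrW(&H3BF) & ChrW(&H3C2)
        ' The same word with the accent as a separate combining character.
        Dim combining As String = ChrW(&H3A7) & ChrW(&H3B9) & ChrW(&H301) & ChrW(&H3BF) & ChrW(&H3C2)
        ' Chi, plain iota, omicron, non-final sigma (U+03C3).
        Dim plain As String = ChrW(&H3A7) & ChrW(&H3B9) & ChrW(&H3BF) & ChrW(&H3C3)

        Dim form As NormalizationForm = NormalizationForm.FormC

        ' Canonically equivalent spellings become identical after normalization.
        Console.WriteLine(String.Equals(accented.Normalize(form), combining.Normalize(form), StringComparison.Ordinal)) ' True

        ' The accent and the sigma variant both survive normalization.
        Console.WriteLine(String.Equals(accented.Normalize(form), plain.Normalize(form), StringComparison.Ordinal)) ' False

        ' Case is left alone as well.
        Console.WriteLine(String.Equals("A".Normalize(form), "a".Normalize(form), StringComparison.Ordinal)) ' False
    End Sub
End Module
```

So normalization alone will not make the two spellings from the question collide; whether the database treats them as equal is decided by its collation.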

Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed and are bound to change over time (even if the initial version of the application does not plan to support that). Oh, and I forgot their sensitivity to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.

It may still be useful to normalize them before storing, for the purpose of searching and safer subsequent processing; but the particular case-insensitive collation that you use will no longer restrict you in any way.

¹Almost the same in the case of compatibility normalization, as opposed to canonical normalization.

String.Normalize won't turn a ß into an s. It... "Returns a new string whose textual value is the same as this string, but whose binary representation". I want to change the textual value so it matches the standard one. — user4951, Jul 03 '12 at 06:48
@JimThio - That is because ß and s are simply different letters. They have nothing in common except when sorting. Your app should not have unique constraints on columns where you cannot meet them. — Jirka Hanika, Jul 03 '12 at 07:05
They are different letters, but the utf8 collation says they're the same and doesn't allow IDs that differ only in those two. — user4951, Jul 03 '12 at 07:51
@JimThio - You mean ss, not s. Collation specifies ordering rules, not equivalence rules. Just don't use such strings as ids; or use the binary collation if you must. — Jirka Hanika, Jul 03 '12 at 08:04
I like to use such strings as IDs. It makes it easier for me to debug. Aha, the ID is mc-donald--6.19-107.7, so it must be that McDonald's restaurant. — user4951, Jul 03 '12 at 08:54
