
Also, what is the VB.NET function that maps all those different characters into their most standard form?

For example, ToLower would map A and a to the same character, right?

I need the same kind of function for characters like these (German and Greek, for example):

ß == s, Ü == u, Χιοσ == Χίος

Otherwise, I sometimes insert Χιοσ, and later, when I insert Χίος, MySQL complains that the ID already exists.

So I want to create a unique ID that maps all those strange characters into a single, stable form.

Joel Coehoorn
user4951
  • Neither UTF8 nor Unicode are collations. Please be more precise. – SLaks Mar 23 '12 at 05:15
  • My collation in mysql is UTF8_unicode_ci – user4951 Mar 23 '12 at 05:21
  • http://www.unicode.org/reports/tr10/ – SLaks Mar 23 '12 at 05:24
  • http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html – SLaks Mar 23 '12 at 05:25
  • Also take a look at this [link](http://www.collation-charts.org/mysql60/) – Mosty Mostacho Mar 23 '12 at 05:29
  • So I am dealing with "1.2 Canonical Equivalence". Where is the list of those canonical equivalences, and is there a VB.NET function that maps all canonically equivalent glyphs into their most standard form? – user4951 Mar 30 '12 at 05:33
  • Can anyone turn this into an answer so I can give you easy points. – user4951 May 18 '12 at 06:34
  • Same question as [Is there a function in vb.net that will tell us whether 2 string is equivalent under UTF8 unicode collation?](http://stackoverflow.com/questions/10713304/is-there-a-function-in-vb-net-that-will-tell-us-whether-2-string-is-equivalent-u) – SSS Jun 14 '12 at 05:21
  • I forget. Similar but not the same. Either case it's not answered – user4951 Jun 14 '12 at 05:50
  • Maybe that _MySQL complaints that the ID already exist_ because IDs are not case sensitive? For Greek letters Sigma: `'σς'.ToUpper()` gives `ΣΣ`… – JosefZ Jan 09 '21 at 18:10

1 Answer


For the encoding side of the problem, look at String.Normalize. It also has an overload that takes the particular normalization form you want to convert the string to, but the default form (C) will work just fine for nearly everyone who wants to "map all those different characters into their most standard form".
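As a language-neutral illustration of what normalization form C does, here is a minimal sketch in Python, whose standard `unicodedata.normalize` plays the same role as String.Normalize in .NET. The helper name `normalize` and the sample strings are only illustrative assumptions.

```python
import unicodedata

def normalize(text: str, form: str = "NFC") -> str:
    """Return text in the given Unicode normalization form (NFC by default)."""
    return unicodedata.normalize(form, text)

composed = "\u00dc"      # Ü as a single precomposed code point
decomposed = "U\u0308"   # U followed by COMBINING DIAERESIS

# The raw strings differ code point by code point ...
print(composed == decomposed)                        # False
# ... but both normalize to the same form C string.
print(normalize(composed) == normalize(decomposed))  # True
# Form D goes the other way and splits the precomposed character.
print(len(normalize(composed, "NFD")))               # 2
```

Normalizing both sides before comparing or storing is what makes the two spellings of Ü interchangeable; it does nothing for characters that Unicode treats as different letters.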

However, things get more complicated once you move into the database and deal with collations.

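Unicode normalization never changes character case. It covers only cases where the characters are essentially equivalent: they look the same¹ and mean the same thing. For example,

 Χιοσ != Χίος

The two sigma characters (σ, \u03C3, and final ς, \u03C2) are considered non-equivalent, and an accented iota is not equivalent to the plain iota (\u03B9). An accented iota is equivalent only to its own decomposed sequence: for example, iota with psili (\u1F30) is equivalent to a sequence of two characters, the plain iota (\u03B9) and the combining accent (\u0313). Likewise, iota with tonos (\u03AF), which appears to be the one used in Χίος, is equivalent to \u03B9 followed by \u0301.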
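The claims above can be checked with a short Python sketch (again using `unicodedata` as a stand-in for String.Normalize). It assumes the accented iota in Χίος is the tonos form U+03AF; the variable names are illustrative.

```python
import unicodedata

a = "\u03a7\u03b9\u03bf\u03c3"  # Χιοσ: plain iota, medial sigma
b = "\u03a7\u03af\u03bf\u03c2"  # Χίος: iota with tonos (assumed U+03AF), final sigma

# Normalization does not make the two spellings equal.
same_after_nfc = unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
print(same_after_nfc)  # False

# Even compatibility normalization keeps the two sigmas distinct.
sigma_equal = unicodedata.normalize("NFKC", "\u03c3") == unicodedata.normalize("NFKC", "\u03c2")
print(sigma_equal)  # False

# An accented iota is equivalent only to its own decomposition.
iota_psili = unicodedata.normalize("NFD", "\u1f30")
iota_tonos = unicodedata.normalize("NFD", "\u03af")
print([hex(ord(ch)) for ch in iota_psili])  # ['0x3b9', '0x313']
print([hex(ord(ch)) for ch in iota_tonos])  # ['0x3b9', '0x301']
```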

Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed, are bound to change over time (even if the initial version of the application does not plan to support that), and are sensitive to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.
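A minimal sketch of that design follows, using Python's built-in `sqlite3` module purely as a stand-in for MySQL (where the column would be declared with AUTO_INCREMENT instead). The table and column names `place`, `id`, and `name` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The key is a meaningless integer generated by the database;
# the Unicode string is an ordinary attribute with no unique constraint.
conn.execute(
    "CREATE TABLE place (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL)"
)

ids = []
for name in ("Χιοσ", "Χίος"):
    cur = conn.execute("INSERT INTO place (name) VALUES (?)", (name,))
    ids.append(cur.lastrowid)

rows = conn.execute("SELECT id, name FROM place ORDER BY id").fetchall()
print(ids)   # [1, 2]
print(rows)  # [(1, 'Χιοσ'), (2, 'Χίος')]
```

Both spellings are stored without any "ID already exists" conflict, because the collation of the name column no longer takes part in the key.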

It may still be useful to normalize them before storing, for the purpose of searching and safer subsequent processing; but the particular case-insensitive collation that you use will no longer restrict you in any way.
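If a folded value is wanted for searching, one possible approach (not something built into String.Normalize, and only an approximation of what a case-insensitive collation such as utf8_unicode_ci does) is to case-fold the string and strip combining marks. The sketch below is in Python; `search_key` is a hypothetical helper name.

```python
import unicodedata

def search_key(text: str) -> str:
    """Fold case and drop accents to build a loose, search-only key."""
    folded = text.casefold()                             # ß -> ss, final ς -> σ
    decomposed = unicodedata.normalize("NFD", folded)    # split off accents
    stripped = "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn"              # drop combining marks
    )
    return unicodedata.normalize("NFC", stripped)

print(search_key("Χιοσ") == search_key("Χίος"))  # True
print(search_key("Straße"))                      # strasse
print(search_key("Ü"))                           # u
```

Such a key is lossy by design, so it is better kept in a separate searchable column next to the original string than used as an identifier.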


¹ Almost the same, in the case of compatibility normalization as opposed to canonical normalization.

Jirka Hanika
  • String.Normalize won't turn a ß into an s. It "Returns a new string whose textual value is the same as this string, but whose binary representation" is in a given normalization form. I want to change the textual value so that it matches the standard one. – user4951 Jul 03 '12 at 06:48
  • @JimThio - That is because ß and s are simply different letters. They have nothing in common except when sorting. Your app should not have unique constraints on columns where you cannot meet them. – Jirka Hanika Jul 03 '12 at 07:05
  • They are different letters but utf8 collation says they're the same and doesn't allow ids that differ only on those 2. – user4951 Jul 03 '12 at 07:51
  • @JimThio - You mean ss, not s. Collation specifies ordering rules, not equivalence rules. Just don't use such strings as ids; or use the binary collation if you must. – Jirka Hanika Jul 03 '12 at 08:04
  • I like to use such strings as IDs. It makes it easier for me to debug. Ah ha, the id is mc-donald--6.19-107.7, so it must be that McDonald's restaurant. – user4951 Jul 03 '12 at 08:54