Compare Combining Diacritical Marks and Latin Extended Additional in SQL Server, C#

Question

Problem: two same word below are not equal (try "PHƯỢNG" == "PHƯỢNG" in console)

PHƯỢNG
PHƯỢNG

Describe :

The 'Ợ' characeter in the first word is using Combining Diacritical Marks : U+01A0 + U+0323. This is the data from database

While the second one use Latin Extended Additional : U+1EE2. This is the data which has been input by Vietnamese Locale keyboard

The words is in VietNamese

Output : Although having difference in Unicode block, I want to two words above compare to be equaled in Microsoft SQL Server

More information :

SqlCommand : [Table] COLLATE Vietnamese_CI_AI_KI_WI LIKE N'%'+@Input+'%'

More word to test :

Word : Ế [U+1EBE ] and Ế [U+00CA , U+0301]

TIẾN
TIẾN

WORD : Ờ [U+1EDC] and Ờ [U+01A0, U+0300]

TRƯỜNG
TRƯỜNG

The *collation* specifies the comparison rules. If *that* collation doesn't satisfy the rules you want, use a different one. Most of the time, SQL Server's collations match the Windows locales, which means you should be able to use the collation that corresponds to your machine's locale — Panagiotis Kanavos, Mar 07 '18 at 14:16
YOu can get all collations on SQL Server with `SELECT count(*) FROM sys.fn_helpcollations()`. You can get the *Windows* collations with `SELECT count(*) FROM sys.fn_helpcollations() WHERE name LIKE 'SQL%';` You can get the Vietnamese ones with `SELECT * FROM sys.fn_helpcollations() WHERE name LIKE 'Vietnamese%';` — Panagiotis Kanavos, Mar 07 '18 at 14:22

score 0 · Accepted Answer · answered Mar 07 '18 at 14:13

This can't be compare in SQL cause SQL doesn't support normalize unicode character. The only way to compare a unicode field with different unicode enconding (the one use Combining diacritical mark Picture 1 and the one use latin extended additional block Picture 2 is normalize both of them into a decomposited one which is Picture 1 or fully composited it like Picture 3.

Here is the concept :

Select all of data of field you want to compare and store them into anything you want (array,list - select via EF, datatable via sqlcommand)
Then normalize all of them and the input by looping and call String.Normalize depend on your choice, you can specify string normalize to be fully composite (each accent is a unicode character) or decomposite it : only one character represent for the Ợ
Finally you can filter them normal

Picture 1

Picture 2

Picture 3

Reference :

SQL Server uses a single well defined Unicode encoding *assuming* a Unicode type is used, ie nvarchar or nchar. There aren't multiple encodings to compare. This is a question about collations, not encodings and normalization anyway — Panagiotis Kanavos, Mar 07 '18 at 14:15
Oops. Thanks for clarify me that my answer seem to not relavant to op's answer, but look like he accepted my solution. Hope some body will find better solution to solve this problem by make those character equaled in Sqlserver. I have added those text to database and compare it with VietNamese collation, CI , BIN but no luck. — PhmNgocNghia, Mar 07 '18 at 14:32

Compare Combining Diacritical Marks and Latin Extended Additional in SQL Server, C#

1 Answers1