Len function : unicode character counted as 2

Question

Please explain for me why is the below string's length is 9 instead of 8?

DECLARE @nstring NVARCHAR(100)

SET @nstring =N'Không có'

Print len(@nstring) -- 9 

SET @nstring =N'Không co'

Print len(@nstring) -- 8

@TimBiegeleisen I don't believe thats a duplicate - len and datalength return the same value and there are no trailing spaces. Its the accented character. — Dale K, Apr 29 '19 at 04:55
I am getting a length of 8 for both strings, see here: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=5dbbcf065929ef8283ebe00c25f03cf0 — Tim Biegeleisen, Apr 29 '19 at 04:55
@TimBiegeleisen I wonder whether the copy and paste has changed the char for you, because I copied the query into query analyser and am observing the same issue as the OP. — Dale K, Apr 29 '19 at 04:57
@DaleBurrell thanks you, I don't think that this is a duplicate too, and don't know why there are downvotes... — EagerToLearn, Apr 29 '19 at 04:59
@TimBiegeleisen `unicode('ó')` of the OPs last char gives 111, your last char gives 243 - so its a different char. — Dale K, Apr 29 '19 at 05:00
@TimBiegeleisen please see it here : https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=18c26449b5b2fa7c1acd438e5f1891a8 — EagerToLearn, Apr 29 '19 at 05:01
@EagerToLearn - the dup flag has been removed - you might need to refresh — Dale K, Apr 29 '19 at 05:02
There is something odd about that character though, it appears to be a 2 part char, in query analyser when you hit backspace it first turns into an 'o' then the second time it is deleted. Whereas the char Tim used deleted in one go. So it seems you char is 2 chars superimposed - or something - never seen that before. Where does your data come from? — Dale K, Apr 29 '19 at 05:03
@DaleBurrell I think it has to do with how `LEN()` counts UTF-16 characters. But, I don't have an exact answer here. — Tim Biegeleisen, Apr 29 '19 at 05:05
@DaleBurrell It is Vietnamese, and I just typed it by the keyboard — EagerToLearn, Apr 29 '19 at 05:06
Interesting, but you can generate the same looking char using `nchar(243)` and this one isn't split into 2 i.e. works as you would expect. — Dale K, Apr 29 '19 at 05:07
https://en.wikipedia.org/wiki/Precomposed_character In your case the last "character" `ó` is made of two unicode codepoints 0x6F00 and 0x0103. You can see it if you cast your string to varbinary `SELECT CAST(@nstring AS varbinary(200))`. So, `LEN` counts them as two characters. — Vladimir Baranov, Apr 29 '19 at 05:11
@VladimirBaranov You have the endianness the wrong way round. It's actually 0x006F and 0x0301. — Mr Lister, Apr 29 '19 at 06:37

score 2 · Accepted Answer · answered Apr 29 '19 at 06:57

For some characters, there are several ways they can be encoded as Unicode.
In this case, an "ó" can be either U+00F3 (one 16-bit value) or U+006F U+0301 (two 16-bit values). These forms are canonically equivalent.

If you feel like reading a bit more, Using Unicode Normalization to Represent Strings by Microsoft.

Unfortunately, there is no way in T-SQL to convert a string from one form into another. See also questions like Normalize unicode string in SQL Server?

But the good news is since they're canonically equivalent, they compare the same in T-SQL (you can write N'Không có'=N'Không có' and the result is true) so it's not that big a problem as you may think at first.

score 1 · Answer 2 · answered Apr 29 '19 at 06:30

Your last character count as two

DECLARE @nstring NVARCHAR(100)
SET @nstring =N'Không có'

print unicode(substring(@nstring,8,1))
print ascii(substring(@nstring,8,1))

print unicode(substring(@nstring,9,1))
print ascii(substring(@nstring,9,1))

The result is displayed in the image below.

Len function : unicode character counted as 2

2 Answers2