
Check out the following example. It shows that searching within a Unicode string (nvarchar) is almost eight times as slow as searching within a varchar string, and on par with implicit conversions. I'm looking for an explanation for this, or a way to search within nvarchar strings more efficiently.

use tempdb
create table test
(
    testid int identity primary key,
    v varchar(36),
    nv nvarchar(36),
    filler char(500)
)
go

set nocount on
set statistics time off
insert test (v, nv)
select CAST (newid() as varchar(36)),
    CAST (newid() as nvarchar(36))
go 1000000

set statistics time on
-- search varchar string
select COUNT(1) from test where v like '%abcd%' option (maxdop 1)
-- CPU time = 906 ms,  elapsed time = 911 ms.

-- search varchar string with a unicode pattern (uses CONVERT_IMPLICIT)
select COUNT(1) from test where v like N'%abcd%' option (maxdop 1)
-- CPU time = 6969 ms,  elapsed time = 6970 ms.

-- search nvarchar string
select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
-- CPU time = 6844 ms,  elapsed time = 6911 ms.
Michael J Swart
  • FYI, turns out the higher CPU in the implicit conversion example (query 2) is *not* due to the conversion itself, but to unicode comparison logic, just like the other unicode query (query 3). – Michael J Swart Jan 18 '11 at 15:29
  • This is an excellent question and I've added a link to my answer here [varchar-vs-nvarchar-performance](http://stackoverflow.com/questions/35366) – gbn Jan 18 '11 at 16:27
  • @gbn, in that post you linked to http://msdn.microsoft.com/en-us/library/ms189617.aspx which is the explanation I like best. Thanks! – Michael J Swart Jan 20 '11 at 19:09
  • Turned this question into a blog post: http://michaeljswart.com/2011/02/searching-inside-strings-cpu-is-eight-times-worse-for-unicode-strings/ – Michael J Swart Jun 08 '11 at 04:50

5 Answers


Looking for an explanation for this.

Nvarchar stores 16-bit characters, and Unicode comparison rules are a lot more complicated than ASCII rules - special characters for the various languages supported at the same time require quite a bit more processing.

TomTom
  • Hmmm. interesting. In theory then using a binary collation might be a bit faster... stay tuned. – Michael J Swart Jan 17 '11 at 20:35
  • 7
    Oh my God, that's it! When using "nv COLLATE Latin1_General_Bin like N'%ABCD%'" I get: -- CPU time = 890 ms, elapsed time = 881 ms. – Michael J Swart Jan 17 '11 at 20:41
  • 5
    Let me guess - you are english speaker ;) Talk to some people from germany and france and you start realizing the partially ODD rules around accents and special chars. This simply take time to resolve ;) Good we nailed that ;) – TomTom Jan 17 '11 at 21:08
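The workaround Michael found in the comments above can be sketched against the test table from the question (a sketch, not from the original post; note that a binary collation is also case-sensitive, and the uppercase pattern matters because casting newid() to a string produces uppercase hex):

```sql
-- Sketch of the binary-collation workaround from the comments above.
-- COLLATE Latin1_General_Bin bypasses linguistic Unicode comparison
-- rules, which is safe here because the data is GUID text (hex digits
-- and dashes only). Binary collations are case-sensitive, hence N'%ABCD%'.
select COUNT(1) from test
where nv COLLATE Latin1_General_Bin like N'%ABCD%'
option (maxdop 1)
-- Reported in the comment: CPU time = 890 ms, elapsed time = 881 ms.
```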

My guess is that LIKE is implemented using an O(n^2) algorithm as opposed to an O(n) algorithm; it would probably have to be for the leading % to work. Since the Unicode string is twice as long, that seems consistent with your numbers.

Larry Coleman
  • You're right, that explanation is consistent with the numbers, until I did a further experiment (see comment under TomTom's answer). Thanks for stopping by Larry – Michael J Swart Jan 17 '11 at 20:50
  • @Michael: I'm curious about whether you see the same result with the varchar column. – Larry Coleman Jan 17 '11 at 20:59
  • With varchar+latin collation I get "cpu time = 891" which is a bit better than without the collation, but I can't tell if it's significantly better without having a decent grasp of stats. :-) – Michael J Swart Jan 17 '11 at 21:05

A LIKE 'abc%' search can be implemented as a > and < range seek, but a leading-wildcard LIKE '%abc%' search cannot. The more rows there are, the more processing time, since SQL Server can't make effective use of statistics for such searches.

Additionally, Unicode data requires more storage, and along with collation complications, searching it is typically not as efficient as a plain-vanilla varchar search. As you have observed, the fastest collation search is a binary collation search.

These kinds of searches are best suited to Full-Text Search, or implemented using a fuzzy lookup with an in-memory hash table in case you have lots of RAM and a fairly static table.
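A rough sketch of the Full-Text route this answer suggests (hypothetical catalog name; PK_test stands in for the actual, auto-generated primary key index name on the test table; note that full-text matching works on word boundaries, so it covers word-prefix searches but not arbitrary substrings like the question's '%abcd%'):

```sql
-- Hypothetical full-text setup against the test table from the question.
-- PK_test is a stand-in for the real (auto-generated) PK index name.
CREATE FULLTEXT CATALOG ft_catalog;
CREATE FULLTEXT INDEX ON test (nv) KEY INDEX PK_test ON ft_catalog;
GO
-- CONTAINS matches whole words or word prefixes, not substrings.
SELECT COUNT(1) FROM test WHERE CONTAINS(nv, N'"abcd*"');
```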

HTH


I've seen similar problems in SQL Server. There was a case where I was using parameterized queries, and my parameter was UTF-8 (default in .net) and the field was varchar (so not utf-8). What I ended up with was SQL Server converting every index value to utf-8 just to do a simple index lookup. This might be related, in that the entire string might be getting translated to another character set to do the comparison. Also, for nvarchar, "a" would be the same as "á", meaning that there's a lot more work going on to figure out whether 2 strings are equal in Unicode. You might also want to use full-text indexing, although I'm not sure whether that solves your problem.

Kibbee
  • Thanks Kibbee. The collation that was used was already accent sensitive and so it wasn't that particular cause. Also full text indexing doesn't work in my case because the strings I'm searching aren't on word boundaries. But thanks for helping. – Michael J Swart Jan 17 '11 at 20:48
  • -1 I really don't want to be negative, but everything stated in this answer is incorrect. .NET / Windows / SQL Server use UTF-16 Little Endian ("Unicode" in Microsoft-land). There is no UTF-8 unless you have a `byte[]` of those bytes; a string is UTF-16 LE, same as `NVARCHAR` (and `XML`) in SQL Server. Your issue was `VARCHAR` data using a SQL Server Collation (one starting with `SQL_`) in the index and comparing that to an `NVARCHAR` string. That combination requires an implicit conversion due to 2 different sorting algorithms. `VARCHAR` data with a Windows Collation wouldn't do that. (cont) – Solomon Rutzky Jun 05 '17 at 17:05
  • Also, `'a'` and `'á'` are not the same in `NVARCHAR`. Whether or not they equate is determined by the accent-sensitivity option (i.e. `_AI` vs `_AS` in the name) of each particular Collation. And they can be deemed as being either the same or different for both `VARCHAR` and `NVARCHAR`. Try the following to see them as being equal as `VARCHAR` data using a deprecated SQL Server Collation: `SELECT 1 WHERE 'a' = 'á' COLLATE SQL_Latin1_General_CP1_CI_AI;`. And to clarify: you could have used a `VARCHAR` param in your query to fix it; it converted to `NVARCHAR` due to datatype precedence. – Solomon Rutzky Jun 05 '17 at 17:21
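The fix Solomon describes in the last comment can be sketched like this (hypothetical query and value; the point is typing the parameter as varchar so datatype precedence doesn't force an implicit conversion of the indexed varchar column):

```sql
-- Hypothetical sketch: declare the parameter as varchar, not nvarchar,
-- so the varchar column `v` is compared directly instead of being
-- implicitly converted to nvarchar (datatype precedence converts the
-- lower-precedence side, i.e. the column, when the types differ).
EXEC sp_executesql
    N'SELECT COUNT(1) FROM test WHERE v = @p',
    N'@p varchar(36)',      -- varchar parameter type
    @p = 'ABCD1234EF';      -- hypothetical value
```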

It's because the sorting rules for Unicode characters are more complicated than the sorting rules for non-Unicode characters.

But things are not as simple as varchar vs. nvarchar.

You also have to consider SQL Collation vs Windows Collation as explained here.

SQL Server performs string comparisons of non-Unicode data defined with a Windows collation by using Unicode sorting rules. Because these rules are much more complex than non-Unicode sorting rules, they are more resource-intensive. So, although Unicode sorting rules are frequently more expensive, there is generally little difference in performance between Unicode data and non-Unicode data defined with a Windows collation.

As stated, with a Windows collation SQL Server uses Unicode sorting rules even for varchar, so you get no performance gain.

Here is an example:

-- Server default collation is Latin1_General_CI_AS
create table test
(
    testid int identity primary key,
    v varchar(36) COLLATE Latin1_General_CI_AS, --windows collation
    v_sql varchar(36) COLLATE SQL_Latin1_General_CP1_CI_AS, --sql collation
    nv nvarchar(36),
    filler char(500)
)
go

set nocount on
set statistics time off
insert test (v, nv)
select CAST (newid() as varchar(36)),
    CAST (newid() as nvarchar(36))
go 1000000

set statistics time on

-- search varchar string (SQL collation)
select COUNT(1) from test where v_sql like '%abcd%' option (maxdop 1)
-- CPU time = 625 ms,  elapsed time = 620 ms.

-- search varchar string (Windows collation)
select COUNT(1) from test where v like '%abcd%' option (maxdop 1)
-- CPU time = 3141 ms,  elapsed time = 3389 ms.

-- search varchar string with a unicode pattern (uses CONVERT_IMPLICIT)
select COUNT(1) from test where v like N'%abcd%' option (maxdop 1)
-- CPU time = 3203 ms,  elapsed time = 3209 ms.

-- search nvarchar string
select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
-- CPU time = 3156 ms,  elapsed time = 3151 ms.

As you can see, there is no difference between varchar and nvarchar with a Windows collation.

Note: It seems that SQL collations are only included for legacy purposes and should not be used for new projects (even if they seem to have better performance here).

Gudradain