Affecting performance with SQL Collation

Question

I would just like to check a few things:

Q1) Latin1_General_CI_AS is Case Insensitive, Accent Sensitive : I.e. SQL will see the following as equal - "hello" and "HELLO"

With LINQ I quiet often do:

db.Where(v => v.Email == "some email".ToLower())

Q2) Assuming my understanding to Q1 is correct, am I just wasting processing time calling ToLower() in my queries?

Q3) Does anybody know if there would be a performance improvement in using Latin1_General_bin over Latin1_General_CI_AS? I.e have there already been performance tests done on a blog etc (thought of this as I was writing the post, so not looked my self yet)

Collation doesn't just affect those lookups it also affects sort order. Do your queries ever use `order by` on these columns? If so what semantics do you expect for those? — Martin Smith, May 26 '11 at 22:16
Generally not no. To be honest, it depends on the nature of the query. Am I correct in saying that the Case/Accent sensitivity is the thing that affects the sort order? — Stuart Blackler, May 26 '11 at 22:26
`Latin1_General_bin` would sort `A,B,a,b` (uppercase first) and `Latin1_General_CI_AS` as `A,a,B,b`. I can't remember all the gory details about how collations affect sorting. — Martin Smith, May 26 '11 at 22:40
Per [this other question's answer](http://stackoverflow.com/a/234639/1796930): *"Microsoft has optimized `ToUpperInvariant()`, not `ToUpper()`. The difference is that invariant is more culture friendly..."* — Alexander, Nov 10 '12 at 07:38

score 5 · Accepted Answer · answered May 26 '11 at 22:17

In general SQL comparisons are case insensitive.
However there are exceptions, e.g. in MySQL if you use a binary varchar comparisons will be case sensitive.

So your ToLower might not be a complete waste of time.

The Latin1_General_bin is case sensitive.
Whereas the Latin1_General_CI_AS is not.

Case sensitive comparison will be faster in the database, but you pay a price if you want to match "some email" to "Some email" you will have to cast to lowercase, losing all that speed gain.
I haven't timed it but I don't think it is worth the hassle.
I recommend smart use of indexes and queries before this micro-optimizations.

-- Premature optimization is the root of all evil, Donald Knuth.

score 3 · Answer 2 · answered May 27 '11 at 06:34

Performance on real example: Table Adres consists 320K rows of data. We need Adres.Id when we have Email (like in your example).

Database (and table Adres) collation is SQL_Latin1_General_CP1_CI_AS

For performance optimization, non-clustered index was created on column Email(Adres.Id column is included)

Queryies look like:

SELECT  Adres.ID,Email FROM  csc.Adres WHERE EMAIL ='23LMDLh6N@f8CyB7vPL.r4L'

SELECT  Adres.ID,Email FROM  csc.Adres WHERE EMAIL='23LMDLh6N@F8CyB7vPL.r4L' COLLATE Latin1_General_bin

1 row was returned for every query

results: enter image description here

It seems that in second case query is not being identified as SARG by SQL Server. Why? Let us look at details. In first case:

 ScalarOperator ScalarString="CONVERT_IMPLICIT(nvarchar(4000),[@1],0)

And in second:

ScalarOperator ScalarString="CONVERT_IMPLICIT(nvarchar(80),[CSCENTRUMTest].[csc].[Adres].[Email],0)=CONVERT_IMPLICIT(nvarchar(4000),CONVERT(varchar(8000),[@1],0),0)">

So in second case Email is converted to required collation. This case is not SARG and index scan was performed.

If queries can not be identified as SARG (e.g. LIKE '%some email%)', plans are the same.

Assuming: if your query can be identified like SARG and you have appropriate index, no-collation is preferred (it is better to do collation conversation on client/service side).

You can find SARG information in different performance tuning books/articles.

Affecting performance with SQL Collation

2 Answers2

Linked