3

I would just like to check a few things:

Q1) Latin1_General_CI_AS is Case Insensitive, Accent Sensitive : I.e. SQL will see the following as equal - "hello" and "HELLO"

With LINQ I quiet often do:

db.Where(v => v.Email == "some email".ToLower())

Q2) Assuming my understanding to Q1 is correct, am I just wasting processing time calling ToLower() in my queries?

Q3) Does anybody know if there would be a performance improvement in using Latin1_General_bin over Latin1_General_CI_AS? I.e have there already been performance tests done on a blog etc (thought of this as I was writing the post, so not looked my self yet)

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
Stuart Blackler
  • 3,732
  • 5
  • 35
  • 60
  • Collation doesn't just affect those lookups it also affects sort order. Do your queries ever use `order by` on these columns? If so what semantics do you expect for those? – Martin Smith May 26 '11 at 22:16
  • Generally not no. To be honest, it depends on the nature of the query. Am I correct in saying that the Case/Accent sensitivity is the thing that affects the sort order? – Stuart Blackler May 26 '11 at 22:26
  • `Latin1_General_bin` would sort `A,B,a,b` (uppercase first) and `Latin1_General_CI_AS` as `A,a,B,b`. I can't remember all the gory details about how collations affect sorting. – Martin Smith May 26 '11 at 22:40
  • Per [this other question's answer](http://stackoverflow.com/a/234639/1796930): *"Microsoft has optimized `ToUpperInvariant()`, not `ToUpper()`. The difference is that invariant is more culture friendly..."* – Alexander Nov 10 '12 at 07:38

2 Answers2

5

In general SQL comparisons are case insensitive.
However there are exceptions, e.g. in MySQL if you use a binary varchar comparisons will be case sensitive.

So your ToLower might not be a complete waste of time.

The Latin1_General_bin is case sensitive.
Whereas the Latin1_General_CI_AS is not.

Case sensitive comparison will be faster in the database, but you pay a price if you want to match "some email" to "Some email" you will have to cast to lowercase, losing all that speed gain.
I haven't timed it but I don't think it is worth the hassle.
I recommend smart use of indexes and queries before this micro-optimizations.

-- Premature optimization is the root of all evil, Donald Knuth.

Johan
  • 74,508
  • 24
  • 191
  • 319
3

Performance on real example: Table Adres consists 320K rows of data. We need Adres.Id when we have Email (like in your example).

Database (and table Adres) collation is SQL_Latin1_General_CP1_CI_AS

For performance optimization, non-clustered index was created on column Email(Adres.Id column is included)

Queryies look like:

SELECT  Adres.ID,Email FROM  csc.Adres WHERE EMAIL ='23LMDLh6N@f8CyB7vPL.r4L'

SELECT  Adres.ID,Email FROM  csc.Adres WHERE EMAIL='23LMDLh6N@F8CyB7vPL.r4L' COLLATE Latin1_General_bin

1 row was returned for every query

results:enter image description here

It seems that in second case query is not being identified as SARG by SQL Server. Why? Let us look at details. In first case:

 ScalarOperator ScalarString="CONVERT_IMPLICIT(nvarchar(4000),[@1],0)

And in second:

ScalarOperator ScalarString="CONVERT_IMPLICIT(nvarchar(80),[CSCENTRUMTest].[csc].[Adres].[Email],0)=CONVERT_IMPLICIT(nvarchar(4000),CONVERT(varchar(8000),[@1],0),0)">

So in second case Email is converted to required collation. This case is not SARG and index scan was performed.

If queries can not be identified as SARG (e.g. LIKE '%some email%)', plans are the same.

Assuming: if your query can be identified like SARG and you have appropriate index, no-collation is preferred (it is better to do collation conversation on client/service side).

You can find SARG information in different performance tuning books/articles.

Dalex
  • 3,585
  • 19
  • 25