
I'm having a problem with some SQL Server queries. It turns out I have a table with "Attribute_Name" and "Attribute_Value" fields, where the value can be of any type, stored as varchar. (Yeah... I know.)

All the dates for a particular attribute seem to be stored in the "YYYY-MM-DD hh:mm:ss" format (not 100% sure about that; there are millions of records here), so I can execute this code without problems:

select /*...*/ CONVERT(DATETIME, pa.Attribute_Value)
from 
    ProductAttributes pa
    inner join Attributes a on a.Attribute_ID = pa.Attribute_ID
where 
    a.Attribute_Name = 'SomeDate'

However, if I execute the following code:

select /*...*/ CONVERT(DATETIME, pa.Attribute_Value)
from 
    ProductAttributes pa
    inner join Attributes a on a.Attribute_ID = pa.Attribute_ID
where 
    a.Attribute_Name = 'SomeDate'
    and CONVERT(DATETIME, pa.Attribute_Value) < GETDATE()

I get the following error: Conversion failed when converting date and/or time from character string.

How come it fails on the WHERE clause and not on the SELECT one?

Another clue:

If, instead of filtering by the Attribute_Name, I use the actual Attribute_ID stored in the database (the PK), it works without a problem.

select /*...*/ CONVERT(DATETIME, pa.Attribute_Value)
from 
    ProductAttributes pa
    inner join Attributes a on a.Attribute_ID = pa.Attribute_ID
where 
    a.Attribute_ID = 15
    and CONVERT(DATETIME, pa.Attribute_Value) < GETDATE()

Update: Thanks everyone for the answers. I found it hard to choose a correct answer because everyone pointed out something useful for understanding the issue. It definitely had to do with the order of execution. It turns out my first query worked because the WHERE clause was executed first, then the SELECT. My second query failed for the same reason (since the attributes were not filtered, the conversion failed while executing that same WHERE clause). My third query worked because the ID was part of an index (the PK), so that condition took precedence and results were narrowed down on it first.

Thanks!

Alpha
  • Is there another attribute called `sOmeDaTe` – in case you are using a case-insensitive collation? It's probably messing up your joins. – YetAnotherUser Aug 31 '11 at 21:17
  • You seem to be assuming some sort of short-circuiting evaluation or guaranteed ordering of the predicates in the `WHERE` clause. This is not guaranteed. When you have mixed datatypes in a column like that, the only safe way of dealing with them is with a `CASE` expression. – Martin Smith Aug 31 '11 at 21:18
  • What table is PHA? If PHA is a different table than PA, then it would appear that PHA's data has some unconvertible records, whereas PA's does not. – NoAlias Aug 31 '11 at 21:19
  • I know why you're not 100% sure - if you use a `VARCHAR` column to store date/time values, anyone is free to store any old junk in there that they want. For EAV I've always preferred separating these out into proper columns for at least the three base types, a string attribute, a date/time attribute, and a numeric attribute. It isn't perfect and it isn't a free trade-off, but you have a much better shot at ensuring invalid data stays out of your database. – Aaron Bertrand Aug 31 '11 at 21:25
  • @YetAnotherUser Just checked, the column collation is SQL_Latin1_General_CP1_CI_AS, so I guess it could be the case, but I didn't find any other differently-cased value. Good shot! – Alpha Aug 31 '11 at 21:27
  • @Martin Smith Not really, I'm actually abstracting myself from that point of view, so I would expect it to work or fail no matter where the comparisons are made. That's what motivated my question. – Alpha Aug 31 '11 at 21:28
  • @NoAlias Sorry. I'm intentionally obscuring the real structure of the DB for two reasons: 1. keeping it simple to show the problem I've got; 2. not disclosing confidential information that I'm not legally allowed to. I already fixed that in the queries above. – Alpha Aug 31 '11 at 21:29
  • @Aaron Bertrand I agree, and I do know that there are some non-date values in this column, but I'm setting up the filters to select what is supposed to be only datetime (and the fact that one of the queries doesn't fail should confirm they can all be correctly converted to datetime). It isn't as bad as it sounds, since the values come from an automated system, so I can expect them to be valid... but I am aware of the risk. – Alpha Aug 31 '11 at 21:31
  • @Alpha - Not sure why you say you're abstracted from that POV. `Attribute_Value` contains non dates, you are casting that column to a date. You must believe that some form of pre-filtering is guaranteed. – Martin Smith Aug 31 '11 at 21:36
  • @Martin Smith I just read Remus' article and I finally understood what you said. You're right, it totally makes sense. – Alpha Aug 31 '11 at 21:42

6 Answers


You seem to be assuming some sort of short-circuiting evaluation or guaranteed ordering of the predicates in the WHERE clause. This is not guaranteed. When you have mixed datatypes in a column like that, the only safe way of dealing with them is with a CASE expression.

Use (e.g.)

CONVERT(DATETIME, 
      CASE WHEN ISDATE(pa.Attribute_Value) = 1 THEN pa.Attribute_Value END)

Not

CONVERT(DATETIME, pa.Attribute_Value)
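On SQL Server 2012 and later there is also `TRY_CONVERT`, which returns NULL instead of raising an error when a value won't convert; a hedged sketch of the query rewritten with it (note the question predates 2012, so this may not be an option here):

```sql
-- TRY_CONVERT (SQL Server 2012+) yields NULL for unconvertible values
-- instead of raising a conversion error, so the predicate is safe
-- regardless of evaluation order:
select TRY_CONVERT(DATETIME, pa.Attribute_Value)
from
    ProductAttributes pa
    inner join Attributes a on a.Attribute_ID = pa.Attribute_ID
where
    a.Attribute_Name = 'SomeDate'
    and TRY_CONVERT(DATETIME, pa.Attribute_Value) < GETDATE()
```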
Martin Smith

If the conversion is in the WHERE clause, it may be evaluated for many more records (values) than it would be if it appeared only in the projection list. I have talked about this before in different contexts, see T-SQL functions do no imply a certain order of execution and On SQL Server boolean operator short-circuit. Your case is even simpler, but it is similar, and ultimately the root cause is the same: do not assume an imperative execution order when dealing with a declarative language like SQL.

Your best solution, by far, is to sanitize the data and change the column type to DATETIME or DATETIME2. All other workarounds will have one shortcoming or another, so you may be better off just doing the right thing.

Update

After a closer look (sorry, I'm @VLDB and only peeking at SO between sessions) I realize you have an EAV store with inherently type-free semantics (the Attribute_Value can be a string, a date, an int, etc.). My opinion is that your best bet is to use sql_variant in storage and all the way up to the client (i.e. project the sql_variant). You can coerce the type in the client; all client APIs have methods to extract the inner type from a sql_variant, see Using sql_variant Data (well, almost all client APIs... Using the sql_variant datatype in CLR). With sql_variant you can store multiple types without the problems of going through a string representation, you can use SQL_VARIANT_PROPERTY to inspect things like the BaseType of the stored values, and you can even add check constraints to enforce data type correctness.
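A minimal sketch of what that could look like (the table, constraint name, and the ID 15 for 'SomeDate' are illustrative, not from the question):

```sql
-- Hypothetical EAV table storing typed values in sql_variant:
CREATE TABLE ProductAttributes2 (
    Product_ID      int NOT NULL,
    Attribute_ID    int NOT NULL,
    Attribute_Value sql_variant NULL,
    -- Enforce that values for the date attribute really are datetimes:
    CONSTRAINT CK_SomeDate_IsDatetime CHECK (
        Attribute_ID <> 15
        OR CONVERT(sysname,
             SQL_VARIANT_PROPERTY(Attribute_Value, 'BaseType')) = N'datetime')
);

-- Inspect the stored base type of each value:
SELECT Attribute_ID,
       SQL_VARIANT_PROPERTY(Attribute_Value, 'BaseType') AS BaseType
FROM ProductAttributes2;
```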

Remus Rusanu
  • I'd be very hesitant to use `SQL_VARIANT` *unless* you're doing all presentation, filtering, and comparisons at the client. In our EAV system we quickly moved away from `SQL_VARIANT` in favor of dedicated columns for each type. Ok, so you have two NULLs in each row, but you don't have to deal with all the other nasty things that come with it. Just to give a fair shake at both sides, I blogged a bit about limitations here: https://sqlblog.org/2009/10/12/bad-habits-to-kick-using-the-wrong-data-type ... can you show your query if the column were `SQL_VARIANT`? – Aaron Bertrand Aug 31 '11 at 22:27
  • I see your point. Doing an aggregate over the sql_variant EAV structure would fall victim to the cast issues, while a dedicated column per type can easily aggregate the values because it knows they're all in the field for that type, and the type requires no CAST. Valid objection. – Remus Rusanu Aug 31 '11 at 22:55

This has to do with the order in which a SELECT query is processed. The WHERE clause is processed long before the SELECT; it has to determine which rows to include/exclude. The clause that uses the name must use a scan that investigates all rows, some of which do not contain valid date/time data, whereas the key probably leads to a seek, and none of the invalid rows are included at that point. The CONVERT in the SELECT list is performed last, and clearly by that time it is not going to try to convert invalid rows. Since you're mixing date/time data with other data, you may consider storing date or numeric data in dedicated columns with correct data types. In the meantime, you can defer the check in the following way:

SELECT /* ... */
FROM
(
  SELECT /* ... */
    FROM ProductAttributes AS pa
    INNER JOIN dbo.Attributes AS a
    ON a.Attribute_ID = pa.Attribute_ID
    WHERE a.Attribute_Name = 'SomeDate'
    AND ISDATE(pa.Attribute_Value) = 1
) AS z
WHERE CONVERT(CHAR(8), Attribute_Value, 112) < CONVERT(CHAR(8), GETDATE(), 112);

But the better answer is probably to use the Attribute_ID key instead of the name if possible.

Aaron Bertrand
  • This isn't guaranteed to work. The compute scalar in the `SELECT` list can be evaluated before the `WHERE` filter. See for example [this answer](http://stackoverflow.com/questions/5191701/tsql-divide-by-zero-encountered-despite-no-columns-containing-0/5203211#5203211) or [this connect item](http://connect.microsoft.com/SQLServer/feedback/details/537419/sql-server-should-not-raise-illogical-errors) – Martin Smith Aug 31 '11 at 21:27
  • No, this won't work. You're making the assumption that the order of declaration (subquery) implies order of evaluation as in http://rusanu.com/2011/08/10/t-sql-functions-do-no-imply-a-certain-order-of-execution/ QO can choose a plan that evaluates the CONVERT *before* the attribute_name comparison and trigger the conversion error. – Remus Rusanu Aug 31 '11 at 21:28
  • No, I should have said, "you can try to defer"... the better answer is to store the data in dedicated columns of the right data type, instead of stuffing everything into a varchar column. – Aaron Bertrand Aug 31 '11 at 21:30
  • @Remus how about if we don't force DATETIME conversion at all. We shouldn't get any bad data in the results because of the ISDATE() inside the subquery, and it's clear this column isn't indexed anyway, so sargeability isn't an issue... After all, what if the OP doesn't have the power to make the schema perfect, and has to deal with what he's got? – Aaron Bertrand Aug 31 '11 at 21:33
  • The `ISDATE()` *may* be moved by the QO out of the subquery and be evaluated *after* the CONVERT. The QO is free to do so. It would indeed not happen unless a very specific index lays a 'trap' too sweet for the QO to pass by. I've seen real product-support cases from apps that did something similar and were broken on upgrade to a new SQL version. There really isn't any reliable middle-of-the-road approach that I know of that imposes a strong type system (project a `DATETIME` column) on top of a flexible EAV storage like in the OP. `sql_variant` is imho the closest match (store and project `sql_variant`). – Remus Rusanu Aug 31 '11 at 22:06
  • But who cares where the ISDATE() happens? There is no longer any DATETIME conversion happening, so there will no longer be an error - at worst, this will lead to rows being filtered at a less optimal spot. We all agree that the best solution would be to change the data type, but that's not the only answer, and it's not necessarily an acceptable answer at all. – Aaron Bertrand Aug 31 '11 at 22:08
  • Yes, if the conversion is removed (no more strongly typed projection) then all problems related to the inappropriate CAST vanish. – Remus Rusanu Aug 31 '11 at 22:20

Seems like a data issue to me. Take a look at the data when you select it using the two different methods: try looking for distinct lengths, then select the items in the differing sets and eyeball them. Also check for NULLs (converting NULL to a datetime just returns NULL, so that won't raise the error, but it's worth knowing about).
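The checks above could be sketched like this, against the schema from the question:

```sql
-- Distinct value lengths for the attribute, to spot outliers that are
-- probably not "YYYY-MM-DD hh:mm:ss" strings:
SELECT LEN(pa.Attribute_Value) AS ValueLength, COUNT(*) AS Cnt
FROM ProductAttributes pa
    INNER JOIN Attributes a ON a.Attribute_ID = pa.Attribute_ID
WHERE a.Attribute_Name = 'SomeDate'
GROUP BY LEN(pa.Attribute_Value)
ORDER BY ValueLength;

-- NULL check (CONVERT(DATETIME, NULL) returns NULL rather than erroring,
-- but NULLs are still worth counting):
SELECT COUNT(*) AS NullCount
FROM ProductAttributes pa
    INNER JOIN Attributes a ON a.Attribute_ID = pa.Attribute_ID
WHERE a.Attribute_Name = 'SomeDate'
  AND pa.Attribute_Value IS NULL;
```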

kmcc049

I think the problem is you have a bad date in your database (obviously).

In your first example, where you aren't checking the date in the WHERE clause, all of the dates where a.Attribute_Name = 'SomeDate' are valid, so it never attempts to convert a bad date.

In your second example, the addition to the WHERE clause is causing the query plan to actually convert all of those dates, finding the bad one, before it ever looks at the attribute name.

In your third example, switching to Attribute_ID probably changes the query plan so that it first looks only for rows where the ID is 15, and then checks whether those records have a valid date, which they do. (Perhaps Attribute_ID is indexed and Attribute_Name isn't.)

So, you have a bad date somewhere, but it's not on any of the records with Attribute_ID = 15.
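A quick way to hunt down the offending rows could look like this (a sketch against the schema in the question; ISDATE is imperfect, but good enough for triage, and the selected key columns are assumptions to adapt to the real table):

```sql
-- Find values under 'SomeDate' that won't convert to DATETIME.
-- ISDATE(NULL) = 0, so NULLs show up here too:
SELECT pa.Attribute_ID, pa.Attribute_Value
FROM ProductAttributes pa
    INNER JOIN Attributes a ON a.Attribute_ID = pa.Attribute_ID
WHERE a.Attribute_Name = 'SomeDate'
  AND ISDATE(pa.Attribute_Value) = 0;
```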

PaulStock

You can check the execution plans. It might be that in the first query the second criterion (CONVERT(DATETIME, pa.Attribute_Value) < GETDATE()) gets evaluated first, over all rows including ones with invalid (non-date) data, while in the second query a.Attribute_ID = 15 gets evaluated first, thus excluding the rows with non-date values.

Btw, the second one might also be faster, and if you don't need anything from Attributes in the select list, you can get rid of the inner join Attributes a on a.Attribute_ID = pa.Attribute_ID altogether.

On that note, it would be advisable to get rid of the EAV while it's not too late :)

nad2000
  • You can try recalculating the table stats. If `ProductAttributes` contains millions of rows, evaluating `CONVERT(DATETIME, pa.Attribute_Value) < GETDATE()` for every row is highly inefficient. Try: `UPDATE STATISTICS ProductAttributes; UPDATE STATISTICS Attributes;` and run the first query again. – nad2000 Aug 31 '11 at 21:38