SQL Server optimal query based on value distribution

Question

I have a table that stores dynamic user data in a key-value pair format. Something like this:

UserId | Key       | Value
---------------------------------
1      | gender    | male
1      | country   | Australia
2      | gender    | male
2      | country   | US
3      | gender    | female
3      | country   | Spain

Now, I need to select the users that have certain parameters, for example: gender is 'male' AND country is 'US'. Or more general:

key1=value1 AND key2=value2 AND key3=value3 AND ...

To do so, the fastest way I found is to do the following:

WHERE key=(key1) 
AND   value=value1
AND   EXISTS(SELECT 1
             FROM (...)
             WHERE key=key2
             AND   value=value2)
AND   EXISTS(SELECT 1
             FROM (...)
             WHERE key=key3
             AND   value=value3)
AND   EXISTS(...)

In that case, I'll get the best result if the first WHERE filter is for the one whose values are more evenly and segregated.

For example, 'gender' can have 99% males and 1% females and country can partitionate the entire population in 100 similar parts. In that case I would need to filter by country first and use EXIST for the gender condition.

Question: Is there any way in SQL Server 2008 R2 to get the index statistics to find what clause is better to put first (not in the EXISTS basically) ?.

Alternative question: I think that that's the best approach, but a way to rewrite that query to be always optimal can be the solution too.

SOLUTION INFORMATION:

The correct solution is the one explained by @usr below (using INTERSECT). Actually it seems I was doing something wrong, the EXISTS was being resolved correctly too by the engine. To provide more info, I am sharing the IO and TIME statistics as well as the execution plan for the options tested:

Using INTERSECT:

Table 'PERFTEST'. Scan count 2, logical reads 113, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms,  elapsed time = 2 ms.

Query plan using INTERSECT

Using EXISTS:

Table 'PERFTEST'. Scan count 2, logical reads 113, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms,  elapsed time = 3 ms.

Query plan using EXISTS

(Note the extra Stream Aggregate step)

Using INNER JOIN:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'PERFTEST'. Scan count 2, logical reads 113, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 31 ms,  elapsed time = 25 ms.

Query plan using INNER JOIN

CONCLUSION:

INTERSECT is slightly faster in this case that EXISTS. The INNER JOIN option is way slower.

Thanks @Tim3880, yes I'm assuming the sub queries you mentioned. I've just omitted the "AND sub.userid = table.userid" for simplicity. — Diego Jancic, May 15 '15 at 21:28
@Tim3880 Thinking again, not sure that that assumption is wrong. Using `WHERE key=... and value=... and EXISTS(...)` produces results totally different than using 2 `EXISTS`. Please note that I have a cluster index in key and value. (the rows returned are the same, the query plan and performance is difference) — Diego Jancic, May 15 '15 at 21:36

usr · Accepted Answer · 2015-05-15T21:39:33.173

3

"what clause is better to put first"

The optimizer does exactly that for you. Queries are not evaluated as written. EXISTS is translated to a join and undergoes the usual join reordering optimizations. Statistics are being used to drive that process. It's not perfect but usually good.

Use OPTION (RECOMPILE) to get a plan that is adjusted for the specific search parameters you are using.

select UserID from T where Condition1
intersect select UserID from T where Condition2
intersect select UserID from T where Condition3

edited May 15 '15 at 21:39

answered May 15 '15 at 21:17

usr

168,620
35
240
369

I would say that that's not true. Using `WHERE key=... and value=... and EXISTS(...)` produces results totally different than using 2 `EXISTS`. Please note that I have a cluster index in key and value. (the rows returned are the same, the query plan and performance is difference) – Diego Jancic May 15 '15 at 21:35
OK, the optimizer seems to not do its job here. Try the query pattern that I just edited. If you want more analysis post actual query plans. – usr May 15 '15 at 21:40
I think `INTERSECT` is the most appropriate and efficient way to do this kind of query. – Vladimir Baranov May 16 '15 at 11:12
Added my results in the question. The `intersect` option was the fastest. Thank you! – Diego Jancic May 18 '15 at 14:27

SQL Server optimal query based on value distribution

1 Answers1