Faster way compare changes between SQL sub-queries?

Question

I need to check for the change in the color of cats between two dates.

I have the query below:

WITH
cats_prior AS (SELECT IDTAG,Color FROM  CATS WHERE APPOINTMENT = '06/30/2019'),
cats_now AS (SELECT IDTAG,Color FROM  CATS WHERE APPOINTMENT = '08/31/2019')

SELECT cats_prior.IDTAG, cats_prior.Color,cats_now.Color
FROM cats_prior
JOIN cats_now on cats_prior.IDTAG = cats_now.IDTAG
WHERE cats_prior.Color != cats_now.Color

It works but it takes 11 minutes and there are around 15 million cats in that table.

Is there another way to do this? or a way to make this faster?

This is SQL Server.

How is nobody asking how the color of a cat changes in 2 months? Hah! — avery_larry, Oct 23 '19 at 18:54

Gordon Linoff · Answer 1 · 2019-10-23T18:36:00.040

3

I would try aggregation with a HAVING clause:

SELECT IDTAG
FROM  CATS
WHERE APPOINTMENT IN ('2019-08-31', '2019-06-30')
GROUP BY IDTAG
HAVING MIN(COLOR) <> MAX(COLOR);

An index on CATS(APPOINTMENT, IDTAG, COLOR) would help. This index might also speed up your version of the query.

edited Oct 23 '19 at 18:36

answered Oct 23 '19 at 18:30

Gordon Linoff

1,242,037
58
646
786

This is pretty slick. It's the fastest so far. I'll mark it as the answer after a few more hours if nothing faster comes along. Thanks – DNS_Jeezus Oct 23 '19 at 20:17
@DNS_Jeezus . . . What are the timings? – Gordon Linoff Oct 23 '19 at 20:22
I thought this could be faster than my answer, but I am also curious about the timings. – avery_larry Oct 23 '19 at 20:44
I actually don't know anymore. There's some wierd caching or something going on. avery's is faster today. But after the strange caching. avery's ran on average in 8 seconds gordon's ran in 12 seconds my old query ran in 23 seconds. – DNS_Jeezus Oct 24 '19 at 15:00
@DNS_Jeezus . . . The performance of the query is going to depend on many factors. One of the interesting factors is how many `IDTAG`s there are on a given day. More or fewer will definitely have an impact on which is best. – Gordon Linoff Oct 24 '19 at 18:42

avery_larry · Accepted Answer · 2019-10-23T18:41:36.993

1

Just another way to try the same thing:

select distinct cp.IDTAG
from CATS cp
inner join CATS cn
   on cp.IDTAG = cn.IDTAG
   and cp.color <> cn.color
where cp.APPOINTMENT = '06/30/2019'
   and cn.APPOINTMENT = '08/31/2019'

You can check performance against your data.

edited Oct 23 '19 at 18:41

answered Oct 23 '19 at 18:32

avery_larry

2,069
1
5
17

DISTINCT usually causes longer execution time. Also cp.IDTAG should be used after the "select", alias missing. – Eray Balkanli Oct 23 '19 at 18:38
I would presume that duplicates are not wanted. That leaves group by or distinct, which I believe perform nearly the same. If the OP doesn't care about duplicates (or can otherwise guarantee no cat has multiple appointments on those days) then by all means distinct can be removed. – avery_larry Oct 23 '19 at 18:43
https://stackoverflow.com/questions/7943957/huge-performance-difference-when-using-group-by-vs-distinct – Eray Balkanli Oct 23 '19 at 18:44
Did you read the post you linked? Distinct ended up faster than group by (barely). Regardless, the OP can check both options for the best performance. select cp.idtag ... group by cp.idtag. In this simple query I would be surprised if the difference is significant, though it clearly could be. – avery_larry Oct 23 '19 at 18:48

Faster way compare changes between SQL sub-queries?

2 Answers2