Optimize self join on millions of rows

Question

I have a table which is a link table from objects in my SQL Server 2012 database, (annonsid, annonsid2). This table is used to create chains of triangle or even rectangles to see who can swap with who.

This is the query I use on the table Matching_IDs which has 1,5 million rows in it, producing 14 million possible chains using this query:

SELECT COUNT(*)
FROM Matching_IDs AS m
  INNER JOIN Matching_IDs AS m2
     ON m.annonsid2 = m2.annonsid
  INNER JOIN Matching_IDs AS m3
     ON m2.annonsid2 = m3.annonsid
       AND m.annonsid = m3.annonsid2

I must improve performance to take maybe 1 second or less, Is there a faster way to do this? The query takes about 1 minute on my computer. I normally use a WHERE m.annonsid=x, but it takes just the same amount of time, cause it has to go through all possible combinations anyway.

Update: the latest query plan

|--Compute Scalar(DEFINE:([Expr1006]=CONVERT_IMPLICIT(int,[globalagg1011],0)))
   |--Stream Aggregate(DEFINE:([globalagg1011]=SUM([partialagg1010])))
        |--Parallelism(Gather Streams)
             |--Stream Aggregate(DEFINE:([partialagg1010]=Count(*)))
                  |--Hash Match(Inner Join, HASH:([m2].[annonsid2], [m2].[annonsid])=([m3].[annonsid], [m].[annonsid2]), RESIDUAL:([MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m2].[annonsid2]=[MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m3].[annonsid] AND [MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m].[annonsid2]=[MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m2].[annonsid]))
                       |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m2].[annonsid2], [m2].[annonsid]))
                       |    |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133207] AS [m2]))
                       |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m3].[annonsid], [m].[annonsid2]))
                            |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([m].[annonsid])=([m3].[annonsid2]), RESIDUAL:([MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m].[annonsid]=[MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m3].[annonsid2]))
                                 |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m].[annonsid]), ORDER BY:([m].[annonsid] ASC))
                                 |    |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133152] AS [m]), ORDERED FORWARD)
                                 |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m3].[annonsid2]), ORDER BY:([m3].[annonsid2] ASC))
                                      |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133207] AS [m3]), ORDERED FORWARD)

What indexes do you have defined on the table? An index `(annonsid2,annonsid)` may be a good option. — Oded, Dec 29 '12 at 22:38
I have a clustered index of both columns, and one non-clustered for every column (2) — infinity1975, Dec 29 '12 at 22:39
Well, it takes about 2 seconds from when i Press run until it starts to show results, but it doesn't mean the query is that fast right?, cause when I just do a Count(m.annonsid) instead of getting all the records, its taking around 1 minute. — infinity1975, Dec 29 '12 at 22:41
Here is a screenshot of my execution plan: [link](http://www.supportcaddy.com/executionplan.png) — infinity1975, Dec 29 '12 at 22:45
This is Estimated execution plan. Use `SET SHOWPLAN_TEXT ON` to get execution plan and post here. — Hamlet Hakobyan, Dec 29 '12 at 22:47
Runned execution plan (1 minute), [link](http://www.supportcaddy.com/executionplan2.png) — infinity1975, Dec 29 '12 at 22:51
Nothing new. Have you set show text plan option? And in your estimated plan diagram i haven't see where clause? — Hamlet Hakobyan, Dec 29 '12 at 22:54
ok sorry, a bit new to this, here it comes: [link](http://www.supportcaddy.com/debug.txt) — infinity1975, Dec 29 '12 at 22:59
The query test does not match the execution plan. Please make them consistent. — usr, Dec 29 '12 at 23:41
it does now, it was just count(*) instead of selecting columns. — infinity1975, Dec 29 '12 at 23:46
just for a try SELECT COUNT(*) FROM Matching_IDs AS m INNER JOIN Matching_IDs AS m2 ON m.annonsid2 = m2.annonsid Where Exist(Select * from Matching_IDs m3 where m2.annonsid2 = m3.annonsid AND m.annonsid = m3.annonsid2) — bummi, Dec 30 '12 at 00:20
What does "who can swap with who" mean? Could you provide some sample data please, and describe what it MEANS? When you are not just doing `Count(*)`, what columns are you pulling? Without truly understanding the full context of what you're doing, our brains cannot activate effectively to be sure we're giving you the best answer. Sometimes, answers that you couldn't have imagined are possible, but only when the experts are fully aware of the whole problem space rather than just the tiny slice you've presented. — ErikE, Dec 30 '12 at 06:15
Erik E, I understand. What I am having is a table with 2 ids (id1, id2), this is id of an ad and points that id1 is interested in swapping with id2. This way I can build direct swaps between two ad's (2 ids), triangle swaps like a triangle with three corners where id1 gets id2, who gets id3 who gets id1. same in a rectangle swap, id1 > id2 > id3 > id4 > id1. Do you understand? — infinity1975, Dec 30 '12 at 11:38
Please tag with the @ symbol or people will not get notified. Why is swapping always forward--for every `id,id2` pair is `id2,id` in the table? What are you doing with the results? If you have ads 1, 2, 3, and 4 you have rows `1,2`, `2,3`, `3,4` and `4,1` (right?) but why not `2,4` or `4,2`? Something here is triggering my instincts that there could be a different way to accomplish your purpose. — ErikE, Dec 31 '12 at 08:17
@infinity1975 I wish you would answer, this sounds like a fun/interesting puzzle. — ErikE, Jan 02 '13 at 08:25
1.5 million is nothing. We need a number that self joins a 40 million rows (and counting) table. Obviously it's not working too well. :/ — Sleeper Smith, Jan 10 '14 at 04:21

score 3 · Answer 1 · answered Dec 30 '12 at 09:47

3

Some ideas:

Try two indexes (annonsid,annonsid2) and (annonsid2,annonsid)

Have you tried a column store index? It makes the table read only but it might improve performance.

Also, some variations of the query could help. Examples:

SELECT COUNT(*)
FROM Matching_IDs AS m
  INNER JOIN Matching_IDs AS m2
     ON m.annonsid2 = m2.annonsid
  INNER JOIN Matching_IDs AS m3
     ON m2.annonsid2 = m3.annonsid
where m.annonsid = m3.annonsid2

or

SELECT COUNT(*)
FROM Matching_IDs AS m, Matching_IDs AS m2, Matching_IDs AS m3
where m2.annonsid2 = m3.annonsid
  and m.annonsid2 = m2.annonsid
  and m.annonsid = m3.annonsid2

Did you check the CPU/IO-Load? If IO-Load is high, then the server is not crunching numbers but swapping => more RAM solves the problem.

How fast is this query?

SELECT COUNT(*)
FROM Matching_IDs AS m
  INNER JOIN Matching_IDs AS m2
     ON m.annonsid2 = m2.annonsid

If this is very fast but adding the next join slows thing down then you propably need more RAM.

answered Dec 30 '12 at 09:47

alzaimar

4,572
1
16
30

#1 and #2 is similair (22 seconds) same as my normal query, i managed to use a distinct table which went down from 1 minute to 22 seconds. The 3rd query is very fast, 2 seconds. I have 8gb ram. – infinity1975 Dec 30 '12 at 11:35
Actually what I really want is to remove the 1 to 1 id table, but I have no clue on how I can know which ad matches which in an easy way. I thought about having alot of bigint's and using bitwise and between the ads. This is narrowing the search down to 15.000 ads (rows) instead of 1,5 million. BUT, the end result will anyway be 15 million so it might not be worth it. – infinity1975 Dec 30 '12 at 11:44
What do you mean by 1 to 1 id table? Regarding performance: I just did a test on my laptop. Counting 6 mio rows takes a while (20 secs) and so I doubt you can get much faster. – alzaimar Dec 30 '12 at 23:35
It's a table with 2 id's pointing left to right, id1 wants id2. – infinity1975 Dec 31 '12 at 09:34
How about adding `and m.AnnonId != m.AnnonId2`? Do this with all three tables (m, m1, m2). It might help a bit, but they get sorted out anyway, as you said. – alzaimar Dec 31 '12 at 11:37

score 1 · Answer 2 · answered Dec 29 '12 at 23:54

It seems like you already indexed this quite well. You can try converting the hash to a merge join by adding the right multi-column index, but it won't give you the desired speedup of 60x.

I think this index would be on annonsid, annonsid2 although I might have made a mistake here.

It would be nice to materialize all of this but indexed views do not support self-joins. You can try to materialize this query (unaggregated) into a new table. Whenever you execute DML against the base table, also update the second table (using either application logic or triggers). That would allow you to query blazingly fast.

score 1 · Answer 3 · answered Dec 30 '12 at 07:36

You should make this query a bit more separated. I think first you should create a table, where you can store the primary key + annonsid, annonsid2 -if annosid is not the primary key itself.

DECLARE @AnnonsIds TABLE
(
primaryKey int,
-- if you need later more info from the original rows like captions  
-- AND it is not (just) the annonsid
annonsid int,
annonsid2 int
)

If you declare a table, and you have index on this column, it is quite fast to get the specified rows by the WHERE annonsid = @annonsid OR annonsid2 = @annosid

After the first step you have a much smaller (I guess), and "thin" table to work with. Then you just have to use the joins here or make a temp table and a CTE on it.

I think it should be faster, depending on the selectivity of the condition in the WHERE, of yourse if 1.1 million rows fits in it, then it does not make sense, but if just a couple hundred or tousend, then you should give it a try!

score 0 · Answer 4 · edited Dec 31 '12 at 11:52

0

You could denormalize the data by adding a table RelatedIds with AnnonsId, RelatedAnnonId and Distance. For every value of AnnonsId the table would contains rows for each RelatedAnnonId and the number of relations that need to be traversed to reach it, aka Distance. Triggers on the existing MatchingIds table would maintain the new table with some configured maximum value for Distance, e.g. 3 to handle rectangular shares. Index the table on (AnnonsId, Distance).

Edit: An index on (Distance, AnnonsId) will allow you to quickly find rows that have enough related entries to form a particular shape. Adding a column for MaxDistance may be useful if you want to be able to exclude rows based on, for example, having a triangular but not rectangular relationship.

The new query would inner join RelatedIds as RI on RI.AnnonsId = m.AnnonsId and RI.Distance <= @MaxDistance with the desired "shape" dictating the value of @MaxDistance.

It should provide much better performance on the select. Downsides are another table with a large number of rows and overhead of the triggers when altering the MatchingIds table.

Example: There are two entries in Matching_IDs: (1,2) and (2,3). The new table would contain 3 entries:
1-> 2: distance = 1
1-> 3: distance = 2 (it takes one intermediate 'node' to go from 1 to 3)
2-> 3: distance = 1

adding one more entry to the matching ids (3,1) would result in another entry:
1-> 1: distance = 3

And voilá: You found a triangle (distance=3).

Now, to find all triangles, simply do this:

select * 
  from RelatedIds 
 where AnnonsId=RelatedAnnonId 
   and Distance=3

edited Dec 31 '12 at 11:52

alzaimar

4,572
1
16
30

answered Dec 30 '12 at 19:46

HABO

15,314
5
39
57

this was way to complex for me to understand :D, What I already have is a table relatedids, this is the table "Matching_IDs" who have (annonsid, annonsid2), What I dont understand is the "Distance" and how to get it?, and what the triggers will do. – infinity1975 Dec 31 '12 at 09:38
The problem persist I think?, the problem is the performance, not to create the outcome. This trigger is just the same as using my query by providing an id making it run on only that id, and then save the result in a second table. Am I wrong? – infinity1975 Jan 02 '13 at 10:41
@infinity1975 - The difference is that the trigger divides the heavy lifting into a relatively small amount of work every time the base table is changed. A downside is that the work is always performed, even if you never query for "shapes". The upside is that the "shapes" query should be much faster. – HABO Jan 02 '13 at 14:31
Yes, I have a solution for this, except the distance parameter. I have one table for directswaps and one for triangleswaps. – infinity1975 Jan 02 '13 at 17:21
I have one table for directswaps and one for triangleswaps. But I was out for a solution making the first select faster. It's still a time consuming task. – infinity1975 Jan 02 '13 at 18:06
@infinity1975 - I keep getting the feeling that you don't understand triggers. The basic idea is much like maintaining a bank account balance. You can either sum up every transaction for an account from the beginning of time to determine the current balance or you can take a little extra time when each transaction occurs to maintain a separate "current balance" field for the account. The first approach is nicely normalized, but the performance will suffer over time. The second approach spreads the workload out over time so that the performance of querying a balance remains constant. – HABO Jan 02 '13 at 18:37
@infinity1975 - Your comment re: having tables for direct and triangle swaps is unclear to me. How do you keep those tables current? – HABO Jan 02 '13 at 18:38
By just "precalcing" them when an ad changes. It's similair to trigger but I do it in my c# code. – infinity1975 Jan 02 '13 at 19:21

score 0 · Answer 5 · answered Dec 31 '12 at 07:42

0

1 - change select Count(*) to Count(1) or Count(id)

2 - Write set Nocount on at the first of your Stored procedure or at the first of your Query

3 - Use index on annonsid , annonsid2

4 - Having your Indexes after the primary key in your Table

answered Dec 31 '12 at 07:42

Alireza Masali

668
1
10
19

1- there is no difference between `count(*)` and count(1)` etc. I tested it. – alzaimar Dec 31 '12 at 11:38
Correct. There is a difference when it comes to WHAT to count. But this does not apply to infinity1975's problem, as far as I can see it. But +1 for pointing that out. – alzaimar Dec 31 '12 at 13:01

Optimize self join on millions of rows

5 Answers5