
I would like to compare the vectors of any two ids across a number of dimensions (i.e., the number of components in each vector). For that purpose I calculate the cosine similarity. Each id in my dataset has one vector, stored as one row per dimension. Here is some example data:

create table data (id int, dimension int, v float);
insert into data values (99,1,4);
insert into data values (99,2,3);
insert into data values (99,3,4);
insert into data values (99,4,4);
insert into data values (1234,1,5);
insert into data values (1234,2,5);
insert into data values (1234,3,2);    
insert into data values (1234,4,3);

But in my real table I have ~2000 ids and ~4000 dimensions per id, which produces a very large table.
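That is roughly 2000 × 4000 ≈ 8 million rows, and the pairwise comparison below has to produce 2000 · 1999 / 2 = 1,999,000 id pairs.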

To calculate the cosine similarity I do the following:

select v1.id as id1, v2.id as id2, SUM(v1.v * v2.v)/(SQRT(SUM(v1.v * v1.v))* SQRT(SUM(v2.v * v2.v))) as cosine
from data v1
     inner join data v2 on v1.dimension = v2.dimension and v1.id < v2.id
group by v1.id, v2.id
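
As a sanity check on the sample data above, id 99 is the vector (4, 3, 4, 4) and id 1234 is (5, 5, 2, 3): the dot product is 4·5 + 3·5 + 4·2 + 4·3 = 55, the norms are √57 ≈ 7.55 and √63 ≈ 7.94, so the query returns a single row with cosine ≈ 55 / 59.92 ≈ 0.918.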

Unfortunately this solution is very expensive for large tables. What would be the best way to optimize this query using a set-based approach? Should I use a while loop, or something else?

Mars
  • That **is** a set-based approach - a while loop certainly isn't. For optimisation advice you need to include the execution plan (check out "paste the plan"). – Dale K Jun 28 '21 at 09:51
  • Thanks Dale K, can you please point me to a resource about that? – Mars Jun 28 '21 at 09:54
  • FYI, there's little point having `and v1.id<>v2.id` in the `ON` and then `v1.id – Thom A Jun 28 '21 at 09:54
  • Do you really need to calculate all 8 million Cosines at once? If so, I don't think there is any way around it being a very expensive operation. – GarethD Jun 28 '21 at 10:34
  • Unfortunately yes :( – Mars Jun 28 '21 at 12:18
  • Please see [paste the plan](https://www.brentozar.com/pastetheplan/instructions/) for a way to include an execution plan in your question. There are ways of denormalizing the data, but they require a good understanding of how the data are used, e.g. how often are id's added, are dimensions ever updated for an id, are there id's that have duplicate dimensions, ... . – HABO Jun 28 '21 at 21:12

1 Answer


You can make a small optimization by precomputing SQRT(SUM(v1.v * v1.v)) for each id in a temporary table.

To test the speed, I have enriched your test data this way:

-- Add 999 shifted copies of each original id (ids 99+N and 1234+N for N = 1..999)
INSERT INTO data
SELECT id+N, dimension, v+N
FROM data CROSS JOIN (
    SELECT TOP 999 ROW_NUMBER() OVER (ORDER BY low) AS N
    FROM master.dbo.spt_values
) T

Now we have 2000 distinct ids, each with 4 dimensions.
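
If you want to double-check the enriched data (this count query is just an added sanity check, not part of the timing comparison):

-- 2 original ids + 2 x 999 shifted copies; should return 2000
SELECT COUNT(DISTINCT id) AS id_count FROM data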

The original query returns 1999000 rows in 29 seconds on my machine.

The following script returns the same data in 21 seconds:

-- Precompute each id's norm SQRT(SUM(v*v)) once
CREATE TABLE #Sums (
    id INT PRIMARY KEY,
    S FLOAT
)

INSERT INTO #Sums (id, S)
SELECT id, SQRT(SUM(v*v)) AS S FROM data GROUP BY id

-- The big join now only aggregates the dot products;
-- the norms are looked up from #Sums instead of being recomputed per pair
SELECT x.id1, x.id2, x.sp/(s1.s*s2.s) AS cosine
FROM (
    select v1.id as id1, v2.id as id2, SUM(v1.v * v2.v) as sp
    from data v1
    inner join data v2 on v1.dimension = v2.dimension and v1.id < v2.id
    group by v1.id, v2.id
) x
INNER JOIN #Sums s1 ON s1.id=x.id1
INNER JOIN #Sums s2 ON s2.id=x.id2

DROP TABLE #Sums
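
As a spot check against the worked example in the question (this extra query is an addition for illustration, not part of the original answer), the pair (99, 1234) should come out at about 0.918:

-- Cosine for a single pair of ids, computed directly on the base data
SELECT SUM(v1.v * v2.v) / (SQRT(SUM(v1.v * v1.v)) * SQRT(SUM(v2.v * v2.v))) AS cosine
FROM data v1
INNER JOIN data v2 ON v1.dimension = v2.dimension
WHERE v1.id = 99 AND v2.id = 1234
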
Razvan Socol
  • your code is indeed very fast wow, and the calculation is correct too. I hope this thread will help many, thanks a million! – Mars Jun 29 '21 at 08:55