I would like to compare different vectors between any two ids for a number of dimensions (i.e, number of components in the vector). For that purpose I calculate the cosine similarity. To do so I have a list of vectors for each id in my dataset. Here is an example data:
create table data (id int, dimension int, v float);
insert into data values (99,1,4);
insert into data values (99,2,3);
insert into data values (99,3,4);
insert into data values (99,4,4);
insert into data values (1234,1,5);
insert into data values (1234,2,5);
insert into data values (1234,3,2);
insert into data values (1234,4,3);
but in my table I have ~ 2000 ids and ~ 4000 dimensions for each id. This produces a very very large table.
To calculate the cosine similarity I do the following:
select v1.id as id1, v2.id as id2, SUM(v1.v * v2.v)/(SQRT(SUM(v1.v * v1.v))* SQRT(SUM(v2.v * v2.v))) as cosine
from data v1
inner join data v2 on v1.dimension =v2.dimension and v1.id<v2.id
group by v1.id, v2.id
Unfortunately this solution is very expensive for large tables. Using a set-based approach what would be the best way to optimize this query? Should I use a while loop? other?