
A thresholded similarity graph is a set of nodes and edges, where nodes are connected by an edge iff the similarity between the two nodes is higher than a given threshold.

Building such a graph of n nodes is easy: create an n x n matrix M, with one row and one column per node, then fill each cell M[i, j] with the similarity between node i and node j iff it is higher than the given threshold. The complexity here is obviously O(n^2).
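
For reference, a minimal Python sketch of this naive construction (`nodes`, `similarity` and `threshold` are placeholders of my own, not from any particular library):

```python
def build_threshold_graph(nodes, similarity, threshold):
    """Naive construction: compare every ordered pair of nodes -> O(n^2)."""
    n = len(nodes)
    M = [[False] * n for _ in range(n)]              # n x n adjacency matrix
    for i in range(n):
        for j in range(n):
            if similarity(nodes[i], nodes[j]) > threshold:
                M[i][j] = True
    return M
```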

This can be slightly improved by not computing M[i, j] if i == j, or if M[j, i] has already been computed (assuming that the similarity between nodes i and j is the same as the similarity between nodes j and i). However, the number of computations is then n * (n - 1) / 2, which is still O(n^2).
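
The same sketch restricted to the pairs i < j, exploiting the assumed symmetry of the similarity function:

```python
def build_threshold_graph_symmetric(nodes, similarity, threshold):
    """Computes only the n * (n - 1) / 2 unordered pairs; still O(n^2) overall."""
    n = len(nodes)
    M = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):                    # skip i == j and j < i
            if similarity(nodes[i], nodes[j]) > threshold:
                M[i][j] = M[j][i] = True             # symmetry: mirror the edge
    return M
```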

Given that the similarity function used is either a metric or the cosine similarity (although this info may not be relevant), is there a way to compute such a thresholded similarity graph with a complexity better than O(n^2)?

Thanks, Romain.

Romain G
  • I don't think so; this is the reason why, for example, in [Near Duplicates Detection](http://stackoverflow.com/a/23053827/572670) (for search engines) we use a completely different approach, and not the classic cosine similarity - because cosine similarity is very hard to compute when the corpus includes billions+ of documents. – amit May 11 '15 at 20:37
  • I think what you're really looking for is a way to implicitly *represent* such a graph in less than O(n^2) space. This will be tricky, since the standard representations are adjacency matrix (always O(n^2) space) and adjacency list, which is O(n^2) space for a complete graph -- and it's certainly possible to have a complete graph as a threshold graph. It might still be possible though. – j_random_hacker May 11 '15 at 20:39
  • @j_random_hacker I am actually looking for _time_ complexity. I don't mind the O(n^2) _space_ complexity for my purpose. – Romain G May 11 '15 at 21:12
  • @amit which approach are you talking about that is used in search engines? (Any line of thought would be appreciated.) The Jaccard index explained in the post you linked still requires an operation between all pairs of documents (or nodes), which doesn't seem feasible when working with millions of documents. – Romain G May 11 '15 at 21:12
  • @RomainG: The time complexity must always be at least the space complexity. Every byte used for storage is read or written at least once! – j_random_hacker May 11 '15 at 21:13
  • @j_random_hacker oh that is very true, thank you for insisting. Then I should either optimize my similarity measure or find another approach as suggested by amit. – Romain G May 11 '15 at 21:16
  • @RomainG The link I provided in this answer shows an efficient way to calculate it w/o going through every pair (of course in the worst case you'll have to, but usually you won't). Specifically, I am speaking about [this link](http://webcourse.cs.technion.ac.il/236375/Winter2013-2014/ho/WCFiles/tutorial_8_near_duplicates_detection.pdf), and the section `Algorithm for Clustering Near-Duplicate Documents` (page 9). As you can see, to compute the Jaccard similarity, you do NOT need to go through all pairs. – amit May 11 '15 at 21:46
  • @RomainG Note that this works only because we can calculate Jaccard similarity in a probabilistic way. I am not familiar with a trick for doing something similar with cosine similarity (and I have been working, studying and teaching in the field of IR for ~5 years). (Though there might be something I just have never seen. TBH near duplicates is not really my focus, but I do believe I would have seen it if it existed.) – amit May 11 '15 at 22:05
  • @amit Thank you for your link, I think Jaccard similarity is definitely an option for me and at least worth a try. – Romain G May 12 '15 at 13:32

2 Answers


I think you can do it with complexity O(m), where m is the number of edges. If the edge between i and j doesn't exist, then you don't have to put the result in your new graph. Even if you think of it as a matrix, you can achieve O(m) by using a sparse matrix representation or an adjacency list.

Of course, in some (not so exceptional) cases like yours, m = n^2.
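
To illustrate the point about sparse output, here is a hedged sketch that stores only the edges passing the threshold; note the pairwise comparisons themselves remain O(n^2), only the stored result is O(m):

```python
from collections import defaultdict

def build_adjacency_list(nodes, similarity, threshold):
    """Adjacency-list output: O(m) space for the graph itself."""
    adjacency = defaultdict(list)
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if similarity(nodes[i], nodes[j]) > threshold:
                adjacency[i].append(j)               # store only edges above the threshold
                adjacency[j].append(i)
    return adjacency
```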

Maxime Maillot

Summarizing the discussion above, there is not really an answer to my question as asked. As mentioned by @j_random_hacker, each byte used for graph storage is read or written at least once. Since the space complexity of the graph is O(n^2) in the worst case, the time complexity cannot be improved beyond that.

@amit suggested using Jaccard similarity in a probabilistic manner, as explained in the tutorial he linked in the comments, to find near-duplicates; but this does not seem doable with cosine similarity and only works with text data (not document vectors).
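
For illustration, here is a rough sketch of the kind of probabilistic pruning @amit describes (MinHash signatures banded into buckets, so that only documents sharing a bucket become candidate pairs). The hashing scheme and the parameter choices below are my own illustrative assumptions, not taken from the linked tutorial:

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def minhash_signature(shingles, num_hashes=64):
    """One min-hash per seeded hash function; approximates Jaccard similarity."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles))
    return signature

def candidate_pairs(doc_shingles, num_hashes=64, bands=16):
    """LSH banding: documents landing in the same bucket become candidate pairs."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, shingles in doc_shingles.items():
        sig = minhash_signature(shingles, num_hashes)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs   # only these pairs need an exact similarity check
```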

An alternative that does not reduce the complexity of the algorithm, but does build the graph faster, is to parallelize the computation on a GPU. This would probably make it feasible to build such a graph on datasets one or two orders of magnitude bigger. Anybody interested should look into CUDA or OpenCL for this purpose.
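
As a starting point for that route, here is a hedged NumPy sketch that expresses the whole thresholded cosine-similarity matrix as one matrix product; the same operations map naturally onto GPU array libraries, although this is plain CPU code, not a CUDA/OpenCL implementation:

```python
import numpy as np

def threshold_cosine_graph(vectors, threshold):
    """Dense O(n^2) computation, expressed as one matrix product a GPU can parallelize."""
    X = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    normalized = X / norms                      # unit-length rows
    sims = normalized @ normalized.T            # cosine similarity of every pair
    adjacency = sims > threshold
    np.fill_diagonal(adjacency, False)          # drop self-loops
    return adjacency
```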

Romain G