11

Cosine similarity between two equally-sized vectors (of reals) is defined as the dot product divided by the product of the norms.
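For concreteness, here's a minimal sketch of that definition as a plain SQL function; the name `cos_sim` is just for the example below, and this computes the value with no index support, which is exactly the problem:

```sql
-- Minimal sketch: cosine similarity of two equal-length float8 arrays,
-- i.e. dot(a, b) / (||a|| * ||b||). No index can accelerate this.
CREATE FUNCTION cos_sim(a float8[], b float8[]) RETURNS float8 AS $$
  SELECT sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))
  FROM unnest(a, b) AS t(x, y)
$$ LANGUAGE sql IMMUTABLE;
```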

To represent vectors, I have a large table of float arrays, e.g. `CREATE TABLE foo(vec float[])`. Given a certain float array, I need to quickly (with an index, not a seqscan) find the closest arrays in that table by cosine similarity, e.g. `SELECT * FROM foo ORDER BY cos_sim(vec, ARRAY[1.0, 4.5, 2.2]) DESC LIMIT 10;` But what do I use for `cos_sim`?

pg_trgm's cosine similarity support is different. It compares text, and I'm not sure what it does exactly. An extension called smlar (here) also has cosine similarity support for float arrays but again is doing something different. What I described is commonly used in data analysis to compare features of documents, so I was thinking there'd be support in Postgres for it.

sudo
  • Can you clarify what you mean by "with an index"? Cosine similarity is a binary operation which, in the structure you describe, will operate on pairs of rows. Indexes don't take pairs of rows. – rd_nielsen Jun 28 '17 at 04:25
  • @rd_nielsen `<` is also a binary operator, but there is btree index support in Postgres to speed up queries with filtering and ordering. `<@` is a binary operator for arrays -- and again, there is GiST and GIN support to speed up queries. – Nick Jun 28 '17 at 04:43
  • @sudo Can you explain "pg_trgm's cosine similarity support is different"? There are examples in the PDF from Bartunov/Sigaev with ORDER BY. The task of finding "most similar / closest objects" is called kNN search -- they implemented it for Euclidean distance in GiST a few years before that PDF was published (see https://www.pgcon.org/2010/schedule/attachments/168_pgcon-2010-1.pdf), so I assume they considered the same goals (not only "find all objects within some range", but kNN too) for cosine similarity later. – Nick Jun 28 '17 at 04:47
  • @Nick -- Is it possible, do you think, to create a GiST or GIN index of the cosine similarity of all pairs of rows in a table? – rd_nielsen Jun 28 '17 at 04:56
  • It is definitely possible; this is what the "smlar" extension was created for. The only open question to me is whether it supports index scans for kNN search (i.e. ORDER BY .. LIMIT ..) or only range lookups (like "get all records where similarity < some value"). – Nick Jun 28 '17 at 05:26
  • Those Russian guys don't create extensions w/o indexing support :) They are the authors of full text search (with GiST and GIN indexes), array indexing capabilities (GIN, and RD-tree via GiST), the GiST version of R-tree (which replaced the original R-tree implementation), hstore, indexing support for jsonb, and many more. I would be very surprised if "smlar" were implemented w/o indexing support. – Nick Jun 28 '17 at 05:30
  • @Nick `smlar`'s cosine similarity metric is defined as N_i / sqrt(N_a * N_b), where N_i is the number of unique elements in the intersection and N_a and N_b are the numbers of unique elements in each vector. (1.1, 2.0) and (1.0, 2.1) have 0 similarity by their metric. They explain how the index works in their paper, and it seems this is an approximation aimed at certain use cases that makes the indexing possible (unless there's some more complicated way). `pg_trgm` doesn't say what it uses, but it seems to be similar. It also only takes text arguments, so it can't be what I'm looking for. – sudo Jun 28 '17 at 15:42
  • By the way, searching online, there seem to be two different meanings of "cosine similarity." Sometimes I get the set distance metric that `smlar` and (I think) `pg_trgm` use, and sometimes I get the angle distance that Wikipedia describes. – sudo Jun 28 '17 at 16:02
  • http://sigaev.ru/git/gitweb.cgi?p=smlar.git;a=tree;hb=HEAD tells us that there should be both GIN and GiST support in `smlar` extension (see files smlar_gin.c and smlar_gist.c) – Nick Jun 29 '17 at 20:43
  • Usually, distance is defined as the inverse of similarity: the greater the similarity, the "more similar" the objects are, and the smaller the distance between them. – Nick Jun 29 '17 at 20:44
  • Oh. "N_i is the number of unique elements in the intersection and N_a and N_b are number of unique elements in each vector" looks more like Jaccard similarity (and not exactly -- some weird variant), definitely not like cosine similarity... Where did you get that definition? Source code or docs? Can you provide a link? – Nick Jun 29 '17 at 20:50
  • @Nick It's in the paper they wrote (http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf) and the source code (lines 670 and 806 https://github.com/jirutka/smlar/blob/master/smlar.c). Just in case, I also tried a few examples in SQL. Re GIN/GiST: Yeah, it supports both. – sudo Jun 30 '17 at 17:43
  • FWIW, [this post](https://stackoverflow.com/a/58095682/4126114) has a PostgreSQL implementation of cosine similarity. – Shadi Jan 24 '23 at 18:29

2 Answers

11

I gather there's no extension that does this, so I've found a limited workaround:

If A and B are both normalized (length 1), then cos(A, B) = 1 - 0.5 * ||A - B||^2, where ||A - B|| is the Euclidean distance and cos(A, B) is the cosine similarity. So greater Euclidean distance <=> lesser cosine similarity (which makes sense intuitively if you imagine a unit circle), and for non-normalized vectors, changing their magnitudes without changing their directions doesn't affect their cosine similarities. Great, so I can normalize my vectors and compare their Euclidean distances...
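To see why the identity holds, expand the squared distance for unit-length A and B:

    ||A - B||^2 = ||A||^2 - 2 * (A · B) + ||B||^2 = 2 - 2 * cos(A, B)

and solving for cos(A, B) gives 1 - 0.5 * ||A - B||^2.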

There's a nice answer here about the cube extension, which supports n-dimensional points and GiST indexes on Euclidean distance, but it only supports 100 or fewer dimensions (the limit can be hacked higher, but I had issues around 135 and up, so now I'm afraid to). It also requires Postgres 9.6 or later.

So:

  1. Make sure at most 100 dimensions is acceptable, and upgrade to Postgres 9.6 or later.
  2. Fill my table with arrays representing the vectors.
  3. Normalize the vectors into an extra column of cube points, and create a GiST index on that column.
  4. Order by Euclidean distance ascending to get cosine similarity descending (see the sketch after this list): `EXPLAIN SELECT * FROM mytable ORDER BY normalized <-> cube(array[1,2,3,4,5,6,7,8,9,0]) LIMIT 10;`
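Here's a rough sketch of steps 2-4 in SQL, assuming Postgres 9.6+; the `l2_normalize` helper and the table/column names are made up for illustration:

```sql
CREATE EXTENSION IF NOT EXISTS cube;

-- Hypothetical helper: scale a float8 array to unit (L2) length.
-- Assumes no zero vectors.
CREATE FUNCTION l2_normalize(a float8[]) RETURNS float8[] AS $$
  SELECT array_agg(u.x / sqrt(s.n) ORDER BY u.i)
  FROM unnest(a) WITH ORDINALITY AS u(x, i),
       (SELECT sum(x * x) AS n FROM unnest(a) AS x) AS s
$$ LANGUAGE sql IMMUTABLE;

-- Store normalized vectors as cube points and index them with GiST.
ALTER TABLE mytable ADD COLUMN normalized cube;
UPDATE mytable SET normalized = cube(l2_normalize(vec));
CREATE INDEX ON mytable USING gist (normalized);

-- KNN index scan: smallest Euclidean distance first, which is highest
-- cosine similarity first.
SELECT * FROM mytable
ORDER BY normalized <-> cube(l2_normalize(ARRAY[1,2,3,4,5,6,7,8,9,0]::float8[]))
LIMIT 10;
```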

If I need more than 100 dimensions, I might be able to achieve this using multiple indexed columns. Will update the answer in that case.

Update: Pretty sure there's nothing I can do with splitting the >100-dimension vector into multiple columns. I end up having to scan the entire table.

sudo
  • Is there any progress? The Euclidean distance relation nicely reduces this to the closest-pair-of-points problem, but the indexing problem seems quite difficult. Perhaps locality-sensitive hashing may work? – ShellRox Oct 28 '18 at 12:00
  • @ShellRox Hi! Well, that extension does the indexing, and they have a paper on the hashing mechanism, except Postgres has that arbitrary limitation on an index's per-row size. I can go to around 180 dimensions if I edit the extension to use float4 instead of float8 (aka double). This project was abandoned for unrelated reasons, so I haven't poked at the indexing issue since. Also, I suppose Postgres isn't what people use for latency-sensitive machine learning model inference ;) – sudo Oct 28 '18 at 23:56
  • Thank you for the response! I use Postgres as well and want to do cosine similarity search over 512-dimensional vectors. Unfortunately, I couldn't find anything efficient (other than a full scan) -- this question was the only hope. So I believe I'll have to write my own search function. – ShellRox Oct 29 '18 at 09:04
  • Henry Conklin's answer might be suitable for your needs. – sudo Nov 03 '18 at 01:32
6

If you're ok with an inexact solution, you can use random projection: https://en.wikipedia.org/wiki/Random_projection.

Randomly generate k different vectors of the same length as your other vectors and store them somewhere; you will use these to spatially bin your data. For each vector in your table, take its dot product with each of the random vectors and record the sign of each product.

Vectors with the same sign for each random vector go in the same bin, and generally vectors that have high cosine similarity will end up in the same bin. You can pack the signs as bits into an integer and use a normal index to pull vectors in the same bin as your query, then do a sequential search to find those with the highest cosine similarity.
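Here's a rough sketch of the scheme in plain SQL under assumed names: the random vectors live in `proj(i int, rvec float8[])` with i running from 0 to k-1 (keep k at most 31 so the key fits in an int), and the data is in `vecs(id, vec)`:

```sql
-- Dot product of two equal-length float8 arrays.
CREATE FUNCTION dot(a float8[], b float8[]) RETURNS float8 AS $$
  SELECT sum(x * y) FROM unnest(a, b) AS t(x, y)
$$ LANGUAGE sql IMMUTABLE;

-- Bin key: one sign bit per random vector, packed into an integer.
CREATE FUNCTION bin_of(v float8[]) RETURNS int AS $$
  SELECT sum(CASE WHEN dot(v, rvec) >= 0 THEN 1 << i ELSE 0 END)::int
  FROM proj
$$ LANGUAGE sql STABLE;

ALTER TABLE vecs ADD COLUMN bin int;
UPDATE vecs SET bin = bin_of(vec);
CREATE INDEX ON vecs (bin);

-- The btree index narrows the search to one bin; exact similarity then
-- ranks that small candidate set (the dot product gives the same
-- ordering as cosine similarity when the stored vectors are unit length).
SELECT * FROM vecs
WHERE bin = bin_of(ARRAY[0.3, -1.2, 0.7]::float8[])
ORDER BY dot(vec, ARRAY[0.3, -1.2, 0.7]::float8[]) DESC
LIMIT 10;
```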

Henry
  • I've heard of dimensionality reduction but haven't seen this kind. This is cool because you could even do it with just Postgres tables/queries if you didn't want to write an extension. But it wouldn't have worked for my case because my data was already "compressed" down from higher dimensionality (it's natural language) using SVD or GloVe. I was finding that 300d gave more accurate results than 100d, so it was really the full 300 dimensions that I wanted. – sudo Nov 03 '18 at 01:30
  • Random projection can be used as dimensionality reduction, but in this case we're leaving the vectors as they are and using the random projection for binning. The random projection is being used as a key into a spatial hash table instead of as a lower-dimensional representation. So your vectors will stay at the full 300 dimensions, but it will be easier to pull a group of similar vectors via an index. An analogy in 2D would be using the x and y axes as your "random" vectors and grouping your vectors by the quadrant they're in. – Henry Nov 03 '18 at 14:07
  • This looks similar to [a recent paper](https://arxiv.org/ftp/arxiv/papers/1505/1505.03090.pdf) that uses a random binary forest to divide the vector space into non-overlapping cells of convex hyper-polyhedra bounded by hyperplanes. I will definitely try both of them out, thank you! – ShellRox Nov 03 '18 at 14:14
  • @HenryConklin Oh right, I confused myself. That makes sense, and I agree this should work. – sudo Nov 06 '18 at 00:50
  • Years later, I was poking through my SO stuff and decided this was the more correct answer, though mine works without having to write a custom extension. Also felt bad giving myself the checkmark given how much great help you all provided here. – sudo Nov 25 '20 at 01:51