3

I'm trying to implement the cosine similarity search on pre-vectorized database table (like trigram similarity), having objects in this structure:

from django.contrib.postgres.fields import ArrayField
from django.db import models

class Information(object):
    vectorized = ArrayField(models.FloatField(default=0.0))  # will contain 512-dimensional vector of floats
    original_data = models.TextField(blank=True)
    original_data_length = models.IntegerField(default=0)

where the attribute vectorized will contain 512 dimensional vector that was created generated from original_data.


For example, user inputs a string "What is an Apple?":

  1. Input is converted to 512-dimensional vector A.
  2. A is iterated over all objects x on the database (or not).
  3. On each iteration, normalized dot product (cosine similarity) is calculated between A and x.vectorized (see cosine similarity definition).
  4. x object with highest similarity (highest normalized inner product with A) is chosen, and x.original_data is printed out.

I've implemented simple code for this purpose, it is inefficient since it is performed on the framework level rather than database level, and memory is allocated for all the objects in the database table:

from core.models import Information
from numpy import dot  # dot product = inner product limited for real numbers
from numpy.linalg import norm

user_input = user_input  # let this be 512 dimensional vector converted from user input
most_similar = ("", 0)
for item in Information.objects.all():
    similarity = dot(item, user_input)/norm(item, user_input)
    if similarity > most_similar[1]: 
        most_similar = (item.original_data, similarity)
print(most_similar[0])

Is there any way for implementing more efficient approach of the code above?

Is there any way of doing this using PostgreSQL?

Thank you!

ShellRox
  • 2,532
  • 6
  • 42
  • 90

2 Answers2

1

This has been working for me -- note that it requires pre-normalized vectors which is a good default anyway overall:

CREATE OR REPLACE FUNCTION dot_product_norm_d(a double precision[], b double precision[])
RETURNS double precision AS
$$
SELECT sum(result)
FROM (SELECT (tuple.val1 * tuple.val2) AS result
      FROM (SELECT UNNEST($1) AS val1,
                   UNNEST($2) AS val2,
                   generate_subscripts($1, 1) AS ix) tuple
      ORDER BY ix) inn;
$$ LANGUAGE SQL IMMUTABLE STRICT;

There's an answer to a related problem here that was helpful: Vector (array) addition in Postgres

unmounted
  • 33,530
  • 16
  • 61
  • 61
0

It's not possible to perform cosine similarity on vectors inside PostgreSQL. To do that, you need to use a vector database like AquilaDB or EuclidesDB. AquilaDB supports JSON documents to be stored along with vectors, which I find is very suitable in your case. Because, you could add any metadata that will cross reference any vector indexed in AquilaDB to your PostgreSQL DB. They have some nice tutorials on their wiki page.

ffox
  • 21
  • 1
  • 2