I'm trying to implement a cosine similarity search on a pre-vectorized database table (similar to trigram similarity), with objects in this structure:
from django.contrib.postgres.fields import ArrayField
from django.db import models

class Information(models.Model):
    vectorized = ArrayField(models.FloatField(default=0.0))  # will contain a 512-dimensional vector of floats
    original_data = models.TextField(blank=True)
    original_data_length = models.IntegerField(default=0)
where the attribute vectorized will contain a 512-dimensional vector generated from original_data.
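For context, a row would be created roughly like this (encode() here is a hypothetical stand-in for whatever sentence encoder produces the 512-dimensional vector; it is not part of the model above):

text = "An apple is a fruit."
vector = encode(text)  # hypothetical encoder returning a list of 512 floats

Information.objects.create(
    vectorized=vector,
    original_data=text,
    original_data_length=len(text),
)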
For example, the user inputs a string "What is an Apple?":
- The input is converted to a 512-dimensional vector A.
- A is iterated over all objects x in the database (or not).
- On each iteration, the normalized dot product (cosine similarity) is calculated between A and x.vectorized (see the cosine similarity definition; a small sketch follows this list).
- The x object with the highest similarity (highest normalized inner product with A) is chosen, and x.original_data is printed out.
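For reference, a minimal NumPy sketch of the cosine similarity I mean (nothing here beyond the standard definition):

import numpy as np

def cosine_similarity(a, b):
    # normalized dot product of two equal-length vectors
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. cosine_similarity(A, x.vectorized) for every row x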
I've implemented simple code for this purpose, but it is inefficient, since the work is done at the framework level rather than the database level, and memory is allocated for all the objects in the database table:
from numpy import dot  # dot product = inner product restricted to real numbers
from numpy.linalg import norm

from core.models import Information

user_input = user_input  # let this be the 512-dimensional vector converted from the user input

most_similar = ("", 0)
for item in Information.objects.all():
    similarity = dot(item.vectorized, user_input) / (norm(item.vectorized) * norm(user_input))
    if similarity > most_similar[1]:
        most_similar = (item.original_data, similarity)

print(most_similar[0])
Is there a more efficient way to implement the approach above?
Is there any way of doing this at the PostgreSQL level, so that the similarity is computed inside the database?
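What I imagine is something along these lines, computed entirely inside the database via a raw query (an untested sketch on my side; it assumes the default core_information table name and that the arrays can be compared against a float8[] parameter):

user_input = user_input  # same 512-dimensional vector as above

sql = """
    SELECT id, original_data,
           (SELECT sum(x * y) FROM unnest(vectorized, %s::float8[]) AS t(x, y))
           / (sqrt((SELECT sum(x * x) FROM unnest(vectorized) AS t(x)))
              * sqrt((SELECT sum(y * y) FROM unnest(%s::float8[]) AS t(y))))
           AS similarity
    FROM core_information
    ORDER BY similarity DESC
    LIMIT 1
"""

# raw() needs the primary key in the result set; the vector is passed twice, positionally
best = next(iter(Information.objects.raw(sql, [user_input, user_input])), None)
if best is not None:
    print(best.original_data)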
Thank you!