1
class Item(models.Model):
    vector_repr = models.TextField(..., verbose_name='jsonified vector representation')

...
# My current solution:

def as_vector(item): return np.asarray(json.loads(item.vector_repr))

item = Item.objects.get(...)
item_vect = as_vector(item)

def cosine_similarity(other): return np.dot(item_vect, as_vector(other))

db_items = Item.objects.exclude(id=item.id)
similar_items = sorted(db_items, key=cosine_similarity) 

Basically i want to sort all the Items in a mysql database applying the cosine similarity with a given item.

The problem is that the vector that represents all the items (vector_repr) is very large, and the items in the database are a lot, so this method is really slow (~2min).

How can i speed up this process? (Possibly without storing in my db the similarity of each pair of items)

ale-cci
  • 153
  • 2
  • 11
  • Your solution is very confusing. Why are you using `lambda` for named functions? Why aren't you using the `it` argument for the first lambda? And why are you using a text field for something that presumably is numerical data? – Håken Lid Oct 18 '18 at 21:53
  • 2
    The typical way to drastically improve performance is to let the database do as much as possible of the processing. But this requires some familiarity with postgresl (which is the database you should use). And the array should be stored in an [ArrayField](https://docs.djangoproject.com/en/2.1/ref/contrib/postgres/fields/#arrayfield) See this question: [Postgres: index on cosine similarity of float arrays for one-to-many search](https://stackoverflow.com/questions/44792202/postgres-index-on-cosine-similarity-of-float-arrays-for-one-to-many-search) – Håken Lid Oct 18 '18 at 21:59
  • Sorry, there was a typo in the first lambda. Unfurtunately i'm using mysql (forgot to specify it the question) – ale-cci Oct 18 '18 at 22:06
  • You might get a lot better performance just by switching to postgresql and using arrayfield, since you wouldn't need that extra json parsing step. – Håken Lid Oct 18 '18 at 22:13
  • First, determine what code is doing the "cosine similarity". (Nothing in MySQL does such.) Second, determine if most of the time is spent there. Then we can discuss optimizations. – Rick James Oct 20 '18 at 00:01

0 Answers0