I'm working on an NLP project. I want to have a list of sentences whose vectors are stored on disk. I then want to iterate through those stored vectors (preferably in chunks) and, for a new sentence, ask "how similar is each stored sentence to this new one?".
Then store all of the resulting distances as "edges" in a graph.
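To make that concrete, the workflow I have in mind is roughly the following. This is only a sketch: the file names, the chunk size, and the choice of numpy + networkx are placeholders, not something I already have working.

import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical: sentence vectors previously saved to disk, one row per sentence
stored_vectors = np.load("sentence_vectors.npy")    # shape: (n_sentences, n_features)
new_vector = np.load("new_sentence_vector.npy")     # shape: (1, n_features)

graph = nx.Graph()
chunk_size = 1000

# Compare the new sentence against the stored vectors one chunk at a time
for start in range(0, stored_vectors.shape[0], chunk_size):
    chunk = stored_vectors[start:start + chunk_size]
    scores = cosine_similarity(new_vector, chunk)[0]
    for offset, score in enumerate(scores):
        # Edge between the new sentence and stored sentence number (start + offset)
        graph.add_edge("new_sentence", start + offset, weight=float(score))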
I've been walking through this tutorial: https://towardsdatascience.com/calculating-string-similarity-in-python-276e18a7d33a
and I've managed to get the following working:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# fit_transform returns the document-term matrix (one row per sentence), not the fitted vectorizer
vectorizer = CountVectorizer().fit_transform(sentences)

print(repr(vectorizer))
<46x186 sparse matrix of type '<class 'numpy.int64'>'
    with 283 stored elements in Compressed Sparse Row format>

print(repr(vectorizer.toarray()))
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
Here sentences is a list of cleaned sentences.
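The similarity step I then run against this matrix is roughly the following (all-pairs for now, rather than the chunked, one-vs-many version I'm aiming for):

from sklearn.metrics.pairwise import cosine_similarity

# All-pairs cosine similarity over the document-term matrix (46 sentences -> 46 x 46 matrix)
similarities = cosine_similarity(vectorizer)
print(similarities.shape)
(46, 46)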
But I have two concerns that I haven't been able to find a good answer for:
- Do I have to re-compute the vectors for all of the sentences every time I add a new one? (See the sketch after this list for what I'd like to do instead.)
- Assuming yes, is this operation expensive on non-trivial datasets?
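To make the first concern concrete, this is the kind of thing I'd like to be able to do instead of re-fitting on everything. It's only a sketch: "some new sentence" is a placeholder, and I'm aware that words outside the original vocabulary would be dropped.

from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary once on the existing sentences...
count_vectorizer = CountVectorizer().fit(sentences)
existing_vectors = count_vectorizer.transform(sentences)

# ...then vectorize only the new sentence later, without re-fitting.
# Words that weren't in the original vocabulary are silently ignored here.
new_vector = count_vectorizer.transform(["some new sentence"])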
I've read this question: How to save and load numpy.array() data properly?
That would let me save and load the vectors if they can be re-used, but it doesn't answer whether they actually can be re-used in the first place.
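For what it's worth, the saving/loading part itself looks straightforward if re-use is valid, e.g. something along these lines with the sparse matrix from above (the file name is a placeholder):

from scipy import sparse

# Persist the sparse document-term matrix...
sparse.save_npz("sentence_vectors.npz", vectorizer)

# ...and load it back later without recomputing it from the sentences.
loaded_vectors = sparse.load_npz("sentence_vectors.npz")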