I'm working on an NLP project. I want to have a list of sentences whose vectors are stored on disk. I then want to iterate through those stored vectors (preferably in chunks) and, for a new sentence, ask "how similar is each stored sentence to this new one?".
Then store all of the resulting distances as "edges" in a graph.
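To make that concrete, the workflow I have in mind is roughly the following. This is only a sketch: the file names, the chunk size, and the choice of numpy + networkx are placeholders, not something I already have working.

import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical: sentence vectors previously saved to disk, one row per sentence
stored_vectors = np.load("sentence_vectors.npy")    # shape: (n_sentences, n_features)
new_vector = np.load("new_sentence_vector.npy")     # shape: (1, n_features)

graph = nx.Graph()
chunk_size = 1000

# Compare the new sentence against the stored vectors one chunk at a time
for start in range(0, stored_vectors.shape[0], chunk_size):
    chunk = stored_vectors[start:start + chunk_size]
    scores = cosine_similarity(new_vector, chunk)[0]
    for offset, score in enumerate(scores):
        # Edge between the new sentence and stored sentence number (start + offset)
        graph.add_edge("new_sentence", start + offset, weight=float(score))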
I've been walking through this tutorial: https://towardsdatascience.com/calculating-string-similarity-in-python-276e18a7d33a
and I've managed to get the following working:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# fit_transform returns the document-term matrix (one row per sentence), not the fitted vectorizer
vectorizer = CountVectorizer().fit_transform(sentences)

print(repr(vectorizer))
<46x186 sparse matrix of type '<class 'numpy.int64'>'
    with 283 stored elements in Compressed Sparse Row format>

print(repr(vectorizer.toarray()))
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
Here sentences is a list of cleaned sentences.
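The similarity step I then run against this matrix is roughly the following (all-pairs for now, rather than the chunked, one-vs-many version I'm aiming for):

from sklearn.metrics.pairwise import cosine_similarity

# All-pairs cosine similarity over the document-term matrix (46 sentences -> 46 x 46 matrix)
similarities = cosine_similarity(vectorizer)
print(similarities.shape)
(46, 46)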
But I have two concerns that I haven't been able to find a good answer for:
- Do I have to re-compute the vectors for all of the sentences every time I add a new one? (See the sketch after this list for what I'd like to do instead.)
- Assuming yes, is this operation expensive on non-trivial datasets?
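To make the first concern concrete, this is the kind of thing I'd like to be able to do instead of re-fitting on everything. It's only a sketch: "some new sentence" is a placeholder, and I'm aware that words outside the original vocabulary would be dropped.

from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary once on the existing sentences...
count_vectorizer = CountVectorizer().fit(sentences)
existing_vectors = count_vectorizer.transform(sentences)

# ...then vectorize only the new sentence later, without re-fitting.
# Words that weren't in the original vocabulary are silently ignored here.
new_vector = count_vectorizer.transform(["some new sentence"])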
I've read this question: How to save and load numpy.array() data properly?
That would let me save and load the vectors if they can be re-used, but it doesn't answer whether they actually can be re-used in the first place.
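For what it's worth, the saving/loading part itself looks straightforward if re-use is valid, e.g. something along these lines with the sparse matrix from above (the file name is a placeholder):

from scipy import sparse

# Persist the sparse document-term matrix...
sparse.save_npz("sentence_vectors.npz", vectorizer)

# ...and load it back later without recomputing it from the sentences.
loaded_vectors = sparse.load_npz("sentence_vectors.npz")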