So I am aware there are several methods for finding the most similar document (or, say, the three most similar) in a corpus of documents. I know there can be scaling issues; for now I have around ten thousand documents and have been running tests on a subset of around thirty. This is what I have so far, but I am considering looking into Elasticsearch or doc2vec if this approach proves impossible or inefficient.
The scripts work very nicely so far: they use spaCy to tokenise the text and scikit-learn's TfidfVectorizer to fit across all the documents, and very similar documents are found. I notice that the shape of the array coming out of the pipeline is (33, 104354), which presumably means a vocabulary of 104,354 terms (excluding stopwords) across the 33 documents. That step takes a good twenty minutes to run. The next step, a matrix multiplication that computes all the cosine similarities, is very quick, but I know it might slow down as that matrix grows to thousands of rows rather than thirty.
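For reference, a minimal sketch of the kind of pipeline I mean (the corpus here is a toy stand-in; in my real setup the spaCy tokeniser is plugged in via the tokenizer= argument, but the default analyzer keeps this self-contained):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the real documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]

# Fit TF-IDF across all documents; stopwords are excluded.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # shape: (n_docs, vocab_size)

# Cosine similarity of every document against every other one.
sims = cosine_similarity(tfidf)  # shape: (n_docs, n_docs)
print(sims.shape)
```

Since TF-IDF rows come out L2-normalised, the cosine similarity is just the matrix of inner products, which is why that step is so fast.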
If you could efficiently add a new document to the matrix, it wouldn't matter whether the initial computation took ten hours or even days, as long as you saved the result of that computation.
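Saving the expensive part seems straightforward in principle; a sketch of what I have in mind, persisting both the sparse matrix and the fitted vectorizer (file names are placeholders):

```python
import pickle
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first toy document", "second toy document"]  # stand-in corpus

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Save the results of the expensive fit once...
scipy.sparse.save_npz("tfidf_matrix.npz", tfidf)
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# ...and reload them later without refitting anything.
tfidf_loaded = scipy.sparse.load_npz("tfidf_matrix.npz")
with open("vectorizer.pkl", "rb") as f:
    vectorizer_loaded = pickle.load(f)
```

The reloaded vectorizer keeps the learned vocabulary and IDF weights, so it can be reused later.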
- When I press Tab after the ., there seems to be an attribute on the vectorizer called vectorizer.fixed_vocabulary_. I can't find it on Google or in the scikit-learn documentation. Anyway, when I access it, it returns False. Does anyone know what this is? I am thinking it might be useful to fix the vocabulary if possible; otherwise it might be troublesome to add a new document to the term-document matrix, although I am not sure how to do that.
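From a quick experiment, fixed_vocabulary_ appears to be an attribute set during fitting rather than a method: it is False when the vocabulary was learned from the data, and True when an explicit vocabulary was passed to the constructor. A small check (the toy corpus and word list are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["alpha beta", "beta gamma"]  # toy corpus

# Default behaviour: the vocabulary is learned from the data.
learned = TfidfVectorizer()
learned.fit(docs)
print(learned.fixed_vocabulary_)  # False

# Passing an explicit vocabulary fixes it up front.
fixed = TfidfVectorizer(vocabulary=["alpha", "beta", "gamma"])
fixed.fit(docs)
print(fixed.fixed_vocabulary_)  # True
```

So fixing the vocabulary in advance does seem possible via the vocabulary argument, though I am not sure that alone solves the new-document problem.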
Someone asked a similar question here, which got voted up, but nobody ever answered. He wrote:
For new documents, what do I do when I get a new document doc(k)? Well, I have to compute the similarity of this document with all the previous ones, which doesn't require building a whole matrix. I can just take the inner product of doc(k) dot doc(j) for all previous j, and that results in S(k, j), which is great.
- Does anyone understand exactly what he means here, or have any good links where this rather obscure topic is explained? Is he right? I suspect that the ability to add new documents via this inner product, if he is right, will depend on fixing the vocabulary as mentioned above.
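If I understand him correctly, the key is that transform() (as opposed to fit_transform()) maps a new document into the space of the already-fitted vocabulary and IDF weights, so comparing it against all stored rows is a single sparse product rather than a refit. A hedged sketch of what I think he means (toy corpus stands in for the 33 fitted documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]  # stand-in for the previously fitted corpus

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # the expensive, one-off fit

# A new doc(k): transform() reuses the frozen vocabulary and IDF
# weights, so no refit over the whole corpus is needed.
new_doc = ["a cat sat quietly"]
new_vec = vectorizer.transform(new_doc)  # shape: (1, vocab_size)

# S(k, j) against all previous j in one product.
sims = cosine_similarity(new_vec, tfidf)[0]
print(sims)
```

One caveat I can see: words in the new document that were never in the fitted vocabulary are silently dropped, and the IDF weights slowly go stale as documents accumulate, so a periodic full refit would probably still be sensible.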