
So I am aware there are several methods for finding the most similar document, or say the three most similar documents, in a corpus of documents. I know there can be scaling issues; for now I have around ten thousand documents and have been running tests on a subset of around thirty. This is what I've got for now, but am considering looking into Elasticsearch or doc2vec if this proves to be impossible or inefficient.

The scripts work very nicely so far. They use spaCy to tokenise the text and scikit-learn's TfidfVectorizer to fit across all the documents, and very similar documents are found. I notice that the shape of the NumPy object coming out of the pipeline is (33, 104354), which probably implies a vocabulary of 104,354 terms, excluding stopwords, across all 33 documents. That step takes a good twenty minutes to run, but the next step, a matrix multiplication which computes all the cosine similarities, is very quick; still, I know it might slow down as that matrix grows to thousands rather than thirty rows.
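Roughly, the setup looks like this; a simplified sketch of my code (the real tokeniser does more, and in the full version the vectorizer sits inside an sklearn pipeline):

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load('en_core_web_sm')

def spacy_tokenize(text):
    # lowercased lemmas, skipping stopwords and punctuation
    return [t.lemma_.lower() for t in nlp(text) if not t.is_stop and not t.is_punct]

vectorizer = TfidfVectorizer(tokenizer=spacy_tokenize)
p = vectorizer.fit_transform(documents)   # documents is my list of 33 raw texts
print(p.shape)                            # (33, 104354)

# rows are L2-normalised by default, so this gives the cosine similarities
pairwise_similarity = (p * p.T).A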

If you could efficiently add a new document to the matrix, it wouldn't matter whether the initial compute took ten hours or even days, as long as you saved the result of that compute.
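For what it's worth, saving the expensive part looks straightforward, something like this (the file name is arbitrary, and save_npz needs a reasonably recent scipy):

from scipy import sparse

sparse.save_npz('tfidf_matrix.npz', p)   # the (33, 104354) matrix from above

# later, instead of refitting
p = sparse.load_npz('tfidf_matrix.npz')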

  1. When I press tab after the . there seems to be an attribute on the vectorizer called vectorizer.fixed_vocabulary_ . I can't find it on Google or in the scikit-learn docs. Anyway, when I access it, it returns False. Does anyone know what this is? Am thinking it might be useful to fix the vocabulary if possible, otherwise it might be troublesome to add a new document to the term-document matrix, although am not sure how to do that.

Someone asked a similar question here which got voted up but nobody ever answered.

He wrote:

For new documents, what do I do when I get a new document doc(k)? Well, I have to compute the similarity of this document with all the previous ones, which doesn't require building a whole matrix. I can just take the inner product of doc(k) dot doc(j) for all previous j, and that results in S(k, j), which is great.

  2. Does anyone understand exactly what he means here, or have any good links where this rather obscure topic is explained? Is he right? I somehow think that the ability to add new documents with this inner product, if he is right, will depend on fixing the vocabulary as mentioned above. A sketch of what I think he means is below.
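If I understand him correctly, it would be something like this (assuming the new document can be vectorised against the same fixed vocabulary, so the columns line up):

# p is the existing (n_docs x n_terms) tf-idf matrix, rows L2-normalised
# doc_k is a (1 x n_terms) tf-idf row vector for the new document, same vocabulary
similarities = (p * doc_k.T).A.ravel()   # S(k, j) for every existing document j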
cardamom
  • [fixed_vocabulary_](https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/feature_extraction/text.py#L270) corresponds to the `vocabulary` parameter in the constructor. It only tells if a custom vocabulary was sent to it, or its learned from the supplied data. – Vivek Kumar Jun 09 '17 at 16:07
  • Thanks ok I see: _vocabulary : Mapping or iterable, optional. Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents._ Will have to find a _vocabulary_ and give it one then. – cardamom Jun 09 '17 at 16:27

3 Answers


As a sequel to my comment on the other answer: yes, the missing vocabulary causes problems, at least it did for me. The problem is that when calculating the tf-idf values (or others), words that are not in the vocabulary are not taken into account. To visualize: when we have the sentence "This is karamba." and only the first two words are in the vocabulary, then they get a much higher score, because "karamba" is an unknown word and doesn't get a score at all. Thus "this" and "is" are much more important in the sentence than they would be if "karamba" were in the vocabulary (and keep in mind that "karamba" is the only word we actually want to look at in this sentence).

Okay, so why is it a problem if "karamba" isn't in the corpus at all? Because we get lots of false positives on the basis that "this" and "is" are really important, even though they're kind of meh.
How did I solve it? Not convenient, but doable.
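Before the fix, a tiny illustration of the effect (the sentence and vocabularies are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

sentence = ["This is karamba."]

# vocabulary missing "karamba": all the weight lands on "this" and "is"
partial = TfidfVectorizer(vocabulary=["this", "is"]).fit_transform(sentence)
print(partial.toarray())   # [[0.707 0.707]]

# vocabulary containing "karamba": the weight is spread over all three words
full = TfidfVectorizer(vocabulary=["this", "is", "karamba"]).fit_transform(sentence)
print(full.toarray())      # [[0.577 0.577 0.577]]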

First I create my corpus' vocabulary as suggested in the other answer.

import copy
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

corpus = []

# Populate the corpus with some data that I have
for d in sorted(os.listdir('d'), key=lambda x: int(x.split('.')[0])):
    with open(os.path.join('d', d)) as f:
        corpus.append(f.read())

corpus_tfidf_vectorizer = TfidfVectorizer()
corpus_tfidf_matrix = corpus_tfidf_vectorizer.fit_transform(corpus)

corpus_vocabulary = defaultdict(None, copy.deepcopy(corpus_tfidf_vectorizer.vocabulary_))
corpus_vocabulary.default_factory = corpus_vocabulary.__len__

Why defaultdict? This is a neat trick I stole from the implementation of the vocabulary creation inside TfidfVectorizer. If you want to look it up, check sklearn.feature_extraction.text.CountVectorizer._count_vocab. Essentially it's just a way of adding words to the vocabulary without worrying too much about the correct indices.
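For example, merely looking a word up inserts it with the next free index (the word below is just for illustration):

n = len(corpus_vocabulary)
corpus_vocabulary['some_unseen_word']          # the lookup alone adds the entry
print(corpus_vocabulary['some_unseen_word'])   # == n, the next free index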

Anywhoo, now we get to the queries that we want to add into the corpus.

# Let's say I got a query value from somewhere
query = f.read()

query_vocabulary_vectorizer = TfidfVectorizer()
query_vocabulary_vectorizer.fit_transform([query])
for word in query_vocabulary_vectorizer.vocabulary_.keys():
    # the lookup adds the word to corpus_vocabulary with the next free index
    # if it isn't in there already
    corpus_vocabulary[word]

# Nice, everything in the vocabulary now!

query_tfidf_matrix = TfidfVectorizer(vocabulary=corpus_vocabulary).fit_transform([query])

Note from 2020: This part may not be required if you're in the relative future compared to 2018 and you have a newer version of scipy.

Oh-kay, now we have to merge the corpus matrix. That is problematic since the matrices aren't the same size any more. We have to resize the corpus matrix because now we (might) have more words in there, and we can't merge them without making them the same size. The funny and sad thing about this is that scipy.sparse supports resizing matrices, but resizing CSR matrices isn't supported in the released version of scipy. Thus I installed the master branch of scipy from an arbitrary commit: pip install git+git://github.com/scipy/scipy.git@b8bf38c555223cca0bcc1e0407587c74ff4b3f2e#egg=scipy. PS! You need cython installed to build scipy on your own machine (just pip install cython).

- me in 2018

So that was a hassle, but now we can happily declare:

from scipy import sparse as sp

corpus_tfidf_matrix.resize((corpus_tfidf_matrix.shape[0], query_tfidf_matrix.shape[1]))
# And voilà, we can merge now!
tfidf_matrix = sp.vstack([corpus_tfidf_matrix, query_tfidf_matrix])

Bam, done. The other answer is still the correct one, I'm just elaborating on that solution.

wanaryytel
  • Thanks for the answer @wanaryytel! I think you might have a typo. In the first part of the code you use `corpus_vocabulary = defaultdict(None, copy.deepcopy(corpus_count_vectorizer.vocabulary_))`. I think it should be `corpus_tfidf_vectorizer` instead of `corpus_count_vectorizer`. – ayalaall Jan 15 '20 at 14:19
  • Another comment @wanaryytel: You are creating `query_vocabulary` in the second chunk of the code, however from what I can see you are not using it. When you iterate to add the word from the query to `corpus_vocabulary` you iterate over `query_vocabulary_vectorizer` and not over `query_vocabulary`. Cheers :) – ayalaall Jan 15 '20 at 15:26
  • @ayalaall Hmm indeed, I condensed this example from a much bigger code base so evidently I messed some bits up. :) The `query_vocabulary` isn't used indeed but I think the `query_vocabulary_vectorizer.fit_transform([query])` call is still necessary to populate the internal vocab... Right? It's been a while since I did this. :) Anyways, fixed it, thanks! – wanaryytel Jan 15 '20 at 15:31
  • I agree! You still need to call `query_vocabulary_vectorizer.fit_transform([query])` – ayalaall Jan 15 '20 at 16:45

OK, I solved it. It took many hours; the other post around this topic confused me with the way it described the linear algebra, and it failed to mention an aspect which was perhaps obvious to the guy who wrote it.

So thanks for the information about the vocabulary.

So vectorizer was an instance of sklearn.feature_extraction.text.TfidfVectorizer. I used the vocabulary_ attribute to pull out the vocabulary of the existing 33 texts:

v = vectorizer.vocabulary_
print(type(v))
>> <class 'dict'>
print(len(v))
>> 104354

Pickled this dictionary for future use and, just to test that it worked, reran fit_transform on the pipeline object containing the TfidfVectorizer with the parameter vocabulary=v, which worked.
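The pickling step is just the standard pattern (the file name here is arbitrary):

import pickle

with open('vocabulary.pkl', 'wb') as handle:
    pickle.dump(v, handle)

# later, to rebuild a vectorizer with the same fixed vocabulary
with open('vocabulary.pkl', 'rb') as handle:
    v = pickle.load(handle)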

The original pairwise similarity matrix was found with pairwise_similarity = (p * p.T).A, where p is the output of the fitted pipeline, i.e. the term-document matrix.

Added a small new document:

new_document = """

Remove the lamb from the fridge 1 hour before you want to cook it, to let it come up to room temperature. Preheat the oven to 200ºC/400ºC/gas 6 and place a roasting dish for the potatoes on the bottom. Break the garlic bulb up into cloves, then peel 3, leaving the rest whole.
"""

Fitted the pipeline to just the one document, with its now fixed vocabulary:

p_new = pipe.fit_transform([new_document]) 
print (p_new.shape)
>> (1, 104354)

Then put them together like this:

from scipy.sparse import vstack as vstack_sparse_matrices
p_combined = vstack_sparse_matrices([p, p_new])
print (p_combined.shape)
>> (34, 104354)

and reran the pairwise similarity equation:

pairwise_similarity = (p_combined * p_combined.T).A

Was not totally confident about the code or the theory, but I believe this is correct and it has worked: the proof of the pudding is in the eating, and my later code found the most similar documents to also be cooking related. Changed the original document to several other topics, reran it all, and the similarities were exactly as you would expect them to be.
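In case it helps, pulling the most similar existing documents out of that matrix is just array indexing; a minimal sketch (assuming the new document is the last row):

import numpy as np

# similarity of the new document (last row) to every existing document
new_doc_similarities = pairwise_similarity[-1, :-1]

# indices of the three most similar existing documents, best first
top_three = np.argsort(new_doc_similarities)[::-1][:3]
print(top_three, new_doc_similarities[top_three])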

cardamom
  • Thanks! I had the same thought about optimizing my code, but didn't know how to do it - you helped me out. But one caveat with this approach (which might not be an issue, but is worth noting): when you specify a vocabulary to the `TfidfVectorizer`, the new words won't be added - at least for me they aren't. So if there are any words in `new_document` that aren't in the vocabulary, they are ignored and forgotten about wholly. Not sure if it's a problem in some cases or not. – wanaryytel Apr 02 '18 at 14:26
  • I knew that when I built my application - made an effort to use as many documents as possible to assemble the vocabulary, which ended up running into hundreds of thousands, as I knew that it would be fixed from that point on (without repeating that exercise). Assumed all the important stuff was already in that vocab though. Pleased you found this useful, it has been a while since I last used it. – cardamom Apr 09 '18 at 10:34
  • Thanks! Very useful. Do you have an idea on how to calculate the linear kernel or cosine similarity from the combined matrix? Or perhaps how to get the similarity score of the new document? – BringBackCommodore64 May 31 '19 at 16:55

We will fit on the training data and then add the new message to the trained model:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from scipy.sparse import vstack as vstack_sparse_matrices

# X_train / Y_train are the existing training texts and labels
tf_one = TfidfVectorizer(analyzer='word', stop_words="english", lowercase=True)
X_train_one = tf_one.fit_transform(X_train)
nb_one = MultinomialNB()
nb_one.fit(X_train_one, Y_train)

# New document and its label
mymessage_X = "This message will be added to the existing model"
label_Y = "your label"

# When you receive a new document, transform it with the already fitted vectorizer
X = tf_one.transform([mymessage_X])
prediction = nb_one.predict(X)
print(prediction)

# Vectorise the new document with the vocabulary fixed to tf_one's
tf_two = TfidfVectorizer(analyzer='word', stop_words="english", lowercase=True,
                         vocabulary=tf_one.vocabulary_)
X_train_two = tf_two.fit_transform([mymessage_X])
nb = MultinomialNB()
nb.fit(X_train_two, [label_Y])

# print the length of the tf_two vocabulary (the same as tf_one's)
print(len(tf_two.vocabulary_))

# Stack the two matrices and recompute the pairwise similarities
p_combined = vstack_sparse_matrices([X_train_one, X_train_two])
print(p_combined.shape)

pairwise_similarity = (p_combined * p_combined.T).A
pairwise_similarity