
I have 8 documents and I ran TF-IDF on them to get an array. How do I find out which document is the best match for a given input query?

from sklearn.feature_extraction.text import TfidfVectorizer

# tokenize is my own tokenizer function
all_documents = [doc1, doc2, ...., doc7]
sklearn_tfidf = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize)
sklearn_representation = sklearn_tfidf.fit_transform(all_documents).toarray()
user3235169
  • By the best document, do you mean the document closest to the input query? You have to transform the input query with the fitted `TfidfVectorizer` and then find the distance from that vector to each of the 7 document vectors that you have (this can be cosine distance or Euclidean distance). – titipata Mar 21 '17 at 08:32
  • @titipat thanks for the approach. But as I understand it, to compute any of the mentioned distances the vectors must have the same length. How do I do that? – user3235169 Mar 22 '17 at 01:36

1 Answer


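Transform the input query to tf-idf format using the same TfidfVectorizer that you fitted on the documents (call its transform method on the query instead of fitting again). You can then use a distance metric (cosine, Euclidean, Manhattan, ...) to find the document that is closest to your input.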
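As a minimal sketch of that approach: the documents and the query below are made-up stand-ins (not the asker's data), and the vectorizer uses only a subset of the asker's settings, without the custom tokenizer. `transform` and `cosine_similarity` are existing scikit-learn calls; everything else is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the real documents.
all_documents = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]

# Fit once on the corpus; this fixes the vocabulary and the idf weights.
vectorizer = TfidfVectorizer(norm="l2", sublinear_tf=True)
doc_matrix = vectorizer.fit_transform(all_documents)

# transform() (not fit_transform()) reuses that vocabulary, so the query
# vector has the same number of columns as the document vectors.
query = "markets today"
query_vec = vectorizer.transform([query])

# One similarity score per document; the highest score is the best match.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
best_index = int(scores.argmax())
print(best_index, all_documents[best_index])
```

Cosine similarity is used here because the vectors are already L2-normalised; with a distance such as Euclidean you would pick the smallest value instead of the largest.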

Each of the documents should use the same vocabulary. I assume that your 8 document vectors have the same length? The sklearn_tfidf object that you created has an attribute vocabulary_ that maps every word used in the vectors to its column index. Your input query should be reduced to only those words; the vectorizer's transform method does this for you by ignoring any word that is not in the vocabulary.

Example

Document1: dogs are cute
Document2: cats are awful

This leads to a vocabulary of [dogs, cats, are, cute, awful]. Query words outside these 5 cannot contribute to the match. For example, if your query is cute animals, the word animals carries no information, because it does not occur in any of the documents. The query thus reduces to the following vector: [0,0,0,1,0], since cute is the only query word that can be found in the documents.
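The same two-document example can be checked with scikit-learn. This is only a sketch using a default TfidfVectorizer rather than the asker's settings. Note that scikit-learn orders the vocabulary alphabetically, so the column order differs from the list above, although cute still happens to land in the fourth column.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["dogs are cute", "cats are awful"]

vectorizer = TfidfVectorizer()  # default settings, for illustration only
doc_matrix = vectorizer.fit_transform(documents)

# Vocabulary terms ordered by their column index.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)

# "animals" is not in the vocabulary, so transform() silently drops it.
query_vec = vectorizer.transform(["cute animals"]).toarray()[0]
print(query_vec)

# Rows and query are L2-normalised, so the dot product is the cosine similarity.
scores = doc_matrix.toarray() @ query_vec
print(scores)
```

Only the first document shares a word with the query, so it is the only one with a non-zero score.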

PinkFluffyUnicorn
  • Thanks for the approach. But as I understand it, to compute any of the mentioned distances the vectors must have the same length. How do I do that? – user3235169 Mar 22 '17 at 01:36
  • Each of the documents should use the same vocabulary. I assume that your 8 document vectors have the same length? The `sklearn_tfidf` object that you created has an attribute `vocabulary_` that contains all words that are used in the vectors. Your input query should be reduced to only containing those words. – PinkFluffyUnicorn Mar 22 '17 at 07:40
  • Yes, the length of every vector is 1058. But how do I convert the input string to a vector of length 1058? – user3235169 Mar 22 '17 at 08:06
  • I tried and can see the vocabulary list with length 1058. Should I build a vector of 1's and 0's by comparing every word in the input query to the `vocabulary_`, or is there a method in scikit-learn which I can use directly? – user3235169 Mar 22 '17 at 08:19
  • I think [this thread](http://stackoverflow.com/questions/11911469/tfidf-for-search-queries) can help you out. – PinkFluffyUnicorn Mar 22 '17 at 08:47