Transform the input to tf-idf format using `TfidfVectorizer`. You can then use a distance metric (cosine, euclidean, manhattan, ...) to find the document that is closest to your input.
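For instance, a minimal sketch of this approach with cosine similarity (the `documents` list and the `query` string below are placeholders for your own data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus; replace with your own 8 documents.
documents = [
    "first document text",
    "another document about something else",
    "a third piece of text",
]
query = "some input text"

# Fit on the documents so they all share one vocabulary.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Transform the query with the same fitted vectorizer (do not re-fit).
query_vector = vectorizer.transform([query])

# Cosine similarity between the query and every document; pick the closest.
similarities = cosine_similarity(query_vector, doc_vectors).ravel()
print(documents[similarities.argmax()])
```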
Each of the documents should use the same vocabulary. I assume that your 8 document vectors have the same length? The `sklearn_tfidf` object that you created has an attribute `vocabulary_` that contains all words that are used in the vectors. Your input query should be reduced to contain only those words.
**Example**
Document1: dogs are cute
Document2: cats are awful
This leads to a vocabulary of `[dogs, cats, are, cute, awful]`. A query containing words other than these 5 cannot be used. For example, if your query is `cute animals`, then `animals` has no meaning, because it cannot be found in any of the documents. The query thus reduces to the following vector: `[0, 0, 0, 1, 0]`, since `cute` is the only word that can be found in the documents.
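A quick check of this example in code (a sketch; note that sklearn assigns column indices in alphabetical order, `are, awful, cats, cute, dogs`, which here still puts `cute` in the fourth position):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["dogs are cute", "cats are awful"]
vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# Each word maps to a column index; sklearn sorts the vocabulary
# alphabetically, so 'cute' ends up at index 3.
print(vectorizer.vocabulary_)

# "animals" is not in the vocabulary, so it is ignored; only "cute"
# contributes, and after normalization the query vector is [0, 0, 0, 1, 0].
print(vectorizer.transform(["cute animals"]).toarray())
```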