
Let's say I fit a TfidfVectorizer on a single document:

text="bla agao haa"
singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range= 
(4,6),preprocessor=my_tokenizer, max_features=100).fit([text])

single=singleTFIDF.transform([text])
query = singleTFIDF.transform(["new coming document"])

If I understand correctly, transform just uses the learned weights from fit. So, for the new document, query contains the weights for each feature within that document. It looks like [[0, 0, 0, 0.13, 0.4, 0]]

As I use n-grams, I would also like to get the feature names for this new document, so that I know which weight belongs to which feature of this document.
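Something like the following sketch is what I am after (get_feature_names_out() in newer scikit-learn versions, get_feature_names() in older ones):

import numpy as np

feature_names = singleTFIDF.get_feature_names_out()  # one n-gram per column
row = query.toarray()[0]
for idx in np.nonzero(row)[0]:
    print(repr(feature_names[idx]), row[idx])  # n-gram and its tf-idf weight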

EDIT:

In my case I get the following arrays for single and query:

single
[[0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125]]
query
[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.57735027 0.57735027 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.57735027 0.         0.
  0.         0.         0.        ]]

But this is strange: in the learned corpus (single), all features have a weight of 0.10721125. So how can a feature of the new document have a weight of 0.57735027?
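For reference, a quick check of the arithmetic (my own sketch): single has 87 equal non-zero entries and query has three, which matches the printed values exactly if each row is l2-normalised.

import numpy as np

print(1 / np.sqrt(87))  # 0.10721125..., the value of every entry in single
print(1 / np.sqrt(3))   # 0.57735027..., each non-zero entry in query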

  • What are you trying to analyse - character or word n-grams? – KRKirov Jan 22 '19 at 19:24
  • I use char_wb, but what do you mean? –  Jan 22 '19 at 19:28
  • ‘char_wb’ creates character n-grams only from text inside word boundaries - is this really what you want? – KRKirov Jan 22 '19 at 19:29
  • The question is not about that! –  Jan 22 '19 at 19:47
  • True, but reading it one wonders whether you understand what your code is doing. It would be easier to give you an example with word n-grams, rather than character n-grams. – KRKirov Jan 22 '19 at 19:49
  • I wonder if this is the right way to calculate the similarity between those two docs. I thought transform uses the learned feature weights, but it just calculates the weights for each doc separately, so they are uncorrelated? –  Jan 22 '19 at 19:53

2 Answers


Details of how Scikit-Learn calculates tf-idf are available here, and below is an example of its implementation using word n-grams.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Train the vectorizer
text = "this is a simple example"
singleTFIDF = TfidfVectorizer(ngram_range=(1, 2)).fit([text])
singleTFIDF.vocabulary_  # show the term / column-index pairs

# Analyse the training string - text
single = singleTFIDF.transform([text])
single.toarray()  # all values are equal because all terms are present

# Analyse three new strings with the trained vectorizer
doc_1 = ['is this example working', 'hopefully it is a good example', 'no matching words here']

query = singleTFIDF.transform(doc_1)
query.toarray()  # only matched terms have non-zero values

# Compute the cosine similarity between text and doc_1 - the second string
# has only two matching terms, therefore it has a lower similarity value
cos_similarity = cosine_similarity(single.A, query.A)

Output:

singleTFIDF.vocabulary_ 
Out[297]: 
{'this': 5,
 'is': 1,
 'simple': 3,
 'example': 0,
 'this is': 6,
 'is simple': 2,
 'simple example': 4}

single.toarray()
Out[299]: 
array([[0.37796447, 0.37796447, 0.37796447, 0.37796447, 0.37796447,
        0.37796447, 0.37796447]])

query.toarray()
Out[311]: 
array([[0.57735027, 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        ],
       [0.70710678, 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ]])

np.sum(np.square(query.toarray()), axis=1) # note how all rows with non-zero scores have been normalised to 1.
Out[3]: array([1., 1., 0.])

cos_similarity
Out[313]: array([[0.65465367, 0.53452248, 0.        ]])
KRKirov
  • Thanks! I did the same with n-grams. Like in the matrices above, can I map the n-gram features found inside the query to their corresponding weights? So does query save the corresponding positions of the features? –  Jan 22 '19 at 20:41
  • Your question is not very clear. My example also uses n-grams (mono- and bi-grams), however the n-grams are words. By using analyzer='char_wb' you are determining the frequency of character combinations, not of words. Back to my example - singleTFIDF.vocabulary_ shows you the vocabulary and the position of each term in the matrices that toarray() puts out. "example" has position 0; correspondingly, position 0 of single and of the first two query strings is non-zero. If your new documents don't contain any of the words in the vectorizer vocabulary, they get only 0s. – KRKirov Jan 23 '19 at 01:15
  • Ok. So with the positions in the vocab I could find the features within the matrices? –  Jan 23 '19 at 10:37
  • Precisely, .vocabulary_ shows you the term and its column index as key-value pairs. – KRKirov Jan 23 '19 at 13:15
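For example, a small sketch of that lookup, inverting the key-value pairs of the vocabulary shown above:

index_to_term = {idx: term for term, idx in singleTFIDF.vocabulary_.items()}
index_to_term[0]  # 'example'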

The new document has different weights because TfidfVectorizer normalizes the weights. Hence, set the parameter norm to None. The default value of norm is 'l2'.
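For illustration, a minimal sketch (my own, reusing the word n-gram example from the other answer) of the effect of norm=None - transform then returns the raw tf * idf products instead of l2-normalised rows:

from sklearn.feature_extraction.text import TfidfVectorizer

raw = TfidfVectorizer(ngram_range=(1, 2), norm=None).fit(["this is a simple example"])
raw.transform(["is this example working"]).toarray()
# array([[1., 1., 0., 0., 0., 1., 0.]]) - counts * idf, rows no longer unit length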

To understand more about the effect of norm, I would recommend you look at my answer to this question.

Venkatachalam
  • Many thanks! I would like to verify the features and weights for a new document, to analyse how the similarity is created. Up to now I get the top 5 similar docs, but the docs with most of the features inside have the lowest similarity! That is strange. I will look at your post and norm! –  Jan 23 '19 at 15:54
  • Transform builds new tf-idf weights, right? I thought it would use the weights learned from fit (on a large data set, possibly). I would rather like to score two docs with the weights trained from fit previously. Do you get me? –  Jan 23 '19 at 18:28
  • I wonder why not just sum the feature weights for the new document with the weights learned from the corpus to identify the most similar, instead of using cosine similarity. –  Jan 23 '19 at 18:59
  • Hey, I am really confused about the meaning of the positions in the array single, e.g. from above. How can I get from the indices of single or query to the corresponding features from get_feature_names or vocabulary_? That is really confusing. I can't find anything anywhere. –  Jan 23 '19 at 19:59