5

I'm trying to use Textacy to calculate the TF-IDF score for a single word across the standard corpus, but am a bit unclear about the result I am receiving.

I was expecting a single float which represented the frequency of the word in the corpus. So why am I receiving a list (?) of 7 results?

"acculer" is actually a French word, so was expecting a result of 0 from an English corpus.

word = 'acculer'
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
tf_idf = vectorizer.fit_transform(word)
logger.info("tf_idf:")
logger.info(tf_idf)

Output

tf_idf:
(0, 0)  2.386294361119891
(1, 1)  1.9808292530117262
(2, 1)  1.9808292530117262
(3, 5)  2.386294361119891
(4, 3)  2.386294361119891
(5, 2)  2.386294361119891
(6, 4)  2.386294361119891

The second part of the question is: how can I provide my own corpus to the TF-IDF function in Textacy, especially one in a different language?

EDIT

As mentioned by @Vishal, I have logged the output using this line:

logger.info(vectorizer.vocabulary_terms)

It seems the provided word acculer has been split into characters:

{'a': 0, 'c': 1, 'u': 5, 'l': 3, 'e': 2, 'r': 4}

(1) How can I get the TF-IDF for this word against the corpus, rather than each character?

(2) How can I provide my own corpus and point to it as a param?

(3) Can TF-IDF be used at a sentence level? i.e., what is the relative frequency of this sentence's terms against the corpus?

port5432

2 Answers

14

Fundamentals

Let's get the definitions clear before looking at the actual questions.

Assume our corpus contains 3 documents (d1, d2 and d3 respectively):

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]

Term Frequency (tf)

tf (of a word) is defined as the number of times the word appears in a document.

tf(word, document) = count(word, document) # Number of times word appears in the document

tf is defined for a word at document level.

tf('a',d1)     = 1      tf('a',d2)     = 1      tf('a',d3)     = 1
tf('apple',d1) = 1      tf('apple',d2) = 1      tf('apple',d3) = 0
tf('cat',d1)   = 0      tf('cat',d2)   = 0      tf('cat',d3)   = 1
tf('green',d1) = 0      tf('green',d2) = 1      tf('green',d3) = 0
tf('is',d1)    = 1      tf('is',d2)    = 1      tf('is',d3)    = 1
tf('red',d1)   = 1      tf('red',d2)   = 0      tf('red',d3)   = 0
tf('this',d1)  = 1      tf('this',d2)  = 1      tf('this',d3)  = 1

Using raw counts has the problem that tf values of words in longer documents are high compared to those in shorter documents. This can be solved by normalizing the raw counts, dividing by the document length (the number of words in the corresponding document); this is called l1 normalization. The document d1 can now be represented by a tf vector containing the tf values of all the words in the vocabulary of the corpus. There is another kind of normalization, called l2, which makes the l2 norm of the document's tf vector equal to 1.

tf(word, document, normalize='l1') = count(word, document)/|document|
tf(word, document, normalize='l2') = count(word, document)/l2_norm(document)
|d1| = 5, |d2| = 5, |d3| = 4
l2_norm(d1) = 2.236, l2_norm(d2) = 2.236, l2_norm(d3) = 2.0
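
These normalizations are easy to verify by hand; a minimal sketch with numpy (the count vector for d1 follows the vocabulary order {a, apple, cat, green, is, red, this}):

import numpy as np

# Raw tf counts for d1 = "this is a red apple"
d1 = np.array([1, 1, 0, 0, 1, 1, 1], dtype=float)

print(d1 / d1.sum())            # l1: each nonzero value is 1/5 = 0.2
print(d1 / np.linalg.norm(d1))  # l2: each nonzero value is 1/sqrt(5) ~ 0.447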

Code : tf

corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
# Convert docs to textacy format
textacy_docs = [textacy.Doc(doc) for doc in corpus]

for norm in [None, 'l1', 'l2']:
    # tokenize the documents
    tokenized_docs = [
        doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
        for doc in textacy_docs]

    # Fit the tf matrix 
    vectorizer = textacy.Vectorizer(apply_idf=False, norm=norm)
    doc_term_matrix = vectorizer.fit_transform(tokenized_docs)

    print ("\nVocabulary: ", vectorizer.vocabulary_terms)
    print ("TF with {0} normalize".format(norm))
    print (doc_term_matrix.toarray())

Output:

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with None normalize
[[1 1 0 0 1 1 1]
 [1 1 0 1 1 0 1]
 [1 0 1 0 1 0 1]]

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with l1 normalize
[[0.2  0.2  0.   0.   0.2  0.2  0.2 ]
 [0.2  0.2  0.   0.2  0.2  0.   0.2 ]
 [0.25 0.   0.25 0.   0.25 0.   0.25]]

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
TF with l2 normalize
[[0.4472136 0.4472136 0.        0.        0.4472136 0.4472136 0.4472136]
 [0.4472136 0.4472136 0.        0.4472136 0.4472136 0.        0.4472136]
 [0.5       0.        0.5       0.        0.5       0.        0.5      ]]

The rows of the tf matrix correspond to documents (hence 3 rows for our corpus) and the columns correspond to the words in the vocabulary (the index of each word is shown in the vocabulary dictionary).

Inverse Document Frequency (idf)

Some words convey less information than others. For example, words like the, a, an, this, and that are very common and convey very little information. idf is a measure of the importance of a word: a word appearing in many documents is considered less informative than a word appearing in only a few.

idf(word, corpus) = log(|corpus| / No:of documents containing word) + 1  # standard idf

For our corpus, intuitively, idf('apple', corpus) < idf('cat', corpus):

idf('apple', corpus) = log(3/2) + 1 = 1.405 
idf('cat', corpus) = log(3/1) + 1 = 2.098
idf('this', corpus) = log(3/3) + 1 = 1.0

Code : idf

textacy_docs = [textacy.Doc(doc) for doc in corpus]    
tokenized_docs = [
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
    for doc in textacy_docs]

vectorizer = textacy.Vectorizer(apply_idf=False, norm=None)
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)

print ("\nVocabulary: ", vectorizer.vocabulary_terms)
print ("standard idf: ")
print (textacy.vsm.matrix_utils.get_inverse_doc_freqs(doc_term_matrix, type_='standard'))

Output:

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
standard idf: 
[1.     1.405       2.098       2.098       1.      2.098       1.]

Term Frequency–Inverse Document Frequency (tf-idf)

tf-idf is a measure of how important a word is to a document in a corpus. The tf of a word weighted by its idf gives us the tf-idf measure of the word.

tf-idf(word, document, corpus) = tf(word, document) * idf(word, corpus)
tf-idf('apple', 'd1', corpus) = tf('apple', 'd1') * idf('apple', corpus) = 1 * 1.405 = 1.405
tf-idf('cat', 'd3', corpus) = tf('cat', 'd3') * idf('cat', corpus) = 1 * 2.098 = 2.098

Code : tf-idf

textacy_docs = [textacy.Doc(doc) for doc in corpus]

tokenized_docs = [
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
    for doc in textacy_docs]

print ("\nVocabulary: ", vectorizer.vocabulary_terms)
print ("tf-idf: ")

vectorizer = textacy.Vectorizer(apply_idf=True, norm=None, idf_type='standard')
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
print (doc_term_matrix.toarray())

Output:

Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
tf-idf: 
[[1.         1.405   0.         0.         1.         2.098   1.        ]
 [1.         1.405   0.         2.098      1.         0.      1.        ]
 [1.         0.      2.098      0.         1.         0.      1.        ]]

Now coming to the questions:

(1) How can I get the TF-IDF for this word against the corpus, rather than each character?

As seen above, tf-idf is not defined for a word in isolation; the tf-idf of a word is always with respect to a document in a corpus.
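
If what you want is the tf-idf value of one particular word in one particular document, you can look it up in the term matrix built above. A minimal sketch, assuming the vectorizer and doc_term_matrix from the tf-idf code block:

word = 'apple'
if word in vectorizer.vocabulary_terms:
    col = vectorizer.vocabulary_terms[word]   # column index of the word
    print(doc_term_matrix[0, col])            # tf-idf of 'apple' in d1, ~1.405
else:
    print(0.0)  # word is not in the corpus vocabulary (e.g. 'acculer')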

(2) How can I provide my own corpus and point to it as a param?

It is shown in the above samples; a sketch for a corpus in another language follows the list below.

  1. Convert the text documents into textacy Docs using the textacy.Doc API.
  2. Tokenize the textacy.Docs using the to_terms_list method. (With this method you can add unigrams, bigrams or trigrams to the vocabulary, filter out stop words, normalize text, etc.)
  3. Use textacy.Vectorizer to create the term matrix from the tokenized documents. The term matrix returned is
    • tf (raw counts): apply_idf=False, norm=None
    • tf (l1 normalized): apply_idf=False, norm='l1'
    • tf (l2 normalized): apply_idf=False, norm='l2'
    • tf-idf (standard): apply_idf=True, idf_type='standard'
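
Putting those steps together for a non-English corpus, here is a minimal sketch. It assumes the same textacy version as above, that a French spaCy model (e.g. fr_core_news_sm) has been downloaded, and that textacy.Doc accepts an already-processed spaCy document; the model name and documents below are only placeholders:

import spacy
import textacy

# Hypothetical French documents -- replace with your own corpus
french_corpus = ["le chat est sur la table", "le chien est dans le jardin"]

# Assumes: python -m spacy download fr_core_news_sm
fr_nlp = spacy.load('fr_core_news_sm')

textacy_docs = [textacy.Doc(fr_nlp(text)) for text in french_corpus]
tokenized_docs = [
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
    for doc in textacy_docs]

vectorizer = textacy.Vectorizer(apply_idf=True, norm=None, idf_type='standard')
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
print(vectorizer.vocabulary_terms)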

(3) Can TF-IDF be used at a sentence level? i.e., what is the relative frequency of this sentence's terms against the corpus?

Yes you can, provided you treat each sentence as a separate document. In that case the tf-idf vector (the full row) of the corresponding document can be treated as a vector representation of the document (which is a single sentence in your case).

In the case of our corpus (which in fact contains a single sentence per document), the vector representations of d1 and d2 should be close compared to those of d1 and d3. Let's check with cosine similarity:

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(doc_term_matrix)

Output

array([[1.        ,     0.53044716,     0.35999211],
       [0.53044716,     1.        ,     0.35999211],
       [0.35999211,     0.35999211,     1.        ]])

As you can see, cosine_similarity(d1, d2) = 0.53 and cosine_similarity(d1, d3) = 0.36, so d1 and d2 are indeed more similar than d1 and d3 (1 meaning identical direction and 0 meaning not similar at all, i.e. orthogonal vectors).

Once you have trained your Vectorizer, you can pickle the trained object to disk for later use.
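
For example, a minimal sketch with the standard pickle module (the file name is arbitrary, and it assumes the fitted Vectorizer supports transform like a scikit-learn vectorizer):

import pickle

# Save the fitted vectorizer to disk
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

# Later: load it back and vectorize new tokenized documents without refitting
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
new_matrix = vectorizer.transform(tokenized_docs)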

Conclusion

tf of a word is at the document level, idf of a word is at the corpus level, and tf-idf of a word is at the document level with respect to the corpus. These measures are well suited for vector representations of a document (or of a sentence, when a document is made up of a single sentence). If you are interested in vector representations of words, explore word embeddings instead (word2vec, fastText, GloVe, etc.).

mujjiga
  • Great Explanation. – vb_rises Apr 23 '19 at 11:57
  • If I used parameters analyzer='word' and token_pattern =, then will it tokenize as per sentence(assume that regex is correct) and created tf-idf on sentence level instead of word? If it works, is it good approach? – vb_rises Apr 23 '19 at 13:54
  • @Vishal in that case the token(s) will the word/words matching the regex, similar to bigram/trigram/ngram tokens. tf-idf will then be defined for that word/words matching the regex. – mujjiga Apr 23 '19 at 17:37
  • Yeah. But what if we define our regex in such a way that it finds sentences? If this works, then I guess @ardochhigh 's 3rd question will be answered. – vb_rises Apr 23 '19 at 17:41
  • 1
    With the latest version of textacy, you will need to update `textacy.Doc(doc)` to `textacy.make_spacy_doc(doc, lang=en)` where lang is optional. You will also need to traverse an additional layer for the term list `doc._.to_terms_list`. Lastly, update the Vectorizer; first import `from textacy import vsm` and then call the vectorizer with `vsm.Vectorizer`. – zwelz Mar 18 '20 at 17:37
  • one question - if ```corpus = ["there is a red apple and green apple", "this is a green apple", "this is a cat"]``` , and then I want to calculate the ```idf('apple', corpus)``` , does it become ```log(3/3) + 1 ``` or ```log(3/2) + 1``` ... does the ```apple``` occurring in the first document count once or twice – Rajarshi Ghosh Apr 06 '22 at 07:21
1

You can get TF-IDF for the word against the corpus.

docs = ['this is me','this was not that you thought', 'lets test them'] ## create a list of documents
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
vec.fit(docs) ##fit your documents

print(vec.vocabulary_) #print vocabulary, don't run for 2.5 million documents

Output: each vocabulary word is assigned a unique index

{u'me': 2, u'them': 6, u'that': 5, u'this': 7, u'is': 0, u'thought': 8, u'not': 3, u'lets': 1, u'test': 4, u'you': 10, u'was': 9}

print(vec.idf_) 

Output: prints idf value for each vocabulary word

[ 1.69314718  1.69314718  1.69314718  1.69314718  1.69314718  1.69314718 1.69314718  1.28768207  1.69314718  1.69314718  1.69314718]

Now, as per your question, let's say you want to find the tf-idf for some word; you can get it as:

word = 'thought' #example    
index = vec.vocabulary_[word] 
>8
print(vec.idf_[index]) #prints idf value
>1.6931471805599454
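
Note that vec.idf_ gives only the idf part. To get the tf-idf of the word in a particular document, transform the documents and index the resulting matrix; a small follow-up sketch using the same vec, docs and index as above:

tfidf_matrix = vec.transform(docs)  # shape: (n_docs, n_vocab), l2-normalized by default
print(tfidf_matrix[1, index])       # tf-idf of 'thought' in the 2nd document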

Reference: 1. prepare-text

Now, doing the same operation with textacy:

import spacy
import textacy

nlp = spacy.load('en')  ## install it with: python -m spacy download en (run as administrator)

doc_strings = [
    'this is me', 'this was not that you thought', 'lets test them'
]
docs = [nlp(string.lower()) for string in doc_strings]
corpus = textacy.Corpus(nlp, docs=docs)
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
doc_term_matrix = vectorizer.fit_transform(
    (doc.to_terms_list(ngrams=1, normalize='lower', as_strings=True, filter_stops=False) for doc in corpus))

print(vectorizer.terms_list)
print(doc_term_matrix.toarray())

Output

['is', 'lets', 'me', 'not', 'test', 'that', 'them', 'this', 'thought', 'was', 'you']


[[1.69314718 0.         1.69314718 0.         0.         0.
  0.         1.28768207 0.         0.         0.        ]
 [0.         0.         0.         1.69314718 0.         1.69314718
  0.         1.28768207 1.69314718 1.69314718 1.69314718]
 [0.         1.69314718 0.         0.         1.69314718 0.
  1.69314718 0.         0.         0.         0.        ]]

Reference: link

vb_rises
  • This looks good. Is there a way to save the ```vec``` object to disk after this step ```vec.fit(docs)``` ie: don't recalculate the corpus? – port5432 Apr 23 '19 at 05:11
  • 1
    Yes. you can store it using pickle module. `import pickle with open('vec','wb') as f: pickle.dump(vectorizer, f) f.close()` `with open('vec','rb') as f: temp = pickle.load(f) f.close()` – vb_rises Apr 23 '19 at 12:00