
My goal is to find a similarity value between two documents (collections of words). I have already found several answers like this SO post or this SO post which provide Python libraries that achieve this, but I have trouble understanding the approach and making it work for my use case.

If I understand correctly, TF-IDF of a document is computed with respect to a given term, right? That's how I interpret it from the Wikipedia article on this: "tf-idf...is a numerical statistic that is intended to reflect how important a word is to a document".

In my case, I don't have a specific search term which I want to compare to the document, but I have two different documents. I assume I need to first compute vectors for the documents, and then take the cosine between these vectors. But all the answers I found with respect to constructing these vectors always assume a search term, which I don't have in my case.

I can't wrap my head around this; any conceptual help or links to Java libraries that achieve this would be highly appreciated.

  • Run a term extraction before, and once you have the list of terms with their frequencies for both corpora, calculate the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). – Wiktor Stribiżew Nov 23 '16 at 14:17
  • @Wiktor Stribiżew: Thanks for the suggestion. So I extract the terms of both documents into a list. And then for each of those terms, I compute the tf-idf values for each of the two documents, which gives me two vectors, from which I can compute the cosine similarity. Am I understanding this correctly? – gmazlami Nov 23 '16 at 14:20
  • Yes, basically that is how it is done. Based on the term frequency, get the vectors, TF-IDF, and calculate the cosine similarity. Also, make sure you use stemming to normalize word forms you extracted to reduce noise. – Wiktor Stribiżew Nov 23 '16 at 14:25
  • Thanks so much for the tip. I will try this. – gmazlami Nov 23 '16 at 14:50

1 Answer


I suggest running terminology extraction first, together with the term frequencies. Note that stemming can also be applied to the extracted terms to reduce noise during the subsequent cosine similarity calculation. See the *Java library for keywords extraction from input text* SO thread for more help and ideas on that.
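As a rough sketch of that first step (the class and the `stem` placeholder are mine, not from any particular library; a real pipeline would plug in a proper term extractor and a Porter-style stemmer):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: split a document into terms and count their frequencies.
// stem() is only a lowercasing placeholder for a real stemmer.
public class TermCounter {

    static String stem(String word) {
        return word.toLowerCase(); // swap in a Porter stemmer here
    }

    static Map<String, Integer> termFrequencies(String document) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String token : document.split("\\W+")) {
            if (token.isEmpty()) continue;
            frequencies.merge(stem(token), 1, Integer::sum);
        }
        return frequencies;
    }
}
```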

Then, as you yourself mention, for each of those terms, you will have to compute the TF-IDF values, get the vectors and compute the cosine similarity.
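A minimal sketch of that step (names are mine, assuming the term-frequency maps from the snippet above and an IDF weight per term):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: weight two term-frequency maps by IDF and compute the cosine
// similarity between the resulting TF-IDF vectors.
public class TfIdfCosine {

    static double cosineSimilarity(Map<String, Integer> tfA,
                                   Map<String, Integer> tfB,
                                   Map<String, Double> idf) {
        Set<String> vocabulary = new HashSet<>(tfA.keySet());
        vocabulary.addAll(tfB.keySet());

        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (String term : vocabulary) {
            double weight = idf.getOrDefault(term, 1.0);
            double a = tfA.getOrDefault(term, 0) * weight;
            double b = tfB.getOrDefault(term, 0) * weight;
            dot += a * b;
            normA += a * a;
            normB += b * b;
        }
        return (normA == 0 || normB == 0)
                ? 0.0
                : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The result is 1.0 for identical term distributions and approaches 0.0 as the documents share fewer weighted terms.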

When calculating TF-IDF, mind that the 1 + log(N/n) formula (N standing for the total number of corpora and n for the number of corpora that include the term) is better, since it avoids the issue where TF is not 0 but IDF turns out to be 0 (which happens for a term present in every corpus).
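For concreteness, that variant is a one-liner (the method name is mine); it is what the `idf` map in the snippet above would be filled with:

```java
// Smoothed IDF: 1 + log(N / n). With plain log(N / n), a term that occurs in
// every document gets IDF = 0 and its TF contribution is wiped out of the
// vectors; the +1 keeps it.
static double idf(int totalDocs, int docsContainingTerm) {
    return 1.0 + Math.log((double) totalDocs / docsContainingTerm);
}
```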

  • Just to clarify, in log(N/n), N is the total number of documents and n is the number of documents that include the term, right? So when we do this between two documents as in the question, isn't the value always going to be either log(2/2) or log(2/1)? – Ravindu Jul 03 '18 at 06:19
  • @Ravindu Yes, true. FYI note that by *corpus* we may mean not only whole documents full of paragraphs and sentences, we may also compare single sentences, or items in the string array. Another FYI, see [this *How does TfidfVectorizer work in layman's terms* article](https://www.quora.com/How-does-TfidfVectorizer-work-in-laymans-terms). – Wiktor Stribiżew Jul 03 '18 at 07:33
  • Thanks. So my question is: how do we use tf-idf to compare two documents, since it's always going to be log(2/2) or log(2/1)? log(2/2) is 0, which means if a term is in both documents, the tf-idf is going to be just tf * (1+0) – Ravindu Jul 03 '18 at 07:47
  • @Ravindu The point here is to calculate the dot product of the TF-IDF vectors of the two documents and divide that by the product of their norms. Here is a good article on [calculating cosine similarity step by step in Python](https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/), and [here is a Java version](http://computergodzilla.blogspot.com/2013/07/how-to-calculate-tf-idf-of-document.html). – Wiktor Stribiżew Jul 03 '18 at 08:42
  • But in each of these examples they search a text against more than 2 docs. What I'm saying is: if I want to find the similarity between 2 docs, instead of just using cosine similarity only, how can we use TF-IDF? With cosine similarity we can create a vector by counting each word's occurrences in both files and run the cosine algorithm on that. But how can we use TF-IDF for this matter? P.S. Sorry if I'm making this too complex – Ravindu Jul 03 '18 at 11:08