I am working on a project to find the similarity between two sentences/documents using the tf-idf measure.

I tried the following sample code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity  

documents = (
"The sky is blue",
"The sun is bright"
)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
cosine = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(cosine)

and the similarity between the two sentences is

[[ 1.          0.33609693]]

Now my question is: how can I show the similarity in a graphical format? Something like a Venn diagram, where the size of the intersection represents the similarity measure, or any other plot available in matplotlib or another Python library.

Thanks in advance.

Coder 477

1 Answer

The simplest approach to a Venn diagram is to draw two circles of radius r whose centers are a distance d = 2 * r * (1 - cosine[0][i]) apart, where i is the index of the sentence you are comparing against. If the sentences are identical, the similarity is 1 and d == 0, i.e. both circles coincide. If the two sentences have nothing in common, the similarity is 0 and d == 2*r, so the circles no longer overlap (they only touch at one point).

Code for drawing circles is already available on Stack Overflow.
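As a quick check of that formula, here is a minimal sketch that maps a similarity value to the distance between the centers. The helper name `center_distance` is made up for illustration, and the sketch assumes the similarity lies in [0, 1], which holds for tf-idf vectors because they have no negative entries.

```python
def center_distance(similarity, r=1.0):
    """Distance between the two circle centers for a given cosine similarity.

    Hypothetical helper: assumes 0 <= similarity <= 1 and radius r > 0.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 2 * r * (1 - similarity)


if __name__ == "__main__":
    # Identical sentences, the example from the question, nothing in common
    for s in (1.0, 0.33609693, 0.0):
        print(s, center_distance(s))
```

The distance shrinks linearly from 2*r (circles just touching) to 0 (circles coinciding) as the similarity grows from 0 to 1.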

EDIT: This approach draws a Venn diagram from the output of your code:

## import matplotlib for plotting the Venn diagram
import matplotlib.pyplot as plt

## output of your first part
cosine = [[1., 0.33609693]]

## set constants: circle radius and distance between the centers
r = 1
d = 2 * r * (1 - cosine[0][1])

## draw circles
fig, ax = plt.subplots()
circle1 = plt.Circle((0, 0), r, alpha=.5)
circle2 = plt.Circle((d, 0), r, alpha=.5)
ax.add_patch(circle1)
ax.add_patch(circle2)
## set axis limits and use an equal aspect ratio so the circles stay round
ax.set_xlim([-1.1 * r, 1.1 * r + d])
ax.set_ylim([-1.1 * r, 1.1 * r])
ax.set_aspect('equal')
## hide axes if you like
# ax.get_xaxis().set_visible(False)
# ax.get_yaxis().set_visible(False)
fig.savefig('venn_diagramm.png')

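Setting the alpha value when drawing the circles makes them semitransparent. Where the circles overlap, two semitransparent layers are stacked, so the overlap appears more opaque than the rest of the circles (with alpha=.5 the combined opacity is 1 - 0.5 * 0.5 = 0.75).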
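One caveat worth keeping in mind: with this construction the distance between the centers is linear in the similarity, but the overlapping area is not. The sketch below uses the standard formula for the intersection area of two equal circles to show which fraction of one circle is covered; the function name `overlap_fraction` is hypothetical and not part of the code above.

```python
import math


def overlap_fraction(similarity, r=1.0):
    """Fraction of one circle's area covered by the other circle.

    Assumes two circles of equal radius r whose centers are
    d = 2 * r * (1 - similarity) apart, with similarity in [0, 1].
    """
    s = min(max(similarity, 0.0), 1.0)  # clamp against rounding noise
    d = 2 * r * (1 - s)
    # Lens area of two equal circles at center distance d (0 <= d <= 2r)
    lens = 2 * r**2 * math.acos(d / (2 * r)) - (d / 2) * math.sqrt(4 * r**2 - d**2)
    return lens / (math.pi * r**2)


if __name__ == "__main__":
    for s in (0.0, 0.33609693, 0.5, 1.0):
        print(s, round(overlap_fraction(s), 3))
```

For intermediate similarities the covered fraction comes out smaller than the similarity itself, so the picture is best read as a qualitative illustration rather than an exact area encoding.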

Community
jkalden
  • What should the radius of the circles be? There should be two circles, so should both have the same radius? And how are the centres of the circles determined? – Coder 477 Dec 23 '14 at 11:43
  • Both are your choice! If you choose (0,0) for the first circle, you'll have (d,0) or (0,d) as center for the second. If you have no idea for r, set it to 1. – jkalden Dec 23 '14 at 11:45
  • Then how does the d value help in showing the intersection? Could you explain with data or a code sample? – Coder 477 Dec 23 '14 at 11:47
  • I am trying to give you some hints; you are to code it. Did you follow the accepted answer in the linked question? – jkalden Dec 23 '14 at 11:54
  • How the d value helps in showing the intersection is still not clear. If d becomes 0, a circle cannot be plotted, so how can two circles be shown? – Coder 477 Dec 23 '14 at 12:06
  • See the added code. If you want to get a plot window instead of saving the plot to file, replace the last line by `plt.show()`. – jkalden Dec 23 '14 at 13:04
  • If the solution is fine for you, please [mark the answer as accepted](http://stackoverflow.com/help/someone-answers) for other users to see there is a solution. Thank you! – jkalden Oct 26 '15 at 08:51