I want to find out whether two web pages are similar or not. Can someone suggest whether Python NLTK with the WordNet similarity functions would be helpful, and how? What is the best similarity function to use in this case?
-
NLTK could well be useful. Have a look at the (open source) O'Reilly book - it is published on nltk.org if you can't find/afford the print version. This should point you in the right direction as it covers most of what NLTK can do. – winwaed Jun 06 '11 at 13:08
-
[link to Python 2 book for convenience](http://www.nltk.org/book_1ed/) - They are currently working on a revised version for Python 3 and NLTK 3 – Ksofiac Jun 14 '17 at 20:53
-
[link to Python 3 book](http://www.nltk.org/book/) – Ksofiac Jun 15 '17 at 14:02
2 Answers
The SpotSigs paper mentioned by joyceschan addresses content-duplication detection, and it contains plenty of food for thought.
If you are looking for a quick comparison of key terms, nltk's standard functions might suffice.

With nltk you can pull synonyms of your terms by looking up the synsets contained in WordNet:
>>> from nltk.corpus import wordnet
>>> wordnet.synsets('donation')
[Synset('contribution.n.02'), Synset('contribution.n.03')]
>>> wordnet.synsets('donations')
[Synset('contribution.n.02'), Synset('contribution.n.03')]
It understands plurals, and it also tells you which part of speech each synonym corresponds to.
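For instance (a quick check; pos() is a method in recent NLTK versions), you can ask a synset for its part of speech directly, and wordnet.morphy() exposes the stemming that handles the plurals:

>>> wordnet.synsets('donation')[0].pos()
'n'
>>> wordnet.morphy('donations')
'donation'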
Synsets are organized in a tree, with more specific terms at the leaves and more general ones toward the root. A more general term is called a hypernym of the more specific one.
You can measure similarity by how close two terms are to their nearest common hypernym.
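For example, using the classic synsets from the NLTK book (output assumes a standard WordNet installation), you can walk up the tree and find the deepest ancestor two terms share:

>>> dog = wordnet.synset('dog.n.01')
>>> dog.hypernyms()
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
>>> dog.root_hypernyms()
[Synset('entity.n.01')]
>>> dog.lowest_common_hypernyms(wordnet.synset('cat.n.01'))
[Synset('carnivore.n.01')]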
Watch out for different parts of speech: according to the NLTK cookbook, their paths don't overlap, so you shouldn't try to measure similarity between them.
Say you have the two terms donation and gift. You could get them from synsets, but in this example I initialized them directly:
>>> d = wordnet.synset('donation.n.01')
>>> g = wordnet.synset('gift.n.01')
The cookbook recommends the Wu-Palmer similarity method:
>>> d.wup_similarity(g)
0.93333333333333335
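Wu-Palmer works from the depth of the deepest hypernym the two synsets share. Assuming a standard WordNet, where donation is a direct child of gift, the shared ancestor here is gift.n.01 itself, which is why the score comes out so high:

>>> d.lowest_common_hypernyms(g)
[Synset('gift.n.01')]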
This approach gives you a quick way to determine whether the terms used correspond to related concepts. Take a look at Natural Language Processing with Python to see what else you can do to help your analysis of text.
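To go from word pairs to whole sentences or pages, one rough approach (my own sketch, not from the book; it assumes NLTK 3 with the punkt, averaged_perceptron_tagger and wordnet data downloaded) is to extract the nouns from each text and average the best WordNet match for each one:

import nltk
from nltk.corpus import wordnet

def nouns(text):
    # POS-tag the tokens and keep the nouns (tags starting with NN)
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag.startswith('NN')]

def best_score(word, others):
    # Best Wu-Palmer similarity between any noun sense of `word`
    # and any noun sense of any word in `others`; 0.0 if nothing matches
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wordnet.synsets(word, pos=wordnet.NOUN)
              for other in others
              for s2 in wordnet.synsets(other, pos=wordnet.NOUN)]
    return max(scores, default=0.0)

def text_similarity(a, b):
    # Average of the best matches, taken in both directions
    # so the measure is symmetric
    na, nb = nouns(a), nouns(b)
    if not na or not nb:
        return 0.0
    ab = sum(best_score(w, nb) for w in na) / len(na)
    ba = sum(best_score(w, na) for w in nb) / len(nb)
    return (ab + ba) / 2

print(text_similarity("He made a donation to the charity",
                      "She gave a gift to the orphanage"))

For full web pages you would first strip the HTML (e.g. with BeautifulSoup) and probably weight terms by frequency, but the same pairwise idea applies.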

-
Thanks mate, that was helpful. But using this I can find similarity between a pair of words; how do I do that for sentences? – station Jun 07 '11 at 12:26
-
@user567797 no prob. This paper outlines an algorithm for measuring semantic similarity between two sentences: http://wordnetdotnet.googlecode.com/svn/trunk/Projects/Thanh/Paper/WordNetDotNet_Semantic_Similarity.pdf – AnalyticsBuilder Jun 07 '11 at 16:00