I.e., "college", "schoolwork", and "academy" belong in the same cluster, and the words "essay", "scholarships", and "money" also belong in the same cluster. Is this an ML or an NLP problem?
Those words are _related_, not similar. – SLaks Jan 03 '13 at 23:33
Is this a *scientific* or a *philosophical* question? – tripleee Jan 04 '13 at 06:04
Seriously, though, the discipline of Natural Language Processing commonly uses a number of techniques, one of which is Machine Learning. Without a model rooted in some sort of NLP theory, what features would you be able to use to tackle this as an ML problem? – tripleee Jan 04 '13 at 06:09
5 Answers
It depends on how strict your definition of similar is.
Machine Learning Techniques
As others have pointed out, you can use something like latent semantic analysis or the related latent Dirichlet allocation.
Semantic Similarity and WordNet
As was pointed out, you may wish to use an existing resource for something like this.
Many research papers (example) use the term semantic similarity. The basic idea is usually to compute the distance between two words in a graph, where a word is a child of a parent if it is a type of it. Example: "songbird" would be a child of "bird". Semantic similarity can be used as a distance metric for creating clusters, if you wish.
Example Implementation
In addition, if you put a threshold on the value of some semantic similarity measure, you can get a boolean True or False. Here is a Gist I created (word_similarity.py) that uses NLTK's corpus reader for WordNet. Hopefully that points you in the right direction, and gives you a few more search terms.
def sim(word1, word2, lch_threshold=2.15, verbose=False):
    """Determine if two (already lemmatized) words are similar or not.

    Call with verbose=True to print the WordNet senses from each word
    that are considered similar.

    The documentation for the NLTK WordNet Interface is available here:
    http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
    """
    from nltk.corpus import wordnet as wn
    results = []
    for net1 in wn.synsets(word1):
        for net2 in wn.synsets(word2):
            try:
                lch = net1.lch_similarity(net2)
            except Exception:
                # lch_similarity fails when the two synsets have
                # different parts of speech; skip those pairs.
                continue
            # The value to compare the LCH to was found empirically.
            # (The value is very application dependent. Experiment!)
            if lch is not None and lch >= lch_threshold:
                results.append((net1, net2))
    if not results:
        return False
    if verbose:
        for net1, net2 in results:
            print(net1)
            print(net1.definition())
            print(net2)
            print(net2.definition())
            print('path similarity:')
            print(net1.path_similarity(net2))
            print('lch similarity:')
            print(net1.lch_similarity(net2))
            print('wup similarity:')
            print(net1.wup_similarity(net2))
            print('-' * 79)
    return True
Example output
>>> sim('college', 'academy')
True
>>> sim('essay', 'schoolwork')
False
>>> sim('essay', 'schoolwork', lch_threshold=1.5)
True
>>> sim('human', 'man')
True
>>> sim('human', 'car')
False
>>> sim('fare', 'food')
True
>>> sim('fare', 'food', verbose=True)
Synset('fare.n.04')
the food and drink that are regularly served or consumed
Synset('food.n.01')
any substance that can be metabolized by an animal to give energy and build tissue
path similarity:
0.5
lch similarity:
2.94443897917
wup similarity:
0.909090909091
-------------------------------------------------------------------------------
True
>>> sim('bird', 'songbird', verbose=True)
Synset('bird.n.01')
warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings
Synset('songbird.n.01')
any bird having a musical call
path similarity:
0.25
lch similarity:
2.25129179861
wup similarity:
0.869565217391
-------------------------------------------------------------------------------
True
>>> sim('happen', 'cause', verbose=True)
Synset('happen.v.01')
come to pass
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
Synset('find.v.01')
come upon, as if by accident; meet with
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
True

@WesleyBaugh Any pointers on how much code you executed before arriving at that number in the threshold? – John Strood Sep 11 '18 at 11:54
In addition, this code tells you that all the similarity measures mentioned here don't distinguish between "similarity" and "relatedness" of two words. For instance, you'll find this code returns True for "love" and "hate". They are clearly antonyms but are still related (by a concept of a "feeling"). More here: https://linguistics.stackexchange.com/questions/9084/what-do-wordnetsimilarity-scores-mean – John Strood Sep 14 '18 at 12:46
If you have a sizable collection of documents related to the topic of interest, you might want to look at Latent Dirichlet Allocation. LDA is a fairly standard NLP technique that automatically clusters words into topics, where similarity between words is determined by collocation in the same document (you can treat a single sentence as a document if that serves your needs better).
You'll find a number of LDA toolkits available. We'd need more detail on your exact problem before recommending one over another. I'm not enough of an expert to make that recommendation anyway, but I can at least suggest you look at LDA.
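LDA itself is best left to a toolkit, but the collocation signal it builds on is easy to illustrate. The sketch below (my own addition, not LDA) counts how often two words appear in the same document and uses that count as a crude relatedness score; the documents are invented for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(documents):
    """Count, for every unordered word pair, the number of
    documents in which both words appear together."""
    counts = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts

docs = [
    "the essay is due with the rest of the schoolwork",
    "an essay on scholarships and money",
    "the college and the academy",
    "college schoolwork and the essay",
]
counts = cooccurrence(docs)
print(counts[("essay", "schoolwork")])  # 2
print(counts[("academy", "college")])   # 1
```

LDA goes much further — it fits a probabilistic topic model rather than raw counts — but the input signal is this same per-document collocation.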

The famous quote regarding your question is by John Rupert Firth in 1957:
You shall know a word by the company it keeps
To start delving into this topic you can look into this presentation.

Word2Vec can play a role in finding similar words (contextually/semantically). In word2vec, words are represented as vectors in an n-dimensional space, so you can calculate the distance between words (Euclidean distance, or more commonly cosine similarity) or simply form clusters.
After this, we can come up with some numerical value for the similarity between two words.
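As a quick illustration of the distance idea (using made-up 3-dimensional toy vectors, not real word2vec embeddings, which typically have 100+ dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means the
    same direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm

# Toy "embeddings" invented for this example.
vectors = {
    "college": [0.9, 0.8, 0.1],
    "academy": [0.8, 0.9, 0.2],
    "car":     [0.1, 0.2, 0.9],
}

print(cosine_similarity(vectors["college"], vectors["academy"]))
print(cosine_similarity(vectors["college"], vectors["car"]))
```

With vectors trained by word2vec, similar words end up close together, so the first score would be much higher than the second — exactly the property you need for clustering.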
