I have a set of multi-word phrases that I want to group into categories, as in the example below.

Example:

adaptive and intelligent educational system
adaptive and intelligent tutoring system
adaptive educational system

A human can easily see that the three phrases above belong in one category.

Is there an easy way to do this automatically?

Currently, I am using the affinity propagation clustering algorithm with Levenshtein distance, as follows.

import numpy as np
import sklearn.cluster
import distance  # third-party package that provides distance.levenshtein

# `words` is the list of phrases to cluster, defined elsewhere
words = np.asarray(words)  # so that indexing with a list will work
# Affinity propagation expects similarities, so the edit distances are negated
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

However, this does not produce the desired output. Could you suggest a more suitable approach?

  • See here: https://stackoverflow.com/questions/62328/is-there-an-algorithm-that-tells-the-semantic-similarity-of-two-phrases/43213509 – polm23 Aug 08 '17 at 07:28
  • Thanks a lot. It's very useful. –  Aug 11 '17 at 03:19

1 Answer

Levenshtein distance works on characters.

From this point of view, "educational" and "tutoring" are about as different as possible.
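
To make this concrete, here is a minimal pure-Python sketch of a character-level edit distance. The helper name `levenshtein` is chosen for illustration and is not the `distance` package's own implementation. It treats both words as plain character sequences, so nothing about their meaning enters the computation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    # prev[j] is the distance between the processed prefix of `a` and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


d = levenshtein("educational", "tutoring")
longest = max(len("educational"), len("tutoring"))
# The distance can never exceed the length of the longer word
print(f"{d} edits, upper bound {longest}")
```

The closer the printed distance is to the upper bound, the less the two words have in common as strings, regardless of how related they are in meaning.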

If you want to cluster by semantic similarity, don't use character-level similarity.

Unfortunately, semantic similarity is quite hard. You will need some large knowledge base. For example, you could use the entire World Wide Web to learn that "tutoring" and "educational" are related, or you could try a lexical resource such as WordNet.
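
As a rough illustration of the idea, the sketch below compares phrases word by word and consults a lookup table for related words. The table `RELATED` is a hand-written, hypothetical stand-in for a real knowledge base such as WordNet. The function names and the choice of Jaccard overlap are also assumptions made for this example, not part of any library.

```python
# Hypothetical toy "knowledge base": maps a word to a canonical concept.
# A real system would derive such relations from WordNet or a large corpus.
RELATED = {
    "tutoring": "educational",
}


def concepts(phrase: str) -> set:
    """Split a phrase into words and map each word to its canonical concept."""
    return {RELATED.get(word, word) for word in phrase.lower().split()}


def phrase_similarity(p1: str, p2: str) -> float:
    """Jaccard overlap of the concept sets: 1.0 is identical, 0.0 is disjoint."""
    c1, c2 = concepts(p1), concepts(p2)
    if not c1 and not c2:
        return 1.0
    return len(c1 & c2) / len(c1 | c2)


phrases = [
    "adaptive and intelligent educational system",
    "adaptive and intelligent tutoring system",
    "adaptive educational system",
]
# Print the similarity of every pair of phrases
for i, p1 in enumerate(phrases):
    for p2 in phrases[i + 1:]:
        print(f"{phrase_similarity(p1, p2):.2f}  {p1!r} vs {p2!r}")
```

A matrix of such similarities could, in principle, be passed to `AffinityPropagation(affinity="precomputed")` in place of the negated Levenshtein distances. The quality of the result then depends entirely on how good the knowledge base is.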

Has QUIT--Anony-Mousse