I am looking for a utility (library) that will take in a collection of keywords (maybe 20: for instance, the results of an LDA run on a text corpus) and return a short (2-5 word) description of what best ties the original collection together. Such a utility might work by looking up the synonyms of each keyword (say, using WordNet), adding to them the synonyms of those synonyms, and then finding the short phrase that represents the biggest overlap (perhaps in a K-means sense). Does anybody know of such a utility?
1 Answer
If we deal with WordNet and individual words, then maybe what you are looking for is the lowest common hypernym, that is, the most specific concept such that all your words are special cases of it.
Based on the answer to "Find lowest common hypernym given multiple words in WordNet (Python)", we can write a function that looks for the lowest common hypernym as follows:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def find_common(words):
    # Map each word to the set of all hypernyms of all of its senses
    all_hypernyms = {}
    for word in words:
        synsets = wn.synsets(word)
        if not synsets:
            print(f'word "{word}" has no synsets, skipping it')
            continue
        # _iter_hypernym_lists() is a private NLTK method; it yields the synset
        # itself first, then its hypernyms level by level up to the root
        all_hypernyms[word] = set(
            self_synset
            for synset in synsets
            for self_synsets in synset._iter_hypernym_lists()
            for self_synset in self_synsets
        )
    if not all_hypernyms:
        print("No valid words to calculate hypernyms")
        return
    common_hypernyms = set.intersection(*all_hypernyms.values())
    if not common_hypernyms:
        print("The words have no common hypernyms")
        return
    # The deepest common hypernym is the most specific one
    ordered_hypernyms = sorted(common_hypernyms, key=lambda x: -x.max_depth())
    return ordered_hypernyms[0]
Then you can use this function to find the lowest common hypernym for a set of words (if there is one):
result = find_common(['cat', 'dog', 'mouse', 'wtf'])
# word "wtf" has no synsets, skipping it
print(result.lemma_names()[0])
# placental
print(result.definition())
# mammals having a placenta; all mammals except monotremes and marsupials
result = find_common(['house', 'cathedral', 'castle'])
print(result.lemma_names()[0])
# building
print(result.definition())
# a structure that has a roof and walls and stands more or less permanently in one place
Of course, this will break down if we add a word that has no close relation to the other words in the set: the common hypernym then becomes something very general, or there is none at all. You can handle such outliers by performing something like agglomerative clustering over your words (where the distance between two words is the length of the shortest WordNet path between them) to find the subset of words that fit together well.
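A minimal sketch of that clustering idea, in case it helps. The names `wordnet_distance`, `cluster_words` and the `max_distance` threshold are my own placeholders, not part of NLTK; I assume noun senses only and average linkage, and the threshold would need tuning on your data. The clustering itself is plain Python, so any distance function can be plugged in.

```python
from itertools import combinations


def wordnet_distance(word_a, word_b):
    """Shortest WordNet path between any two noun senses (inf if there is none)."""
    # Imported lazily so the clustering code below also works without NLTK.
    from nltk.corpus import wordnet as wn

    distances = [
        d
        for syn_a in wn.synsets(word_a, pos=wn.NOUN)
        for syn_b in wn.synsets(word_b, pos=wn.NOUN)
        if (d := syn_a.shortest_path_distance(syn_b)) is not None
    ]
    return min(distances, default=float("inf"))


def cluster_words(words, distance, max_distance):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops as soon as the
    closest pair is farther apart than max_distance.
    """
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    clusters = [[w] for w in words]
    # Compute every pairwise word distance once
    pair = {frozenset((a, b)): distance(a, b) for a, b in combinations(words, 2)}

    def linkage(c1, c2):
        # Average distance over all cross-cluster word pairs
        total = sum(pair[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters
```

With NLTK and the WordNet corpus installed, you could call `cluster_words(words, wordnet_distance, max_distance=...)`, take the largest cluster with `max(clusters, key=len)`, and pass it to `find_common` above, so that an unrelated word ends up in its own cluster instead of dragging the common hypernym up to something very general.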
