The short answer is that a standard NLP library or toolkit is unlikely to solve this problem out of the box. Most libraries, Stanford NLP included, only provide a mapping from word to lemma. Note that this is a many-to-one function, so its inverse is not well-defined in the space of words. It is, however, well-defined as a function from the space of words to the space of sets of words (i.e., it's a one-to-many mapping in word space).
Without some form of explicit mapping being maintained, it is impossible to generate all the variants from a given lemma. This is a theoretical impossibility, not a limitation of any particular tool, because lemmatization is a lossy, one-way function.
You can, however, build a mapping of lemma to set-of-words yourself without much coding (and certainly without implementing a new algorithm):
import java.util.*;                   // HashMap, Set
import com.google.common.collect.*;   // Guava's Multimap, HashMultimap

// Plain Java: each lemma maps to the set of surface forms seen so far
Map<String, Set<String>> inverseLemmaMap = new HashMap<>();

// Or, with Guava, a Multimap manages the per-key sets for you
Multimap<String, String> inverseLemmaMap = HashMultimap.create();
Then, as you annotate your corpus using Stanford NLP, you can obtain the lemma and its corresponding token, and populate the above map (or multimap). This way, after a single pass over your dataset, you will have the required inverse lemmatization.
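Populating the map amounts to one `computeIfAbsent` call per token. Here is a minimal, self-contained sketch; the hard-coded (token, lemma) pairs stand in for the annotator's output, and the class and method names are my own:

```java
import java.util.*;

// Sketch: build the inverse lemma map from (token, lemma) pairs.
// In a real pipeline the pairs come from the annotator's lemma
// annotation; here they are hard-coded for illustration.
public class InverseLemma {

    public static Map<String, Set<String>> build(String[][] tokenLemmaPairs) {
        Map<String, Set<String>> inverseLemmaMap = new HashMap<>();
        for (String[] pair : tokenLemmaPairs) {
            String token = pair[0];
            String lemma = pair[1];
            // Create the set for this lemma on first sight, then add the token
            inverseLemmaMap
                    .computeIfAbsent(lemma, k -> new HashSet<>())
                    .add(token);
        }
        return inverseLemmaMap;
    }

    public static void main(String[] args) {
        String[][] pairs = {
                {"running", "run"}, {"ran", "run"}, {"runs", "run"},
                {"was", "be"}, {"beginning", "begin"}
        };
        Map<String, Set<String>> inv = build(pairs);
        System.out.println(inv.get("run")); // the set {running, ran, runs}
    }
}
```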
Note that this will be restricted to the corpus/dataset you are using, and not all words in the English language will be included.
Another note is that people often assume there is a one-to-one correspondence between an inflected form and its part of speech. This is incorrect:
String s = "My running was beginning to hurt me. I was running all day."
The first instance of running is tagged NN (a gerund acting as a noun), while the second instance is part of the past continuous tense of the verb and is tagged VBG. This is what I meant by "lossy, one-way function" earlier in my answer.
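If that distinction matters for your application, one option is to key the inverse map by a (lemma, POS tag) pair rather than by lemma alone, so the noun and verb readings of running stay separate. A sketch under the same assumptions as before (hand-tagged triples standing in for the tagger's output; names are my own):

```java
import java.util.*;

// Sketch: key the inverse map by "lemma/POS" so that surface forms
// sharing a spelling but not a part of speech are kept apart.
// The triples below are hand-tagged for illustration; in practice
// they come from the POS tagger and lemmatizer.
public class InverseLemmaByPos {

    public static Map<String, Set<String>> build(String[][] triples) {
        Map<String, Set<String>> byLemmaAndPos = new HashMap<>();
        for (String[] t : triples) {               // {form, lemma, POS}
            byLemmaAndPos
                    .computeIfAbsent(t[1] + "/" + t[2], k -> new TreeSet<>())
                    .add(t[0]);
        }
        return byLemmaAndPos;
    }

    public static void main(String[] args) {
        String[][] annotated = {
                {"running", "running", "NN"},  // "My running was beginning..."
                {"running", "run", "VBG"},     // "I was running all day."
                {"ran", "run", "VBD"}
        };
        System.out.println(build(annotated).get("run/VBG")); // [running]
    }
}
```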