I am looking for a lemmatisation implementation for English in Java. I found a few already, but I need something that does not need too much memory to run (1 GB tops). Thanks. I do not need a stemmer.
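To be clear about the distinction I mean, here is a toy sketch (the three-entry dictionary is hypothetical, purely for illustration): a stemmer just chops suffixes, while a lemmatizer looks an inflected form up and returns its dictionary form, which is why it needs a word list.

```java
import java.util.HashMap;
import java.util.Map;

public class StemVsLemma {
    // Toy dictionary for illustration only; a real lemmatizer needs a large word list.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("better", "good"); // irregular form: suffix chopping cannot recover this
        LEMMAS.put("ran", "run");
        LEMMAS.put("dogs", "dog");
    }

    // Crude suffix chopping, roughly what a stemmer does.
    static String stem(String word) {
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Dictionary lookup, the essence of lemmatization.
    static String lemmatize(String word) {
        return LEMMAS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(stem("better") + " vs " + lemmatize("better")); // better vs good
        System.out.println(stem("dogs") + " vs " + lemmatize("dogs"));     // dog vs dog
    }
}
```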
-
Do you need true lemmatization (usually requires a decent-sized list of words), or is a stemmer like Porter, Snowball, or Paice-Husk good enough? – erickson Oct 16 '09 at 14:00
-
@erickson - Do you know of any true lemmatizer (with a word list)? I need one if there is any. – Ali Shakiba Jun 10 '11 at 13:49
-
@JohnS - The best English word list I have found is the one developed by players of the word game Scrabble, OWL2. Unfortunately, it is not "open". That, in conjunction with something like WordNet, might serve as the basis for a good lemmatizer. But I don't know of anyone who has done it. – erickson Jun 10 '11 at 17:10
-
This question appears to be off-topic because it belongs on Software Recommendations. – demongolem Jan 04 '15 at 05:44
5 Answers
The Stanford CoreNLP Java library contains a lemmatizer that is a little resource-intensive, but I have run it on my laptop with less than 512 MB of RAM.
To use it:
- Download the jar files;
- Create a new project in your editor of choice (or write an ant script) that includes all of the jar files contained in the archive you just downloaded;
- Create a new Java class as shown below (based upon the snippet from Stanford's site);
import java.util.Properties;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        // StanfordCoreNLP loads a lot of models, so you probably
        // only want to do this once per execution
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText) {
        List<String> lemmas = new LinkedList<String>();

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }
        return lemmas;
    }
}

-
I'm getting this exception when I run this code: Exception in thread "main" java.lang.NoSuchMethodError: edu.stanford.nlp.util.Generics.newHashMap()Ljava/util/Map; at edu.stanford.nlp.pipeline.AnnotatorPool.&lt;init&gt;(AnnotatorPool.java:27) at edu.stanford.nlp.pipeline.StanfordCoreNLP.getDefaultAnnotatorPool(StanfordCoreNLP.java:306) at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:250) at edu.stanford.nlp.pipeline.StanfordCoreNLP.&lt;init&gt;(StanfordCoreNLP.java:127) at edu.stanford.nlp.pipeline.StanfordCoreNLP.&lt;init&gt;(StanfordCoreNLP.java:123) ..... Any ideas? – user1028408 Apr 16 '13 at 17:27
The javadoc is here: http://www-nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/Morphology.html and this code example is easier to understand: http://sw.deri.org/2008/11/match/aroma/src/utils/StanfordPOSTagger.java. Just use the instance method Morphology#lemma. – ruhong Feb 25 '15 at 12:46
-
The output is free from any POS tags. What's the sense of POS tagging if I can't access the class of the token/word? Or is there a way with this example to get the _NN, _JJ, etc. information? – Pete Aug 26 '15 at 08:45
-
[How does this work on a single word](http://stackoverflow.com/questions/34963203/stanford-nlp-how-to-lemmatize-single-word)? – Stefan Falk Jan 23 '16 at 12:05
-
Chris's answer regarding the Stanford Lemmatizer is great! Absolutely beautiful. He even included a pointer to the jar files, so I didn't have to google for it.
But one of his lines of code had a syntax error (he somehow switched the closing parenthesis and semicolon in the line that begins with "lemmas.add..."), and he forgot to include the imports.
As far as the NoSuchMethodError goes, it's usually caused by the method in question not being public static, but if you look at the code itself (at http://grepcode.com/file/repo1.maven.org/maven2/com.guokr/stan-cn-nlp/0.0.2/edu/stanford/nlp/util/Generics.java?av=h) that is not the problem. I suspect that the problem is somewhere in the build path (I'm using Eclipse Kepler, so it was no problem configuring the 33 jar files that I use in my project).
Below is my minor correction of Chris's code, along with an example (my apologies to Evanescence for butchering their perfect lyrics):
import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");

        /*
         * This is a pipeline that takes in a string and returns various analyzed linguistic forms.
         * The String is tokenized via a tokenizer (such as PTBTokenizerAnnotator),
         * and then other sequence model style annotation can be used to add things like lemmas,
         * POS tags, and named entities. These are returned as a list of CoreLabels.
         * Other analysis components build and store parse trees, dependency graphs, etc.
         *
         * This class is designed to apply multiple Annotators to an Annotation.
         * The idea is that you first build up the pipeline by adding Annotators,
         * and then you take the objects you wish to annotate and pass them in and
         * get in return a fully annotated object.
         *
         * StanfordCoreNLP loads a lot of models, so you probably
         * only want to do this once per execution.
         */
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText) {
        List<String> lemmas = new LinkedList<String>();

        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);

        // Run all Annotators on this text
        this.pipeline.annotate(document);

        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }
        return lemmas;
    }

    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "How could you be seeing into my eyes like open doors? \n" +
                "You led me down into my core where I've became so numb \n" +
                "Without a soul my spirit's sleeping somewhere cold \n" +
                "Until you find it there and led it back home \n" +
                "You woke me up inside \n" +
                "Called my name and saved me from the dark \n" +
                "You have bidden my blood and it ran \n" +
                "Before I would become undone \n" +
                "You saved me from the nothing I've almost become \n" +
                "You were bringing me to life \n" +
                "Now that I knew what I'm without \n" +
                "You can've just left me \n" +
                "You breathed into me and made me real \n" +
                "Frozen inside without your touch \n" +
                "Without your love, darling \n" +
                "Only you are the life among the dead \n" +
                "I've been living a lie, there's nothing inside \n" +
                "You were bringing me to life.";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }
}
Here are my results (I was very impressed; it caught "'s" as "is" (sometimes), and did almost everything else perfectly):
Starting Stanford Lemmatizer
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].
Adding annotator lemma
[how, could, you, be, see, into, my, eye, like, open, door, ?, you, lead, I, down, into, my, core, where, I, have, become, so, numb, without, a, soul, my, spirit, 's, sleep, somewhere, cold, until, you, find, it, there, and, lead, it, back, home, you, wake, I, up, inside, call, my, name, and, save, I, from, the, dark, you, have, bid, my, blood, and, it, run, before, I, would, become, undo, you, save, I, from, the, nothing, I, have, almost, become, you, be, bring, I, to, life, now, that, I, know, what, I, be, without, you, can, have, just, leave, I, you, breathe, into, I, and, make, I, real, frozen, inside, without, you, touch, without, you, love, ,, darling, only, you, be, the, life, among, the, dead, I, have, be, live, a, lie, ,, there, be, nothing, inside, you, be, bring, I, to, life, .]

You can try the free Lemmatizer API here: http://twinword.com/lemmatizer.php
Scroll down to find the Lemmatizer endpoint.
This will turn "dogs" into "dog" and "abilities" into "ability".
If you pass in a POST or GET parameter called "text" with a string like "walked plants":
// These code snippets use an open-source library. http://unirest.io/java
HttpResponse<JsonNode> response = Unirest.post("[ENDPOINT URL]")
    .header("X-Mashape-Key", "[API KEY]")
    .header("Content-Type", "application/x-www-form-urlencoded")
    .header("Accept", "application/json")
    .field("text", "walked plants")
    .asJson();
You get a response like this:
{
  "lemma": {
    "plant": 1,
    "walk": 1
  },
  "result_code": "200",
  "result_msg": "Success"
}
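In case it helps, here is a minimal way to pull the lemma counts out of a response of that shape using only the JDK. This is just a sketch that assumes the flat structure shown above; a real client would use a proper JSON library such as Jackson or Gson.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LemmaResponseParser {
    // Extracts "word": count pairs, i.e. the entries of the "lemma" object.
    // Only numeric values match, so "result_code": "200" (a quoted value) is skipped.
    static Map<String, Integer> parseLemmas(String json) {
        Map<String, Integer> lemmas = new LinkedHashMap<>();
        Matcher m = Pattern.compile("\"(\\w+)\"\\s*:\\s*(\\d+)").matcher(json);
        while (m.find()) {
            lemmas.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return lemmas;
    }

    public static void main(String[] args) {
        String body = "{ \"lemma\": { \"plant\": 1, \"walk\": 1 }, "
                + "\"result_code\": \"200\", \"result_msg\": \"Success\" }";
        System.out.println(parseLemmas(body)); // {plant=1, walk=1}
    }
}
```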

-
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Daniel Kelley Apr 17 '15 at 15:43
-
Thanks for the suggestion. I've included documentation on how to use it. – Joseph Shih Apr 20 '15 at 09:33
There is a JNI binding for Hunspell, the spell checker used in OpenOffice and Firefox: http://hunspell.sourceforge.net/

-
It isn't only a spell checker; you can use it for lemmatization. In the newer versions of Lucene we can also find a hunspell filter. – András Jan 10 '15 at 09:25
Check out Lucene Snowball.
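Note that Snowball is a stemmer rather than a lemmatizer: it rewrites suffixes by rule instead of looking words up. As a toy illustration of the kind of rule it applies, here is only step 1a of the Porter algorithm (the real Snowball stemmers contain many more rules than this):

```java
public class ToyStemmer {
    // Porter stemmer step 1a only, for illustration. Rules are tried in order:
    // SSES -> SS, IES -> I, SS -> SS (unchanged), S -> "".
    static String step1a(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2); // caresses -> caress
        if (w.endsWith("ies"))  return w.substring(0, w.length() - 2); // ponies   -> poni
        if (w.endsWith("ss"))   return w;                              // caress   -> caress
        if (w.endsWith("s"))    return w.substring(0, w.length() - 1); // cats     -> cat
        return w;
    }

    public static void main(String[] args) {
        System.out.println(step1a("caresses")); // caress
        System.out.println(step1a("ponies"));   // poni
        System.out.println(step1a("cats"));     // cat
    }
}
```

The "poni" output shows why a stemmer is not a lemmatizer: the result need not be a dictionary word, which is exactly what the question wants to avoid.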

-
For those who still want a stemmer anyway, I'll leave a valid link here: https://github.com/master/spark-stemming – Luís Henriques Jun 28 '18 at 09:17