
I'm looking for a Java library (or something similar) to stem Italian text.

The goal is to compare Italian words. At the moment, words like "attacco", "attacchi", "attaccare", etc. are treated as different, whereas I want the comparison to return true.

I found things like Lucene and snowball.tartarus.org. Is there anything else useful, or how can I use these from Java?

Thanks for answers.

Tony Rad
Schiawo
  • You might have better luck with the Linguist's stack exchange. A bunch of linguists are now "Computational Linguists" and have a good knowledge of linguist library stuff. SOURCE: I work with a bunch of computational linguists. – thatidiotguy Nov 14 '12 at 14:47

1 Answer


Download the Java version of Snowball from snowball.tartarus.org (the site mentioned in the question).

It includes a class named org.tartarus.snowball.ext.italianStemmer which extends SnowballStemmer.

To see how to use a SnowballStemmer, take a look at the following test code, which stems the present-tense forms of the verb "attaccare":

import org.junit.Assert;
import org.junit.Test;
import org.tartarus.snowball.SnowballStemmer;
import org.tartarus.snowball.ext.italianStemmer;

public class SnowballItalianStemmerTest {

    @Test
    public void testSnowballItalianStemmerAttaccare() {
        // italianStemmer extends SnowballStemmer, so no cast is needed.
        SnowballStemmer stemmer = new italianStemmer();

        // Present-tense forms of "attaccare"; each should reduce to the same stem.
        String[] tokens = "attacco attacchi attacca attacchiamo attaccate attaccano".split(" ");
        for (String token : tokens) {
            stemmer.setCurrent(token);              // load the word to stem
            stemmer.stem();                         // stem it in place
            String stemmed = stemmer.getCurrent();  // read back the result
            Assert.assertEquals("attacc", stemmed);
            System.out.println(stemmed);
        }
    }
}

Output:

attacc
attacc
attacc
attacc
attacc
attacc

For another usage example, see TestApp.java, included in the same tgz file.

Lucene, which is written in Java, also uses Snowball for stemming, for example in its SnowballFilter.
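
Since the original goal was to compare words rather than just stem them, here is a minimal sketch of how a comparison could be built on top of a stemmer. The class StemComparison and its methods sameStem and stemAll are hypothetical names, not part of Snowball or Lucene. The stemmer is passed in as a plain function so the sketch does not depend on any particular library, and the lowercasing step assumes the stemmer expects lowercase input.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

/**
 * Hypothetical helper (not part of Snowball or Lucene): compares words
 * by their stems instead of by their surface forms.
 */
final class StemComparison {

    // Any function that maps a word to its stem. With Snowball on the
    // classpath this could wrap the setCurrent / stem / getCurrent calls
    // on an italianStemmer instance.
    private final UnaryOperator<String> stemmer;

    StemComparison(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    /** Normalizes a word (trim + lowercase) and returns its stem. */
    String stem(String word) {
        // Assumption: the underlying stemmer expects lowercase input.
        return stemmer.apply(word.trim().toLowerCase(Locale.ITALIAN));
    }

    /** Returns true when both words reduce to the same stem. */
    boolean sameStem(String first, String second) {
        return stem(first).equals(stem(second));
    }

    /** Stems every word of the input array into a new array. */
    String[] stemAll(String[] words) {
        String[] stems = new String[words.length];
        for (int i = 0; i < words.length; i++) {
            stems[i] = stem(words[i]);
        }
        return stems;
    }
}
```

To plug in Snowball, the function passed to the constructor would be a lambda that calls setCurrent(word), stem() and getCurrent() on a single italianStemmer, in the same way as the test above. With that in place, sameStem("attacco", "attacchi") should return true if the stemmer produces the stems shown in the output above.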

Tony Rad
  • Is there a way to provide a long list of stemmed words, such as a dictionary, in order to compare words? For example: for(String string : AssertList){Assert.assertEquals(string,stemmed)} – Schiawo Nov 15 '12 at 11:05
  • Are you looking for a strategy to test the stemmer? – Tony Rad Nov 15 '12 at 14:29
  • No, it doesn't matter. The problem was that I don't know in advance which words are in each array. Instead, I did what I was trying to do with this code, without the assert: 'private String[] stemmed(String[] x){ SnowballStemmer stemmer = (SnowballStemmer) new italianStemmer(); String[] s= new String[x.length]; for (int i=0;i – Schiawo Nov 15 '12 at 15:58
  • It looks good. The Assert was used only to test drive the stemmer class, and show its behavior. – Tony Rad Nov 15 '12 at 16:10