I would like to count how many unique words are used in some text. The tricky part is that I would like to treat different forms of one word as a single word. Example:

I work.
He works.
I am working.
I have worked.

The unique words in this text are just these 5: [I, work, He, am, have], because "work", "works", "working" and "worked" are 4 different forms of the same word, work.

I guess I need a dictionary or a library for this, but after some googling I didn't find anything. Can anybody help me? Thanks!

PS: I know that some words are spelled exactly the same but have different meanings (example: When he leaves home, the leaves will cover the ground). Just ignore such cases: they are hard to cover, and they are rare enough that they can't significantly affect the result.

qkx
  • @ryekayo How is regex going to help him with "go", "goes", "went", "gone"? And OP, how are you going to deal with "When he **leaves** home, the **leaves** will cover the ground"? Duplicate or unique? – RealSkeptic Mar 26 '15 at 21:54
  • @RealSkeptic As such cases are not possible to distinguish, I can treat them as the same word; this is an acceptable "bug". It happens only rarely and won't affect the result much. All I want is to handle the most common cases of duplicated words. – qkx Mar 26 '15 at 21:59
  • I would recommend using an English [word stemmer](http://en.wikipedia.org/wiki/Stemming) and finding the unique root stems of your corpus. – Jake H Mar 26 '15 at 22:29

3 Answers

For English, you could use PorterStemmer from Lucene's distribution. The idea is to compute the stem of each word and store it in a set.

import java.util.HashSet;
import java.util.Set;

import org.tartarus.snowball.ext.PorterStemmer;

public class Test {
    public static void main(String[] args) {
        // A set keeps each distinct stem only once.
        Set<String> stems = new HashSet<>();

        PorterStemmer stemmer = new PorterStemmer();
        String[] sentences = { "I work.", "He works.",
                "I am working.", "I have worked." };
        for (String sentence : sentences) {
            // Split on runs of whitespace and dots.
            for (String word : sentence.split("[\\s\\.]+")) {
                stemmer.setCurrent(word);
                stemmer.stem();
                stems.add(stemmer.getCurrent());
            }
        }
        System.out.println(stems);
    }
}

Result:

[work, have, am, I, He]

If you decide to use Lucene, you can also start using its more advanced tokenizer functions. The example above just splits on whitespace and dot characters. Note that the words are not lower-cased first, so "He" and "he" would be counted separately.
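If you want to stay with plain Java for the splitting step, here is a minimal sketch of a slightly more tolerant tokenizer: it splits on anything that is not a letter or apostrophe and lower-cases each token before stemming. The class name UniqueStemCounter and the toyStem helper are hypothetical and not part of Lucene; toyStem is only a crude stand-in so the example runs without any dependency.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

public class UniqueStemCounter {

    // Splits text on anything that is not a letter or apostrophe, lower-cases
    // each token, stems it, and collects the distinct stems.
    static Set<String> uniqueStems(String text, UnaryOperator<String> stemmer) {
        Set<String> stems = new TreeSet<>(); // sorted, so the printed order is stable
        for (String token : text.split("[^\\p{L}']+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a separator
            }
            stems.add(stemmer.apply(token.toLowerCase(Locale.ROOT)));
        }
        return stems;
    }

    // Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    // This is NOT the Porter algorithm; it only keeps the example dependency-free.
    static String toyStem(String word) {
        for (String suffix : new String[] { "ing", "ed", "s" }) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        String text = "I work. He works. I am working. I have worked.";
        Set<String> stems = uniqueStems(text, UniqueStemCounter::toyStem);
        System.out.println(stems.size() + " unique stems: " + stems);
        // prints: 5 unique stems: [am, have, he, i, work]
    }
}
```

Because the stemmer is passed in as a function, a real one can be swapped in, for example a lambda that wraps the setCurrent, stem and getCurrent calls on PorterStemmer.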

JuniorCompressor

You need a stemming library. I haven't used one directly (only through Lucene's indexing process). There is an API that can filter the words of your text to remove all related words, as a preprocessing step before counting the frequencies.

Many implementations exist, for instance this one.

T.Gounelle

According to this page hosted by the Stanford NLP Group, you could use stemming or lemmatization to achieve what you want:

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Of all the links provided on that page, the only one that works is PorterStemmer, whose usage is explained in another answer.

For a lemmatizer, see this question here on SO, which suggests using the Stanford CoreNLP library.
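To make the difference quoted above concrete, here is a small self-contained sketch using the irregular forms of "go". Everything in it is an assumption for illustration: crudeStem is a toy suffix stripper (not the Porter algorithm), and the LEMMAS table is a hand-written stand-in for the vocabulary and morphological analysis that a real lemmatizer would use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StemVersusLemma {

    // Hypothetical, hand-written lemma table standing in for a real vocabulary.
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("goes", "go");
        LEMMAS.put("went", "go");
        LEMMAS.put("gone", "go");
        LEMMAS.put("going", "go");
    }

    // Crude suffix stripping, in the spirit of stemming (not the Porter algorithm).
    static String crudeStem(String word) {
        for (String suffix : new String[] { "ing", "es", "s" }) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Dictionary lookup, in the spirit of lemmatization; unknown words pass through.
    static String lookupLemma(String word) {
        return LEMMAS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        String[] words = { "go", "goes", "went", "gone", "going" };
        Set<String> stems = new TreeSet<>();
        Set<String> lemmas = new TreeSet<>();
        for (String w : words) {
            stems.add(crudeStem(w));
            lemmas.add(lookupLemma(w));
        }
        System.out.println("stems:  " + stems);  // prints: stems:  [go, gone, went]
        System.out.println("lemmas: " + lemmas); // prints: lemmas: [go]
    }
}
```

Suffix stripping leaves "went" and "gone" as separate words, while the dictionary lookup maps all five forms to "go". For regular verbs like "work" in the question, a stemmer is usually enough.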

fps