Extract Nouns from Text (Java)

Question

Does anyone know the easiest way to extract only nouns from a body of text?

I've heard about the TreeTagger tool and I tried giving it a shot but couldn't get it to work for some reason.

Any suggestions?

Thanks, Phil

EDIT:

import org.annolab.tt4j.*;

TreeTaggerWrapper tt = new TreeTaggerWrapper();
try {
    tt.setModel("/Nouns/english.par");
    tt.setHandler(new TokenHandler() {
        void token(String token, String pos, String lemma) {
            System.out.println(token + "\t" + pos + "\t" + lemma);
        }
    });
    tt.process(words); // words = list of words
} finally {
    tt.destroy();
}

That is my code; English is the language. I was getting the error: "The type new TokenHandler(){} must implement the inherited abstract method TokenHandler.token". Am I doing something wrong?

Could you specify your problem? Especially the language would be nice to know... German for example has the nice advantage that all nouns have the first letter capitalized. — Chris, Dec 11 '09 at 18:00
I'm not familiar with TreeTagger API but I would start by instantiating TokenHandler outside the setHandler() - that might give a clearer message. My guess is that TokenHandler is abstract but ... — peter.murray.rust, Dec 11 '09 at 18:27
See also: http://stackoverflow.com/questions/608743/strategies-for-recognizing-proper-nouns-in-nlp. That relates to proper nouns. — peter.murray.rust, Dec 11 '09 at 19:19

peter.murray.rust · Accepted Answer · 2009-12-11T18:30:54.633

First you will have to tokenize your text. This may seem trivial (split at any whitespace may work for you) but formally it is harder. Then you have to decide what is a noun. Does "the car park" contain one noun (car park), two nouns (car, park) or one noun (park) and one adjective (car)? This is a hard problem, but again you may be able to get by without it.
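To make the tokenizing step concrete, here is a minimal sketch using only the JDK. It compares a plain whitespace split with `java.text.BreakIterator`; the class and method names are made up for illustration, and neither approach settles the harder "car park" question.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: two simple ways to tokenize English text.
final class TokenizeSketch {

    // Naive approach: split on runs of whitespace. Punctuation stays glued to words.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Let the JDK find word boundaries, then keep only segments
    // that contain at least one letter or digit.
    static List<String> splitOnWordBoundaries(String text) {
        List<String> tokens = new ArrayList<String>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (containsLetterOrDigit(segment)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    private static boolean containsLetterOrDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetterOrDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "I saw the car park.";
        System.out.println(splitOnWhitespace(text));     // last token keeps the full stop
        System.out.println(splitOnWordBoundaries(text)); // full stop is dropped
    }
}
```

The whitespace split leaves "park." with its full stop attached, which is usually enough to make a dictionary or tagger lookup miss the word.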

Does "I saw the xyzzy" identify a noun not in a dictionary? The word "the" probably identifies xyzzy as a noun.

Where are the nouns in "time flies like an arrow"? Compare with "fruit flies like a banana" (thanks to Groucho Marx).

We use the Brown tagger (Java) (http://en.wikipedia.org/wiki/Brown_Corpus) in the OpenNLP toolkit (opennlp.tools.lang.english.PosTagger; opennlp.tools.postag.POSDictionary on http://opennlp.sourceforge.net/) to find nouns in normal English and I'd recommend starting with that - it does most of your thinking for you. Otherwise look at any of the POSTaggers (http://en.wikipedia.org/wiki/POS_tagger) or (http://www-nlp.stanford.edu/links/statnlp.html#Taggers).

In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English, for example, NN for singular common nouns, NNS for plural common nouns, NP for singular proper nouns (see the POS tags used in the Brown Corpus)
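Once a tagger has produced one tag per token, keeping only the nouns is a simple filter. The sketch below assumes Brown-style tags as quoted above (NN, NNS, NP) and treats any tag starting with "NN" or "NP" as a noun; that prefix rule, the class name and the hand-written tags are assumptions for illustration, so check them against the tagset your tagger actually emits.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: pick the nouns out of already-tagged tokens.
final class NounFilterSketch {

    // Assumption: noun tags start with "NN" (common nouns) or "NP" (proper nouns).
    static boolean isNounTag(String tag) {
        return tag.startsWith("NN") || tag.startsWith("NP");
    }

    // tokens[i] is assumed to carry the tag tags[i], as a tagger would return them.
    static List<String> nouns(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < tokens.length; i++) {
            if (isNounTag(tags[i])) {
                result.add(tokens[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tags written by hand for the example, not produced by a real tagger.
        String[] tokens = {"Phil", "saw", "the", "cars"};
        String[] tags = {"NP", "VBD", "AT", "NNS"};
        System.out.println(nouns(tokens, tags));
    }
}
```

The same filter applies whichever tagger produces the tags; only the prefix test should need adjusting.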

There is a very full list of NLP toolkits in http://en.wikipedia.org/wiki/Natural_language_processing_toolkits. I would strongly suggest you use one of those rather than trying to match against Wordnet or other collections.

+1 for the explanations. Some people seem to think that NLP isn't all that hard, when it actually is one of the most complex things in computing. There is a tremendous number of corner cases and everything will be useless when suddenly the language to process changes. And then, on a more theoretical level, you also have the problem that there is more than one definition for noun, or verb, or pronoun etc. — Maximilian Mayerl, Dec 11 '09 at 18:14
@Maximilian thanks for the support. We agree that it's hard. Luckily we are only trying to interpret language written by chemists and that's a good deal easier! — peter.murray.rust, Dec 11 '09 at 18:17
Excellent post, thanks. Currently downloading LingPipe. I'm on Windows though, hope it doesn't have a lot of nasty .sh scripts! haha — Phil, Dec 11 '09 at 18:22
We have used LingPipe but it's not open and we have to have an open system for our distribution. If you are just using it personally I don't think there is a problem. — peter.murray.rust, Dec 11 '09 at 18:29
Unfortunately there now seems no evidence that Groucho actually said this. — peter.murray.rust, Jan 22 '16 at 08:19

score 1 · Answer 2 · answered Mar 19 '13 at 08:43

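The following code works for me with TreeTagger:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.annolab.tt4j.TokenHandler;
import org.annolab.tt4j.TreeTaggerException;
import org.annolab.tt4j.TreeTaggerWrapper;

// `tokenizer` is a field of the enclosing class (not shown) whose tokenize(String) returns String[].
public List<String> tag(String str) {
    final List<String> tagLemme = new ArrayList<String>();
    String[] tokens = tokenizer.tokenize(str);
    // Tell tt4j where the TreeTagger installation lives.
    System.setProperty("treetagger.home", "parametresTreeTagger/TreeTagger");
    TreeTaggerWrapper<String> tt = new TreeTaggerWrapper<String>();
    try {
        tt.setModel("parametresTreeTagger/english/english.par");
        tt.setHandler(new TokenHandler<String>() {
            public void token(String token, String pos, String lemma) {
                // Collect each result as token_pos_lemma.
                tagLemme.add(token + "_" + pos + "_" + lemma);
                //System.out.println(token + "_" + pos + "_" + lemma);
            }
        });
        tt.process(Arrays.asList(tokens));
    } catch (IOException e) {
        e.printStackTrace();
    } catch (TreeTaggerException e) {
        e.printStackTrace();
    } finally {
        // Always release the TreeTagger process.
        tt.destroy();
    }
    return tagLemme;
}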

I couldn't even install it properly http://stackoverflow.com/questions/15503388/treetagger-installation-successful-but-cannot-open-par-file — alvas, Mar 19 '13 at 15:36

teabot · Answer 3 · score 1 · 2009-12-11T17:54:32.417

Check out LingPipe. This can supposedly pick out named entities from English text. But I must confess that NLP isn't my area of expertise.

answered Dec 11 '09 at 17:48

score 1 · Answer 4 · answered Dec 11 '09 at 18:20

Based on your edit:

The error says that you must override the abstract method token, and you have a definition for token in your anonymous inner class, but maybe the signature of your token-override doesn't match the signature of the abstract method defined in TokenHandler?
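One likely cause is that the handler type is generic: the working TreeTagger code elsewhere in this thread uses `TokenHandler<String>`. With the raw type `new TokenHandler() {...}`, the abstract method erases to `token(Object, String, String)`, so a method taking `String` as its first parameter does not implement it. The sketch below reproduces the shape with a hypothetical stand-in interface `Handler<O>` so it compiles without tt4j; the stand-in is an assumption, not the library's source.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: implementing a generic callback with a matching signature.
final class HandlerSignatureSketch {

    // Hypothetical stand-in for a generic callback interface such as TokenHandler<O>.
    interface Handler<O> {
        void token(O token, String pos, String lemma);
    }

    static List<String> collect() {
        final List<String> seen = new ArrayList<String>();
        // <String> fixes the abstract method to token(String, String, String).
        // Interface methods are implicitly public, so the override must be public too.
        Handler<String> handler = new Handler<String>() {
            @Override
            public void token(String token, String pos, String lemma) {
                seen.add(token + "\t" + pos + "\t" + lemma);
            }
        };
        handler.token("cars", "NNS", "car");
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collect());
    }
}
```

Adding `@Override` is a cheap check: the compiler then reports a mismatched signature directly on the method instead of on the anonymous class.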

score 0 · Answer 5 · answered Dec 11 '09 at 17:47

Have a look at the WordNet lexical database. You could try matching each word against it and checking whether it's a noun.

I doubt that you will have 100% precision, though; the database doesn't have a match for every possible word in the English language, but at least it's a start.

Scharrels

That's not really accurate. For example, take the sentence "He is walking to school." versus "He said that walking is exhausting." In the second sentence, "walking" is a noun (a verb nominalized via a gerund), but in the first sentence it's the progressive form of the verb "to walk". And that's just one example; there are more problems. – Maximilian Mayerl, Dec 11 '09 at 17:51

score 0 · Answer 6 · edited Oct 10 '12 at 23:58

The easiest way would probably be to compare each word in the text with a dictionary of nouns. After that you're probably going to have to do some elementary parsing and accept approximate correctness in the results. There are lots of online references on parsing natural languages.
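A minimal sketch of the dictionary comparison, assuming a tiny hand-made noun list in place of a real dictionary (the word list, class name and splitting rule are all made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: flag words that appear in a noun list.
final class DictionaryLookupSketch {

    // Assumption: a tiny hand-made list standing in for a real dictionary.
    static final Set<String> NOUNS =
            new HashSet<String>(Arrays.asList("school", "walking", "time", "arrow"));

    // Lower-cases the text, splits on anything that is not a-z,
    // and keeps the words found in the noun list, in order of appearance.
    static List<String> nounCandidates(String text) {
        List<String> found = new ArrayList<String>();
        for (String word : text.toLowerCase(Locale.ENGLISH).split("[^a-z]+")) {
            if (NOUNS.contains(word)) {
                found.add(word);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(nounCandidates("He is walking to school."));
    }
}
```

For "He is walking to school." this reports both "walking" and "school", although "walking" is a verb in that sentence. A lookup without context cannot tell the two uses apart, which is why the results can only be approximate.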

answered Dec 11 '09 at 17:49

High Performance Mark

score 0 · Answer 7 · answered Dec 11 '09 at 17:57

Find a dictionary web site with an API (e.g. WS, RESTful) which you can use to run search queries against.

The results should come in an easily consumable format (e.g. XML, JSON) and should of course include the word's lexical category.

torbengee
