Simple Natural Language Processing Startup for Java

Question

I want to start developing an NLP project, but I don't know much about the available tools. After googling for about a month, I realized that OpenNLP could be my solution.

Unfortunately, I don't see any complete tutorial on using the API. All of them are missing some general steps. I need a tutorial from the ground level. I have seen a lot of downloads on the site but don't know how to use them. Do I need to train something? Here is what I want to know:

How do I install / set up an NLP system which can:

parse the words of an English sentence
identify the different parts of speech

see http://stackoverflow.com/questions/22904025/java-or-python-for-natural-language-processing — alvas, Apr 07 '14 at 07:58

score 11 · Accepted Answer · answered Apr 29 '11 at 16:47

You say that you need to 'parse' each sentence. You probably already know this, but just to be explicit, in NLP, the term 'parse' usually means to recover some hierarchical syntactic structure. The most common types are constituent structure (e.g., via a context-free grammar) and dependency structure.
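To make the two structure types concrete, here is a minimal sketch in plain Java (no NLP library involved). The class names `ParseStructures`, `Node`, and `Arc` are made up for illustration, and the example sentence and its analysis are hand-written rather than produced by any parser. The tags and relation labels follow Penn Treebank and Universal Dependencies naming.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of the two parse representations; not any parser's API. */
class ParseStructures {

    /** A node in a constituent (phrase-structure) tree. */
    static final class Node {
        final String label;                       // phrase label or POS tag, e.g. "NP" or "DT"
        final String word;                        // non-null only for leaves
        final List<Node> children = new ArrayList<>();

        Node(String label, String word) {
            this.label = label;
            this.word = word;
        }

        static Node leaf(String tag, String word) {
            return new Node(tag, word);
        }

        static Node phrase(String label, Node... kids) {
            Node n = new Node(label, null);
            for (Node k : kids) {
                n.children.add(k);
            }
            return n;
        }

        /** Renders the tree in bracket notation. */
        String toBrackets() {
            if (word != null) {
                return "(" + label + " " + word + ")";
            }
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) {
                sb.append(' ').append(c.toBrackets());
            }
            return sb.append(')').toString();
        }
    }

    /** One labeled arc from a head word to a dependent word. */
    static final class Arc {
        final String relation;
        final String head;
        final String dependent;

        Arc(String relation, String head, String dependent) {
            this.relation = relation;
            this.head = head;
            this.dependent = dependent;
        }

        @Override
        public String toString() {
            return relation + "(" + head + ", " + dependent + ")";
        }
    }

    /** Hand-built constituent tree for "the dog barks". */
    static Node constituentExample() {
        return Node.phrase("S",
                Node.phrase("NP", Node.leaf("DT", "the"), Node.leaf("NN", "dog")),
                Node.phrase("VP", Node.leaf("VBZ", "barks")));
    }

    /** Hand-built dependency arcs for the same sentence. */
    static List<Arc> dependencyExample() {
        List<Arc> arcs = new ArrayList<>();
        arcs.add(new Arc("nsubj", "barks", "dog"));
        arcs.add(new Arc("det", "dog", "the"));
        return arcs;
    }

    public static void main(String[] args) {
        // prints (S (NP (DT the) (NN dog)) (VP (VBZ barks)))
        System.out.println(constituentExample().toBrackets());
        // prints [nsubj(barks, dog), det(dog, the)]
        System.out.println(dependencyExample());
    }
}
```

The constituent tree nests phrases inside phrases, while the dependency structure is just a set of word-to-word arcs. Which one you need depends on what you plan to do with the output.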

If you need hierarchical structure, I'd recommend you consider just starting with a parser. Most parsers I'm aware of include POS tagging during parsing, and may provide higher accuracy tagging than finite-state POS taggers (Caveat - I'm much more familiar with constituent parsers than with dependency parsers. It's possible some or most dependency parsers would require POS tags as input).

The big downside to parsing is the time complexity. Finite-state POS taggers often run at thousands of words per second. Even greedy dependency parsers are considerably slower, and constituent parsers generally run at 1-5 sentences per second. So if you don't need hierarchical structure, you probably want to stick with a finite-state POS tagger for efficiency.
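To get a feel for what that speed gap means in practice, here is a back-of-the-envelope sketch. The corpus size (100,000 sentences), the average sentence length (20 words), the tagger rate (5,000 words per second), and the parser rate (3 sentences per second) are all assumed numbers picked from within the rough ranges above, not measurements of any particular tool.

```java
import java.util.Locale;

/** Rough processing-time estimate; every rate here is an assumption. */
class ThroughputEstimate {

    /** Seconds for a tagger whose speed is measured in words per second. */
    static double taggerSeconds(long sentences, double wordsPerSentence, double wordsPerSecond) {
        return sentences * wordsPerSentence / wordsPerSecond;
    }

    /** Seconds for a parser whose speed is measured in sentences per second. */
    static double parserSeconds(long sentences, double sentencesPerSecond) {
        return sentences / sentencesPerSecond;
    }

    public static void main(String[] args) {
        long sentences = 100_000;          // assumed corpus size
        double wordsPerSentence = 20.0;    // assumed average sentence length
        double taggerRate = 5_000.0;       // assumed tagger speed, words/second
        double parserRate = 3.0;           // assumed parser speed, sentences/second

        double tagger = taggerSeconds(sentences, wordsPerSentence, taggerRate);
        double parser = parserSeconds(sentences, parserRate);

        System.out.printf(Locale.ROOT, "tagger: %.1f s%n", tagger);
        System.out.printf(Locale.ROOT, "parser: %.1f s (%.1f h)%n", parser, parser / 3600.0);
    }
}
```

Under those assumptions the tagger finishes in about 400 seconds, while the parser needs roughly nine hours, which is why it pays to decide early whether you really need hierarchical structure.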

If you do decide you need parse structure, a few recommendations:

I think the Stanford parser suggested by @aab includes both a constituent parser and a dependency parser.

The Berkeley Parser ( http://code.google.com/p/berkeleyparser/ ) is a pretty well-known PCFG constituent parser that achieves state-of-the-art accuracy (equal or superior to the Stanford parser, I believe) and is reasonably efficient (~3-5 sentences per second).

The BUBS Parser ( http://code.google.com/p/bubs-parser/ ) can also run with the high-accuracy Berkeley grammar, and improves efficiency to around 15-20 sentences/second. Full disclosure - I'm one of the primary researchers working on this parser.

Warning: both of these parsers are research code, with all the problems that engenders. But I'd love to see people actually using BUBS, so if it's of use to you, give it a try and contact me with problems, comments, suggestions, etc.

And a couple Wikipedia references for background if needed:

Context-free grammars: http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
Dependency grammars: http://en.wikipedia.org/wiki/Dependency_grammar

If you don't need parse structure, then definitely stick with a finite-state tagger. It should be much faster and simpler, and pretty comparable in accuracy (at least if you can find a tagging model trained on comparable text). The Stanford POS Tagger is probably a good bet. — AaronD, Apr 29 '11 at 18:24
I am really troubled by so many tools. I don't have a good internet connection, so it would take time to download a new one (Stanford). It would be nice if you could help me do it with OpenNLP. I have gone a little further (http://stackoverflow.com/questions/5836148/how-to-use-opennlp-with-java) and now I just need to use it from a Java application — shababhsiddique, Apr 29 '11 at 18:57
@AaronD I am using the Berkeley Parser, but the site is not helpful when it comes to tutorials. Do you know where I can find a brief description of some bash commands that may be helpful to get started with? — jmishra, Oct 16 '12 at 07:27
@ladiesMan217 With the Berkeley Parser, you're probably pretty much on your own. The primary author (Slav Petrov) is still working on it occasionally for specific research experiments, but it's not really a supported production system. We've included a little more documentation with BUBS; we're trying to write some more as time permits, and I do my best to answer specific questions as they arise. — AaronD, Oct 16 '12 at 17:18

score 4 · Answer 2 · answered Apr 29 '11 at 14:53

Generally you'd do these two tasks in the other order:

1. Do part-of-speech tagging
2. Run a parser using the POS tags as input
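The shape of that two-stage pipeline can be sketched in plain Java without any NLP library. Everything here is a toy and an assumption for illustration: the class `ToyPipeline`, the six-word lexicon, and the fallback tag `NN` are made up, the tagger is a simple dictionary lookup rather than a trained model, and a trivial determiner-plus-noun chunker stands in for the parser. The only point is that the second stage reads the tags produced by the first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy tokenize -> tag -> chunk pipeline; not a real tagger or parser. */
class ToyPipeline {

    // Hypothetical hand-written lexicon; real taggers load a trained model instead.
    private static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("the", "DT");
        LEXICON.put("a", "DT");
        LEXICON.put("dog", "NN");
        LEXICON.put("cat", "NN");
        LEXICON.put("chased", "VBD");
        LEXICON.put("barks", "VBZ");
    }

    /** Separates basic punctuation from words, then splits on whitespace. */
    static String[] tokenize(String sentence) {
        String spaced = sentence.replaceAll("([.,!?])", " $1 ").trim();
        if (spaced.isEmpty()) {
            return new String[0];
        }
        return spaced.split("\\s+");
    }

    /** Step 1: assigns one tag per token by lexicon lookup (unknown words default to NN). */
    static String[] tag(String[] tokens) {
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (token.equals(",")) {
                tags[i] = ",";
            } else if (token.matches("[.!?]")) {
                tags[i] = ".";
            } else {
                tags[i] = LEXICON.getOrDefault(token.toLowerCase(Locale.ROOT), "NN");
            }
        }
        return tags;
    }

    /** Step 2 (stand-in for a parser): groups each DT + NN pair, looking only at the tags. */
    static List<String> chunkNounPhrases(String[] tokens, String[] tags) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tags[i].equals("DT") && tags[i + 1].equals("NN")) {
                chunks.add(tokens[i] + " " + tokens[i + 1]);
                i++; // skip the noun that was just consumed
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("The dog chased a cat.");
        String[] tags = tag(tokens);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + "/" + tags[i]);
        }
        System.out.println(chunkNounPhrases(tokens, tags)); // prints [The dog, a cat]
    }
}
```

With a real toolkit you would swap each method for the library's tokenizer, tagger, and parser, but the data flow (tokens, then tags, then structure) stays the same.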

OpenNLP's documentation isn't that thorough, and some of it has become hard to find due to the switch to Apache. Some (potentially slightly out-of-date) tutorials are available in the old SF wiki.

You might want to take a look at the Stanford NLP tools, in particular the Stanford POS Tagger and the Stanford Parser. Both have downloads that include pre-trained model files and they also have demo files in the top-level directory that show how to get started with the API and short shell scripts that show how to use the tools from the command-line.

LingPipe might be another good toolkit to check out. A quick search here will lead you to a number of similar questions with links to other alternatives, too!

Which one should I use: Stanford CoreNLP, the Stanford Parser, or the Stanford POS Tagger? — shababhsiddique, Apr 29 '11 at 15:16
It depends on what you want/need to do. CoreNLP includes the other two tools plus other annotators, so if you're just experimenting with different kinds of annotation, CoreNLP would be a good place to start. From this question and your related questions, it sounds like you might benefit from reading more about computational linguistics before you get started with your task. I'd suggest Speech and Language Processing by Jurafsky and Martin: http://www.cs.colorado.edu/~martin/slp.html — aab, May 02 '11 at 08:00

score 1 · Answer 3 · answered Jan 11 '14 at 11:28

See Illinois-Curator: http://cogcomp.cs.illinois.edu/page/software_view/Curator

Demo: http://cogcomp.cs.illinois.edu/curator/demo/

It gives you almost everything at one place.

Answered by Daniel

score 0 · Answer 4 · answered Apr 29 '11 at 15:31

The most popular are:

GATE: easy to use and fairly quick to start with
UIMA: steeper learning curve, but more efficient and more generic

Answered by Robert Bossy

Can you give me some walkthrough with gate? – shababhsiddique Apr 29 '11 at 18:31
I suggest you start with using the GATE GUI by following the user guide. There's also a quick start guide. This will allow you to get a grip on GATE basics. Then you may use the API (there are Javadocs and code examples). – Robert Bossy May 03 '11 at 09:41
