
I've written a prototype in Python that uses the NLTK package to perform 3 NLP tasks:

  1. text normalization (split text into words, remove punctuation and other crud, convert words to base forms)
  2. train and use IBM Translation Model 1
  3. train and use Okapi BM25 model to evaluate relevance of queries

I now need to port this to Java and am looking for existing Java implementations of the 3 tasks.

For the base-form conversion subtask of #1, I would like to be able to supply a specialized dictionary, to better handle text from the specialized domain I am dealing with. If that's not possible, using whatever default is fine too.
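
If I recall correctly, OpenNLP ships a `DictionaryLemmatizer` that loads a custom tab-separated dictionary, which sounds like a fit here. To make the requirement concrete, the core of what I need is no more than this (hypothetical class and method names, plain-Java sketch — entries would really be loaded from the dictionary file, and the fallback would be a default lemmatizer rather than the surface form):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of dictionary-backed base-form conversion: exact lookup in a
// user-supplied domain dictionary, falling back to the surface form.
public class DictLemmatizer {
    private final Map<String, String> lemmas = new HashMap<>();

    // Entries would normally come from the specialized dictionary file.
    public void addEntry(String surfaceForm, String lemma) {
        lemmas.put(surfaceForm.toLowerCase(), lemma);
    }

    public String lemmatize(String token) {
        return lemmas.getOrDefault(token.toLowerCase(), token);
    }

    public static void main(String[] args) {
        DictLemmatizer lem = new DictLemmatizer();
        lem.addEntry("mice", "mouse");
        System.out.println(lem.lemmatize("Mice"));   // prints mouse
        System.out.println(lem.lemmatize("cheese")); // prints cheese
    }
}
```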

Performance is important. The Python version is a prototype, but the Java port has to run in production, so the main requirement is throughput on large volumes of data. The production machines have plenty of RAM, so memory is less of a concern.

Any recommendations? I can use CoreNLP or OpenNLP for #1, but what about #2 and #3?
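
Worst case, I could re-implement Model 1 myself, since it's just a small EM loop. A toy sketch of what I mean (no NULL word, no smoothing, not optimized — just the shape of the algorithm I'd need a production-grade version of):

```java
import java.util.*;

// Toy IBM Model 1: estimates word-translation probabilities t(f|e)
// by EM over a parallel corpus. Each corpus entry is a pair of
// token arrays: pair[0] = source sentence, pair[1] = target sentence.
public class IbmModel1 {
    public static Map<String, Map<String, Double>> train(
            List<String[][]> corpus, int iterations) {
        // target vocabulary, for uniform initialization of t(f|e)
        Set<String> fVocab = new HashSet<>();
        for (String[][] pair : corpus) Collections.addAll(fVocab, pair[1]);
        double uniform = 1.0 / fVocab.size();

        Map<String, Map<String, Double>> t = new HashMap<>();
        for (String[][] pair : corpus)
            for (String e : pair[0])
                for (String f : pair[1])
                    t.computeIfAbsent(e, k -> new HashMap<>()).put(f, uniform);

        for (int it = 0; it < iterations; it++) {
            // E-step: accumulate expected alignment counts
            Map<String, Map<String, Double>> count = new HashMap<>();
            Map<String, Double> total = new HashMap<>();
            for (String[][] pair : corpus) {
                String[] es = pair[0], fs = pair[1];
                for (String f : fs) {
                    double z = 0;
                    for (String e : es) z += t.get(e).get(f);
                    for (String e : es) {
                        double c = t.get(e).get(f) / z;
                        count.computeIfAbsent(e, k -> new HashMap<>())
                             .merge(f, c, Double::sum);
                        total.merge(e, c, Double::sum);
                    }
                }
            }
            // M-step: renormalize counts into probabilities
            for (Map.Entry<String, Map<String, Double>> en : count.entrySet())
                for (Map.Entry<String, Double> fe : en.getValue().entrySet())
                    t.get(en.getKey()).put(fe.getKey(),
                        fe.getValue() / total.get(en.getKey()));
        }
        return t;
    }

    public static void main(String[] args) {
        List<String[][]> corpus = Arrays.asList(
            new String[][]{{"the", "house"}, {"la", "maison"}},
            new String[][]{{"the"}, {"la"}});
        Map<String, Map<String, Double>> t = train(corpus, 10);
        System.out.println(t.get("the"));   // t(la|the) should dominate
        System.out.println(t.get("house")); // t(maison|house) should dominate
    }
}
```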

  • If you have it in `NLTK`, then you didn't really implement them but call the functions ;P It isn't hard to re-implement them in Java since you have a skeleton code in python. Do or do not, there's no try =) – alvas Dec 22 '15 at 18:36
  • IBM model 1 is an EM algorithm in general, so try http://alias-i.com/lingpipe/demos/tutorial/em/read-me.html and Lucene has a version of Okapi BM25. Also, see http://stackoverflow.com/questions/22904025/java-or-python-for-natural-language-processing – alvas Dec 22 '15 at 18:38
  • Stanford CoreNLP does not provide either (2) or (3). You can try the Berkeley Aligner for a good word alignment model. It's not Model 1, but it actually should be substantially better. Lucene is probably your best bet for Okapi BM25. – Gabor Angeli Dec 23 '15 at 02:35
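
For reference, the per-term weight in Okapi BM25 is small enough to write out directly; Lucene's `BM25Similarity` computes essentially this (plain-Java sketch with the usual free parameters `k1` and `b`; the `+1` inside the IDF's log is, I believe, the variant modern Lucene uses to keep scores non-negative):

```java
// Per-term BM25 score: idf * saturated term frequency with
// document-length normalization.
public class Bm25 {
    // idf(docCount, docFreq), with +1 inside the log so it stays >= 0
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    static double score(double tf, double docLen, double avgDocLen,
                        long docCount, long docFreq,
                        double k1, double b) {
        // length-normalized saturation denominator
        double norm = k1 * (1 - b + b * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (k1 + 1) / (tf + norm);
    }

    public static void main(String[] args) {
        // term occurring twice in an average-length doc,
        // 1000 docs in the index, term appears in 10 of them
        System.out.println(score(2, 100, 100, 1000, 10, 1.2, 0.75));
    }
}
```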

0 Answers