I've written a prototype in Python that uses the NLTK package to perform 3 NLP tasks:
1. text normalization (splitting text into words, removing punctuation and other crud, converting words to base forms)
2. training and using IBM Translation Model 1
3. training and using an Okapi BM25 model to evaluate the relevance of queries
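To make the BM25 task concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the prototype code: the function name `bm25_scores`, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration, and the IDF uses one common smoothed variant.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    `k1` and `b` are the usual BM25 free parameters; the defaults here are
    illustrative assumptions, not values taken from the prototype.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for this example).
docs = [
    ["java", "nlp", "library"],
    ["python", "nltk", "prototype"],
    ["java", "port", "of", "python", "prototype"],
]
scores = bm25_scores(["java", "port"], docs)
```

A Java implementation would need the same ingredients: per-document term frequencies, document frequencies, and the average document length, all of which can be precomputed once per corpus.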
I now need to port this to Java and am looking for existing implementations of the 3 tasks.
For the base form conversion subtask of #1, I would like to be able to supply a specialized dictionary to help better process text from a specialized domain that I am dealing with. But if that's not possible, using whatever default is fine too.
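The dictionary override described here could look roughly like the sketch below: a domain lookup is consulted first, and anything it does not cover goes to a default lemmatizer. Everything in it is hypothetical: `make_lemmatizer`, `naive_fallback`, and the sample entries are made up for illustration, and `naive_fallback` is only a stand-in for whatever default a real library provides.

```python
def make_lemmatizer(domain_lemmas, fallback):
    """Return a function that checks `domain_lemmas` before calling `fallback`."""
    def lemmatize(word):
        key = word.lower()
        if key in domain_lemmas:
            return domain_lemmas[key]
        return fallback(key)
    return lemmatize


def naive_fallback(word):
    # Placeholder for a library's default lemmatizer: strips a trailing
    # plural "s" from words longer than three characters.
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


# Hypothetical domain dictionary (the entries are invented examples).
domain_lemmas = {
    "angioplasties": "angioplasty",
    "stents": "stent",
}

lemmatize = make_lemmatizer(domain_lemmas, naive_fallback)
result = [lemmatize(w) for w in ["Angioplasties", "stents", "catheters", "is"]]
```

Keeping the override as a plain map in front of the default lemmatizer means the same wrapper works regardless of which library supplies the fallback.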
Performance is important. The Python version is a prototype, but the Java port has to work in production. The main requirement is scalability in terms of speed for large volumes of data. The prod machines have lots of RAM, so memory is less of a concern.
Any recommendations? I can use CoreNLP or OpenNLP for #1 but what about #2 and 3?
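For reference when comparing candidate libraries on the IBM Model 1 task, the training procedure is a small EM loop over sentence pairs. The sketch below is a simplified plain-Python version, not the prototype code: `train_ibm_model1` and the toy sentence pairs are assumptions for illustration, and it omits the NULL source word that full implementations usually include.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(tgt_word, src_word)] = P(tgt_word | src_word) with EM.

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    Simplification (assumption): no NULL source word is added.
    """
    tgt_vocab = {tw for _, tgt in pairs for tw in tgt}
    uniform = 1.0 / len(tgt_vocab)
    # Start from a uniform translation table.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (tgt, src) alignments
        total = defaultdict(float)  # expected count of each src word

        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for tw in tgt:
                z = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    c = t[(tw, sw)] / z
                    count[(tw, sw)] += c
                    total[sw] += c

        # M-step: renormalize expected counts per source word.
        t = defaultdict(
            float,
            {(tw, sw): c / total[sw] for (tw, sw), c in count.items()},
        )
    return t


# Toy parallel corpus (made up for this example).
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
t = train_ibm_model1(pairs)
```

The inner loops are independent per sentence pair, so the E-step is the natural place to parallelize when training on large volumes of data.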