
I am using the OpenNLP Token Name Finder to parse unstructured data. I have created a corpus (training set) of 4MM records, but when I build a model from this corpus using the OpenNLP API in Eclipse, the process takes around 3 hours, which is very time consuming. The model is built with the default parameters, i.e. 100 iterations and a cutoff of 5.
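
For context, the model is built roughly like this (a minimal sketch using the OpenNLP 1.5.x training API; `train.txt` and the `person` entity type are placeholders for my actual setup):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

public class TrainNameFinder {
    public static void main(String[] args) throws IOException {
        // "train.txt" holds one annotated NameSample per line (placeholder name).
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream("train.txt"), Charset.forName("UTF-8"));
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ITERATIONS_PARAM, "100"); // the default
        params.put(TrainingParameters.CUTOFF_PARAM, "5");       // the default

        // null feature generator means OpenNLP uses its default generators.
        TokenNameFinderModel model = NameFinderME.train("en", "person",
                sampleStream, params, (AdaptiveFeatureGenerator) null,
                Collections.<String, Object>emptyMap());
        sampleStream.close();
    }
}
```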

So my question is: how can I speed up this process and reduce the time it takes to build the model?

The size of the corpus could be the reason, but I just wanted to know whether someone has come across this kind of problem before and, if so, how they solved it.

Please provide some clue.

Thanks in advance!

Nikhil Jain
  • Try JVM memory parameters: `-Xms512m -Xmx2048m` – Ramanan Nov 19 '14 at 05:26
  • Thanks for suggesting this, but I have already increased -Xmx to 10GB since the process uses around 10GB of memory. Even after increasing the memory it still takes 3 hours, which is why I am a bit concerned. – Nikhil Jain Nov 19 '14 at 05:36
  • There is no other way to speed up the process. Exporting it as a jar file and running it may give you an extra ~500MB (which Eclipse takes). Is that 4 million records? I guess GATE (https://gate.ac.uk/) would take even more time than this. – Ramanan Nov 19 '14 at 05:43
  • OK, I will try to export the project as a jar file and run it from the command prompt. Yes, the corpus contains 4 million records. Do you have any idea how I can run this on Spark to speed up the process? – Nikhil Jain Nov 19 '14 at 05:49
  • Going to a scalable, distributed solution (like Apache Spark) is probably the right idea. I'm not sure what sort of model you are building, but Spark's MLlib supports a number of types. https://spark.apache.org/docs/1.1.0/mllib-guide.html – Daniel Darabos Nov 25 '14 at 10:50
  • I know Spark's MLlib is one option, but I have been working with OpenNLP for a long time and do not want to switch from OpenNLP to MLlib. I found on the web that it is possible to integrate OpenNLP with Spark using uimaFIT, but I did not find good examples. – Nikhil Jain Nov 28 '14 at 12:16

2 Answers


Usually the first approach to handling such issues is to split the training data into several chunks and let each one produce a model of its own; afterwards you merge the models. I am not sure this is valid in this case (I'm not an OpenNLP expert), so there's another solution below. Also, as the OpenNLP API seems to provide only a single-threaded train() method, I would file an issue requesting a multi-threaded option.

For a slow single-threaded operation the two main limiting factors are IO and CPU, and both can be handled separately:

  • IO - which hard drive do you use, a regular (magnetic) one or an SSD? Moving to an SSD should help.
  • CPU - which CPU are you using? Moving to a faster CPU will help. Don't pay attention to the number of cores, as here you want raw single-core speed.

An option you may want to consider is to get a high-CPU server from Amazon Web Services or Google Compute Engine and run the training there; you can download the model afterwards. Both offer high-CPU servers with Xeon (Sandy Bridge or Ivy Bridge) CPUs and local SSD storage.

David Rabinowitz
  • Is it possible to create small models and later merge them into one? I am curious to know which library supports this. – Nikhil Jain Nov 28 '14 at 12:18
  • Hi @Nikhil, as I wrote, I'm not an NLP expert, so it was just an idea. As you can see from my answer, I don't think it's possible, so I've listed other options. – David Rabinowitz Nov 28 '14 at 12:28
  • Hi David, I have increased my RAM, and the model now builds much faster than before (taking 50 min on 3MM records). Moreover, the train() method also supports multi-threading, so I am adding some threads, say 10, which also improves performance (see the sketch after these comments). Thanks for the nice suggestions. – Nikhil Jain Dec 01 '14 at 12:31
  • 1
    I worked with NLP and OpenNLP long enough to tell you this, it's not possible to combine smaller models to build a bigger one. Some of the smaller models may contain features that other models haven't even seen. The last step in training is to assign weights to all features, likelihood that the event occurs given a particular feature occurred. The only option is to select the best performing model of the many smaller models and merging is not an option. – Vihari Piratla Dec 24 '14 at 17:35

I think you should make algorithm-related changes before upgrading the hardware.
Reducing the sentence size
Make sure you don't have unnecessarily long sentences in the training sample. Such sentences don't improve the model but have a huge impact on computation time (I am not sure of the order). I generally put a cutoff at 200 words/sentence. Also look at the features closely; these are the default feature generators:
  • two kinds of WindowFeatureGenerator with a default window size of only two
  • OutcomePriorFeatureGenerator
  • PreviousMapFeatureGenerator
  • BigramNameFeatureGenerator
  • SentenceFeatureGenerator
These feature generators produce the following features for the word Robert in the given sentence:

Sentence: Robert, creeley authored many books such as Life and Death, Echoes and Windows.
Features:
```
w=robert
n1w=creeley
n2w=authored
wc=ic
w&c=robert,ic
n1wc=lc
n1w&c=creeley,lc
n2wc=lc
n2w&c=authored,lc
def
pd=null
w,nw=Robert,creeley
wc,nc=ic,lc
S=begin
```


Here `ic` means initial capital and `lc` means lower case.

Of these features, S=begin is the only sentence-dependent feature; it marks that Robert occurred at the start of the sentence.
My point is to explain the role of the complete sentence in training. You can actually drop the SentenceFeatureGenerator and reduce the sentence size further to accommodate only a few words in the window around the desired entity; this will work just as well.
I am sure this will have a huge impact on computation and very little on performance.
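
For illustration, a hedged sketch of building a custom feature-generator aggregate that mirrors the default generators listed above but leaves out the SentenceFeatureGenerator (class names are from the opennlp.tools.util.featuregen package; the window sizes follow the documented defaults):

```java
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
import opennlp.tools.util.featuregen.BigramNameFeatureGenerator;
import opennlp.tools.util.featuregen.CachedFeatureGenerator;
import opennlp.tools.util.featuregen.OutcomePriorFeatureGenerator;
import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
import opennlp.tools.util.featuregen.TokenFeatureGenerator;
import opennlp.tools.util.featuregen.WindowFeatureGenerator;

public class CustomFeatureGen {
    // The default aggregate minus SentenceFeatureGenerator, so S=begin is
    // no longer emitted and sentences can be trimmed to small windows.
    public static AdaptiveFeatureGenerator create() {
        return new CachedFeatureGenerator(
                new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
                new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
                new OutcomePriorFeatureGenerator(),
                new PreviousMapFeatureGenerator(),
                new BigramNameFeatureGenerator());
    }
}
```

Pass the result of CustomFeatureGen.create() to NameFinderME.train(...) in place of a null feature generator.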

Have you considered sampling?
As I described above, the features are a very sparse representation of the context. You may have many sentences that look like duplicates as seen by the feature generators. Try to detect these and sample in a way that keeps sentences with diverse patterns, i.e. it should be impossible to write only a few regular expressions that match them all. In my experience, training samples with diverse patterns did better than those representing only a few patterns, even when the former had a much smaller number of sentences. Sampling this way should not affect model performance at all.
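
A hedged sketch of one way to implement such sampling: collapse each sentence into a coarse token-class signature (roughly the shape the feature generators see) and cap how many sentences are kept per signature. The signature scheme and the cap of 5 are illustrative choices, not anything prescribed by OpenNLP:

```java
import java.util.HashMap;
import java.util.Map;

public class PatternSampler {
    private static final int MAX_PER_PATTERN = 5; // illustrative cap
    private final Map<String, Integer> seen = new HashMap<String, Integer>();

    // Collapse a tokenized sentence into a coarse shape pattern.
    private static String signature(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            if (t.matches("[0-9]+")) sb.append('N');        // number
            else if (t.matches("[A-Z].*")) sb.append('C');  // capitalized
            else if (t.matches("[a-z].*")) sb.append('l');  // lower case
            else sb.append('p');                            // punctuation/other
        }
        return sb.toString();
    }

    // Keep a sentence only while its pattern is still under-represented.
    public boolean keep(String[] tokens) {
        String sig = signature(tokens);
        Integer count = seen.get(sig);
        int c = (count == null) ? 0 : count;
        if (c >= MAX_PER_PATTERN) return false;
        seen.put(sig, c + 1);
        return true;
    }
}
```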

Thank you.

Vihari Piratla
  • Thanks Vihari for the answer. I understand your point, but how can I drop the SentenceFeatureGenerator? How can I see the features; does OpenNLP provide any command-line utility for this? Can you please share how you generated the features for the above sentence? Once I get the feature generators, I will try to identify duplicates. It is quite possible that the training set contains duplicates, as I am creating it programmatically. – Nikhil Jain Jan 06 '15 at 06:56
  • The SentenceFeatureGenerator is just a fancy sentence tokenizer; it does not affect accuracy in any way. OpenNLP does not provide a command-line utility to see the features; I added a few print statements to their code (I don't remember where; see the sketch after these comments). As I said in the answer, the features depend only on a small window around an entity (OpenNLP calls them events): identify such entities, preserve just the window around them, and emit these sub-sentences as sentences. Or try the second method (sampling), which is easier (you just have to fit a few regexes) and which I am more confident about. – Vihari Piratla Jan 06 '15 at 08:28
  • Hi David, is there any option of integrating OpenNLP with Spark to parallelize the parsing process? – Praveen Kumar K S Jul 20 '16 at 15:57
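
Regarding the comment above about print statements: a less invasive alternative to editing OpenNLP's source is to wrap a feature generator in a logging decorator. A hedged sketch against the 1.5.x AdaptiveFeatureGenerator interface (the class name is hypothetical):

```java
import java.util.List;

import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

// Hypothetical decorator that prints every feature the wrapped
// generator emits for each token it is asked about.
public class LoggingFeatureGenerator implements AdaptiveFeatureGenerator {
    private final AdaptiveFeatureGenerator delegate;

    public LoggingFeatureGenerator(AdaptiveFeatureGenerator delegate) {
        this.delegate = delegate;
    }

    public void createFeatures(List<String> features, String[] tokens,
                               int index, String[] previousOutcomes) {
        int before = features.size();
        delegate.createFeatures(features, tokens, index, previousOutcomes);
        // Print only the features added for the current token.
        for (String f : features.subList(before, features.size())) {
            System.out.println(tokens[index] + " -> " + f);
        }
    }

    public void updateAdaptiveData(String[] tokens, String[] outcomes) {
        delegate.updateAdaptiveData(tokens, outcomes);
    }

    public void clearAdaptiveData() {
        delegate.clearAdaptiveData();
    }
}
```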