
I am using the following models of OpenNLP:

en-parser-chunking.bin
en-ner-person.bin
en-ner-location.bin
en-ner-organization.bin

I want to append my own data to the training dataset on which these models were trained. Please tell me where I can get that raw dataset.


2 Answers


The Chunker Training section of the official OpenNLP manual points to the raw data used to train the EN language model files:

The training data can be converted to the OpenNLP chunker training format, that is based on CoNLL2000.
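For orientation, CoNLL2000-style chunker training data (which the OpenNLP chunker training format follows) looks roughly like this: one token per line with its POS tag and chunk tag, and a blank line between sentences. The snippet below is the well-known example sentence from that corpus:

    He        PRP  B-NP
    reckons   VBZ  B-VP
    the       DT   B-NP
    current   JJ   I-NP
    account   NN   I-NP
    deficit   NN   I-NP
    will      MD   B-VP
    narrow    VB   I-VP
    to        TO   B-PP
    only      RB   B-NP
    #         #    I-NP
    1.8       CD   I-NP
    billion   CD   I-NP
    in        IN   B-PP
    September NNP  B-NP
    .         .    O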

You will also find other references to external resources used in/for OpenNLP there, e.g., in Chapter 12, Corpora.

Additionally, the CoNLL2003 corpus might be of interest:

The English data is the Reuters Corpus, which is a collection of news wire articles. The Reuters Corpus can be obtained free of charge from NIST for research purposes: http://trec.nist.gov/data/reuters/reuters.html
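If you do obtain the CoNLL2003 data, recent OpenNLP releases also ship format-aware command line trainers, so something along these lines should train a person model directly from it (a sketch only; the exact tool name and flags vary between OpenNLP versions, and the file names here are placeholders — run the tool without arguments to see the usage for your version):

    $ opennlp TokenNameFinderTrainer.conll03 -model en-ner-person-custom.bin \
        -lang eng -types per -data eng.train -encoding UTF-8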

Hope it helps.

MWiesner
  • Can you please help me with how to train the existing NER models on my own sample data using the OpenNLP API? – Madhvi Gupta Feb 07 '17 at 06:26
  • I'm afraid that is a separate question for Stack Overflow. I provided an answer to the question "where can I get that raw dataset?", which IMHO is a valid and acceptable answer. You might consider asking a new, separate question, and I'll have a look at it. – MWiesner Feb 07 '17 at 08:51
  • I have got the Reuters dataset, but now I just want to know how to proceed with it, i.e. how to append my own data to it. – Madhvi Gupta Feb 16 '17 at 07:36

There are add-ons available for that. Use the modelbuilder-addon to update an existing NER model, or to create a new one more quickly.

The code in the link reads in your sentences and lets the default en-ner-person model do its best on them. It then writes those results to a file of good hits and a file of bad hits, and finally feeds those files into the "modelbuilder-addon" call at the bottom.
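For illustration only, here is a minimal sketch of the first half of that flow with the standard OpenNLP API: load the stock en-ner-person model, run it over your own sentences, and split the detections into "good" and "bad" files by confidence. The addon call itself is omitted, and the file names, threshold, and sentence source are placeholders:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.PrintWriter;
    import java.util.Arrays;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.Span;

    public class PersonHitSplitter {

        public static void main(String[] args) throws Exception {
            try (InputStream modelIn = new FileInputStream("en-ner-person.bin");
                 PrintWriter goodHits = new PrintWriter("good-hits.txt");
                 PrintWriter badHits = new PrintWriter("bad-hits.txt")) {

                // Load the stock person model and wrap it in a name finder.
                NameFinderME finder = new NameFinderME(new TokenNameFinderModel(modelIn));

                // Stand-in for however you load your own sentences.
                String[] sentences = {
                    "John Smith joined Acme Corp in Berlin last year ."
                };

                for (String sentence : sentences) {
                    String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(sentence);
                    Span[] spans = finder.find(tokens);
                    double[] probs = finder.probs(spans);

                    // Crude split: confident detections go to the "good" file,
                    // low-confidence ones to the "bad" file for manual review.
                    for (int i = 0; i < spans.length; i++) {
                        String hit = String.join(" ",
                                Arrays.copyOfRange(tokens, spans[i].getStart(), spans[i].getEnd()));
                        if (probs[i] > 0.9) {
                            goodHits.println(hit);
                        } else {
                            badHits.println(hit);
                        }
                    }
                    // Reset document-level context between unrelated sentences.
                    finder.clearAdaptiveData();
                }
            }
        }
    }

In practice you would still review the two files by hand before handing them to the addon.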

Hope this helps!

iamgr007
  • Hey! Are there any examples of how to use it to update existing OpenNLP models? – Abhishek Sengupta Feb 25 '20 at 14:38
  • Check out my repository: https://github.com/iamgr007/srae/blob/master/src/training/UpdateExisitingModel.java – iamgr007 Feb 28 '20 at 17:14
  • Hi @iamgr007, thanks, will check. But is there a Maven repo for the modelbuilder-addon? – Abhishek Sengupta Mar 02 '20 at 19:25
  • Can you please tell me what getSentencesFromSomewhere() does? Does it bring in normal, un-annotated sentences to be analyzed? Another question: if that's so, how many sentences are required in that file so that I can build a good model out of it? – Abhishek Sengupta Mar 02 '20 at 19:53
  • 1
    @AbhishekSengupta getSentencesFromSomewhere() gets the sentences from the dataset you provide, fully annotated (proper pre processing must be done) and I guess there is no limit for no.of sentences as more data = good model. Anyways, try to build a model with a large and extensive dataset. If something doesn't workout, check opennlp documentation for any limitations. – iamgr007 Apr 13 '20 at 05:42
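For reference, "fully annotated" here means sentences in OpenNLP's name finder training format: whitespace-separated tokens with the entity spans marked up, one sentence per line, for example:

    <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
    Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .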