How to prepare training data for OpenNLP to Tokenize the token that contains more than one word?

Question

In some languages (for example, Vietnamese), a single vocabulary item can consist of several whitespace-separated words, so tokens that span more than one word cannot be produced by splitting on whitespace alone.

I have the following input:

Người dân địa phương đã nhiều lần báo Điện lực Bến Tre nhưng chưa được giải quyết .

Expected output:

["Người dân", "địa phương", "đã", "nhiều", "lần", "báo", "Điện lực", "Bến Tre", "nhưng", "chưa", "được", "giải quyết"]

In my training data, I use _ to join the words that should stay together in one token:

Người_dân địa_phương đã nhiều lần báo Điện_lực Bến_Tre nhưng chưa được giải_quyết .
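
To make the intended mapping concrete, here is a minimal Python sketch of how one underscore-joined training line corresponds to the expected tokens. The helper name `underscore_line_to_tokens` is hypothetical and not part of OpenNLP; it only illustrates the convention used in the training file above.

```python
def underscore_line_to_tokens(line):
    """Split on whitespace, then turn '_' inside each chunk back into a space.

    Hypothetical helper for illustration only; not an OpenNLP API.
    """
    return [chunk.replace("_", " ") for chunk in line.split()]


line = "Người_dân địa_phương đã nhiều lần báo Điện_lực Bến_Tre nhưng chưa được giải_quyết ."
tokens = underscore_line_to_tokens(line)
print(tokens)
```

Note that this keeps the final "." as its own token, which the expected output above leaves out.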

Here is the command line I use to train:

opennlp TokenizerTrainer -model "model/vi-token.bin" -alphaNumOpt 1 -lang "vi" -data "data/merge_vlsp_removehtml" -encoding "UTF-8" -params param/wordseg.param

with the parameter file:

Iterations=1000

However, the output does not join multiple words into one token; it is still split on whitespace.

The command I run to get the output:

opennlp TokenizerME model/vi-token.bin < sample/sample_text > sample/sample_text.out

What should I change in the training data or the config parameters to train the tokenizer to produce tokens that contain multiple words?

score 0 · Answer 1 · answered Jul 25 '18 at 13:39

Rather than using underscores in the training data, use tags; OpenNLP relies on tags as the reference when training. Follow the instructions given for NER and for training your tokenizer.

OpenNLP provides the 'TokenizerTrainer' tool for training a tokenizer model. The OpenNLP format contains one sentence per line, and tokens are separated either by whitespace or by a special <SPLIT> tag.
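
As a rough illustration of that format, the following Python sketch builds one training line from a raw sentence and its tokens, inserting `<SPLIT>` where two tokens touch without whitespace. The function name `to_split_format` and the sample sentence are assumptions made for this example. It only demonstrates the file format; it does not show whether the trained tokenizer can merge whitespace-separated words.

```python
import unicodedata


def to_split_format(raw_sentence, tokens):
    """Return one training line: tokens that touch in the raw text are joined
    with <SPLIT>, tokens separated by whitespace are joined with one space.

    Hypothetical helper for illustration only; not an OpenNLP API.
    """
    # Normalize so that precomposed and combining Vietnamese characters compare equal.
    raw = unicodedata.normalize("NFC", raw_sentence)
    parts = []
    pos = 0  # offset in raw just past the previous token
    for index, token in enumerate(tokens):
        token = unicodedata.normalize("NFC", token)
        start = raw.find(token, pos)
        if start < 0:
            raise ValueError(f"token {token!r} not found after offset {pos}")
        gap = raw[pos:start]
        if gap.strip():
            raise ValueError(f"text {gap!r} is not covered by any token")
        if index > 0:
            # Whitespace in the original text stays a space; no gap means <SPLIT>.
            parts.append(" " if gap else "<SPLIT>")
        parts.append(token)
        pos = start + len(token)
    return "".join(parts)


raw = "chưa được giải quyết."
tokens = ["chưa", "được", "giải", "quyết", "."]
training_line = to_split_format(raw, tokens)
print(training_line)
```

Here the only `<SPLIT>` lands between the last word and the period, because that is the only place where two tokens are adjacent without whitespace.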

You can follow this blog for a head start on using OpenNLP for various purposes. The post shows how to create a training file and build a new model.

You can easily create your own training data set using the modelbuilder addon, and follow the rules mentioned here to train a good NER model.

You can find some help on using the modelbuilder addon here.

Basically, you put all the text in one file and the NER entities in another. The addon searches for each particular entity and replaces it with the required tag, thereby producing the tagged data. The tool should be pretty easy to use.
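
The search-and-replace idea can be sketched in a few lines of Python. This is not the addon's actual implementation, only an approximation of the behaviour described above; the function name `tag_entities` and the entity type "location" are assumptions. The `<START:type> ... <END>` markup is the format used for OpenNLP name finder training data.

```python
import re
import unicodedata


def tag_entities(sentence, entities, entity_type):
    """Wrap every whole-word occurrence of a known entity in name finder markup.

    Hypothetical helper for illustration only; not the modelbuilder addon's code.
    """
    sentence = unicodedata.normalize("NFC", sentence)
    # Longest names first, so a longer entity wins over a shorter one it contains.
    names = sorted(
        {unicodedata.normalize("NFC", e) for e in entities if e},
        key=len,
        reverse=True,
    )
    if not names:
        return sentence
    # One pass with a single alternation, so already tagged text is never re-tagged.
    pattern = re.compile(
        r"(?<!\S)(?:" + "|".join(map(re.escape, names)) + r")(?!\S)"
    )
    return pattern.sub(
        lambda m: f"<START:{entity_type}> {m.group(0)} <END>", sentence
    )


sentence = "Người dân địa phương đã nhiều lần báo Điện lực Bến Tre nhưng chưa được giải quyết ."
tagged = tag_entities(sentence, ["Bến Tre"], "location")
print(tagged)
```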

Also, follow markg's answer to understand how to create new models on your own. This will help you build models that can be customized for your applications.

Hope this helps!
