
I am new to NLP and I came across OpenNLP. From my understanding, tokenization means segmenting text into words and sentences. Words are often separated by white space, but not all white spaces are equal. For example, Los Angeles is a single unit of meaning regardless of the white space. But whenever I run the OpenNLP Tokenizer, it creates two distinct tokens for Los Angeles: Los & Angeles. Here is my code (I got the model en-token.bin from the old OpenNLP site).

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// The sentence being tokenized
String sentence2 = "The city of Los Angeles is one of the most beautiful places in California";

// Load the pre-trained English token model
InputStream inputStream = new FileInputStream("C:\\apache-opennlp-1.9.0\\Models\\en-token.bin");
TokenizerModel tokenModel = new TokenizerModel(inputStream);

// Instantiating the TokenizerME class
TokenizerME tokenizer = new TokenizerME(tokenModel);

String[] tokens = tokenizer.tokenize(sentence2);

for (String token : tokens) {
    System.out.println(token);
}

Here is the output:

The
city
of
Los
Angeles
is
one
of
the
most
beautiful
places
in
California

I tested some other tokenizers online and they produced the same output. If this is not the job of tokenization, what process would identify that these two words belong together?


3 Answers


Basic word tokenizers usually split sentences into words at every space. To handle cases where a multi-word name (like Los Angeles or New Delhi) should not be split apart, we can write our own function that, after tokenization, looks the tokens up in some kind of dictionary of known multi-word names and merges consecutive tokens back into a single unit.
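
For example, here is a minimal sketch of that merging step (the dictionary contents, class, and method names are made up for illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PhraseMerger {

    // Hypothetical dictionary of multi-word names that should stay together
    private static final Set<String> KNOWN_PHRASES = Set.of("Los Angeles", "New Delhi");

    // Merges consecutive tokens whose concatenation is a known phrase
    public static List<String> merge(String[] tokens) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            // Try the two-token candidate first, then fall back to the single token
            if (i + 1 < tokens.length
                    && KNOWN_PHRASES.contains(tokens[i] + " " + tokens[i + 1])) {
                merged.add(tokens[i] + " " + tokens[i + 1]);
                i += 2;
            } else {
                merged.add(tokens[i]);
                i++;
            }
        }
        return merged;
    }
}

Called on the tokens from the question, merge(new String[] {"The", "city", "of", "Los", "Angeles"}) returns [The, city, of, Los Angeles].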

vatbub

OpenNLP provides the TokenizerTrainer tool to train a tokenizer on your own data. The OpenNLP training format contains one sentence per line, with tokens separated either by whitespace or by a special <SPLIT> tag.
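
A line of training data in that format looks like this (it follows the format described in the OpenNLP manual, where the <SPLIT> tag marks a token boundary that has no whitespace around it):

Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.

A model can then be trained from the command line (the file names here are placeholders):

opennlp TokenizerTrainer -model en-token-custom.bin -lang en -data en-token.train -encoding UTF-8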

You can follow this blog for a head start with OpenNLP for various purposes. The post shows you how to create a training file and build a new model.

You can easily create your own training data set using the modelbuilder addon, following the rules mentioned here, to train a good NER model.

You can find some help on using the modelbuilder addon here.

Basically, you put all the text in one file and the NER entities in another. The addon searches for each entity and replaces it with the required tag, producing the tagged data. It should be pretty easy to use this tool!
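
For reference, OpenNLP's name-finder training data marks entities with <START:type> ... <END> tags, one sentence per line. A tagged line for the question's sentence might look like this (the "place" type name is just an illustration):

The city of <START:place> Los Angeles <END> is one of the most beautiful places in California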

Also, follow mr. markg's answer to get an understanding of how to create new models on your own. This will help you build models customized for your applications.

Hope this helps!

iamgr007

As you noticed, English tokenizers basically always treat spaces as word boundaries. You could have a smart tokenizer that checks whether a given space should be split on, but in an NLP pipeline (for languages that use spaces) the tokenizer is usually kept as simple as possible. Joining words that have been separated is a separate step that comes after tokenization.

Finding words that should be joined, like "Los Angeles", is usually called phrase detection or collocation detection. The phrases themselves are often called multi-word expressions (MWEs) in academic writing. OpenNLP doesn't seem to have any MWE-related functionality, but Gensim has an easy-to-use Phrases module, and its documentation links to the papers the implementation is based on. Stanford's CoreNLP also has a library for the task.
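
Gensim's Phrases is a Python API, but the idea behind it is easy to sketch in Java. The following is a hand-rolled illustration, not any library's API: count unigrams and bigrams over a tokenized corpus, then score each bigram with the formula from Mikolov et al. (2013), which is also Gensim's default scorer: score(a, b) = (count(a b) - minCount) * vocabSize / (count(a) * count(b)). Pairs above a chosen threshold are joined.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BigramScorer {

    // Scores every adjacent token pair in a tokenized corpus.
    // High scores mean the pair co-occurs much more often than chance,
    // which makes it a candidate multi-word expression.
    public static Map<String, Double> scoreBigrams(List<String[]> sentences, int minCount) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();

        for (String[] sentence : sentences) {
            for (int i = 0; i < sentence.length; i++) {
                unigrams.merge(sentence[i], 1, Integer::sum);
                if (i + 1 < sentence.length) {
                    bigrams.merge(sentence[i] + " " + sentence[i + 1], 1, Integer::sum);
                }
            }
        }

        Map<String, Double> scores = new HashMap<>();
        double vocabSize = unigrams.size();
        for (Map.Entry<String, Integer> entry : bigrams.entrySet()) {
            String[] pair = entry.getKey().split(" ");
            double score = (entry.getValue() - minCount) * vocabSize
                    / (unigrams.get(pair[0]) * (double) unigrams.get(pair[1]));
            scores.put(entry.getKey(), score);
        }
        return scores;
    }
}

Run over a large corpus, "Los Angeles" would score far higher than a merely frequent pair like "of the" (whose parts are also very frequent on their own), and the surviving bigrams can serve as the phrase dictionary for a merging step like the one in the other answers.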

polm23