I am new to NLP and I came across OpenNLP. From my understanding, tokenization means segmenting text into words and sentences. Words are often separated by white space, but not all white spaces are equal. For example, "Los Angeles" is a single concept regardless of the white space between the two words. But whenever I run the OpenNLP Tokenizer, it creates two distinct tokens for Los Angeles: "Los" and "Angeles". Here is my code (I got the model en-token.bin from the old OpenNLP site):
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Load the pre-trained English tokenizer model
InputStream inputStream = new FileInputStream("C:\\apache-opennlp-1.9.0\\Models\\en-token.bin");
TokenizerModel tokenModel = new TokenizerModel(inputStream);

// Instantiating the TokenizerME class (the learnable, maxent-based tokenizer)
TokenizerME tokenizer = new TokenizerME(tokenModel);

String sentence2 = "The city of Los Angeles is one of the most beautiful places in California";
String[] tokens = tokenizer.tokenize(sentence2);
for (String token : tokens) {
    System.out.println(token);
}
inputStream.close();
Here is the output:
The
city
of
Los
Angeles
is
one
of
the
most
beautiful
places
in
California
I tested some other tokenizers online and they produce the same output. If this is not the job of tokenization, what process would identify that these two words belong together?
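To illustrate the behavior I am looking for, here is a minimal, self-contained sketch that merges adjacent tokens when the pair appears in a hand-made list of multi-word expressions. This is only a naive stand-in for what a real system would do (e.g. named-entity recognition or chunking with a trained model); the class name `MweMerge` and the gazetteer contents are my own invention for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MweMerge {
    // Hypothetical gazetteer of multi-word expressions;
    // a trained NER or chunking model would replace this in practice.
    static final Set<String> MWE = Set.of("Los Angeles");

    static List<String> merge(String[] tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            String bigram = (i + 1 < tokens.length)
                    ? tokens[i] + " " + tokens[i + 1] : null;
            if (bigram != null && MWE.contains(bigram)) {
                out.add(bigram);   // merge the two tokens into one unit
                i += 2;
            } else {
                out.add(tokens[i]);
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "city", "of", "Los", "Angeles", "is", "beautiful"};
        // prints [The, city, of, Los Angeles, is, beautiful]
        System.out.println(merge(tokens));
    }
}
```

This only handles exact bigram matches from a fixed list, so it would miss unseen place names; I assume the "real" answer involves a statistical model rather than a lookup table.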