Entities containing underscore character are split into multiple entities by TokensAnnotation in CoreNLP

Question

I am observing that CoreNLP 3.9.2 has started splitting enti_ties into multiple entities ('enti', '_', 'ties') while tokenizing.

I have tried setting tokenize.whitespace, which solves this problem. But I think it will also stop the tokenizer from splitting tokens such as "can't" and "don't".

score 1 · Accepted Answer · answered Jan 16 '20 at 04:10

One thing you can do is replace each underscore (_) with a period (.); the parser (and, I believe, the tokenizer) will then interpret the result as one entity.

E.g. enti_ties -> enti.ties, where the latter is retained as one entity.

This doesn't entirely resolve the problem, but serves as a workaround in a pinch.
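
A minimal sketch of that workaround in plain Java, in case it helps. The class and method names (UnderscoreMask, mask, originalSurface) are made up for illustration and are not part of CoreNLP. It assumes, as described above, that the tokenizer keeps enti.ties as one token. Because one character is swapped for one character, character offsets in the masked text still line up with the original text, so the original spelling can be read back by offset.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of CoreNLP.
final class UnderscoreMask {
    // An underscore that has a letter or digit directly on both sides.
    private static final Pattern INNER_UNDERSCORE =
            Pattern.compile("(?<=[\\p{L}\\p{N}])_(?=[\\p{L}\\p{N}])");

    // Replace each such underscore with a period before tokenizing.
    // The result has the same length as the input.
    static String mask(String text) {
        return INNER_UNDERSCORE.matcher(text).replaceAll(".");
    }

    // Recover a token's original spelling from its character offsets
    // in the masked text (begin inclusive, end exclusive).
    static String originalSurface(String originalText, int begin, int end) {
        return originalText.substring(begin, end);
    }

    public static void main(String[] args) {
        String original = "The enti_ties field can't be empty.";
        String masked = mask(original);
        System.out.println(masked);

        // Stand-in for the offsets a tokenizer would report for the token.
        int begin = masked.indexOf("enti.ties");
        int end = begin + "enti.ties".length();
        System.out.println(originalSurface(original, begin, end));
    }
}
```

You would pass the masked string to the pipeline instead of the raw text, then use each token's begin and end character offsets to look up the original spelling. Text that already contains periods between letters is left as it is, so nothing has to be "un-replaced" afterwards.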
