Among the open source NLP libraries, is there one whose tokenizer handles missing white spaces? For instance, the phrase "this tokenizer isgreat" would give [this, tokenizer, is, great] instead of [this, tokenizer, isgreat].

Ying Xie
  • This is a good question, but unfortunately there is nothing currently. – Daniel Sep 07 '17 at 22:23
  • Creating it shouldn't be very hard, though, if you have a big word index. – Daniel Sep 07 '17 at 22:23
  • 1
    Thanks. Yes I found [this post](https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words) that has the code for adding white spaces. Still hoping there is a tokenizer that does it since I need to tokenize the sentences. – Ying Xie Sep 08 '17 at 01:23
  • Did you find an answer to this? – Judy T Raj Aug 25 '20 at 07:20
  • This library: https://redditscore.readthedocs.io/en/master/index.html does it for hashtags, but not for other words. So you could either convert all the tokens to hashtags and use the CrazyTokenizer, or modify the library's source code for your needs. – Ashwin Geet D'Sa Sep 01 '20 at 13:45
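
For illustration, the word-index idea from the comments could be sketched roughly as below. This is a minimal sketch, not an existing library's tokenizer: `VOCAB`, `split_token` and `tokenize` are hypothetical names, the vocabulary is a toy stand-in for a big word index, and preferring the split with the fewest words is an assumed heuristic.

```python
import re

# Hypothetical toy vocabulary; a real word index would be far larger.
VOCAB = {"this", "tokenizer", "is", "great", "a", "i"}


def split_token(token, vocab=VOCAB):
    """Split one token into the fewest vocabulary words, or return it unchanged."""
    n = len(token)
    max_len = max(map(len, vocab), default=0)
    # best[i] is the shortest list of vocabulary words spelling token[:i],
    # or None when token[:i] cannot be split into vocabulary words.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = token[start:end]
            if best[start] is None or piece.lower() not in vocab:
                continue
            candidate = best[start] + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    # Assumption: tokens that cannot be fully segmented are kept as they are.
    return best[n] if best[n] is not None else [token]


def tokenize(text, vocab=VOCAB):
    """Split on whitespace and punctuation, then repair missing spaces."""
    tokens = []
    for raw in re.findall(r"\w+|[^\w\s]", text):
        tokens.extend(split_token(raw, vocab))
    return tokens


print(tokenize("this tokenizer isgreat"))
```

With this toy vocabulary, `"isgreat"` is split into `is` and `great`, while words missing from the vocabulary pass through untouched. The fewest-words rule is only a crude way to resolve ambiguous splits; scoring candidate splits by word frequency is a common refinement.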

0 Answers