I'm stuck on a problem here. I'm using spaCy's word tokenizer, but I have a constraint: the tokenizer must not split words that contain apostrophes (').
Example:
Input string: "I can't do this"
Current output: ["I", "ca", "n't", "do", "this"]
Expected output: ["I", "can't", "do", "this"]
My attempt:

    doc = nlp(sent)
    # positions of tokens that contain an apostrophe (skip the first token,
    # since there is nothing before it to merge with)
    positions = [token.i for token in doc if token.i != 0 and "'" in token.text]
    with doc.retokenize() as retokenizer:
        for pos in positions:
            # merge the apostrophe token with the token before it
            retokenizer.merge(doc[pos - 1:pos + 1])
    for token in doc:
        print(token.text)
This gives me the expected output, but I don't know whether this approach is correct. Is there a better way to do retokenization?
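For comparison, one alternative I've seen suggested avoids retokenizing altogether: strip out the tokenizer's exception rules for apostrophe contractions, so words like "can't" are never split in the first place. A minimal sketch of that idea (using spacy.blank("en") here just as a stand-in for whatever pipeline is actually loaded):

```python
import spacy

# blank English pipeline: tokenizer only, no model download needed
nlp = spacy.blank("en")

# drop every tokenizer exception rule whose key contains an apostrophe,
# e.g. the rule that splits "can't" into "ca" + "n't"
nlp.tokenizer.rules = {
    key: value for key, value in nlp.tokenizer.rules.items() if "'" not in key
}

doc = nlp("I can't do this")
tokens = [token.text for token in doc]
print(tokens)  # expected: ["I", "can't", "do", "this"]
```

This changes tokenization globally for the pipeline, which may not be desirable if other components rely on the default contraction splitting; the retokenizer approach keeps the default tokenizer intact and merges after the fact.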