nltk.tokenize.sent_tokenize aggressively splits sentences at every period, but not every period marks the end of a sentence.
Here's one cooked-up sentence that is incorrectly broken into many pieces:
(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for
>>> ['(see e.g.', '[5]), real-time i.e.', 'reasoning etc.', 'should be mentioned in ABC et al.', 'for']
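For reference, the split above comes from the default tokenizer, roughly like this (assuming the punkt model has already been fetched with nltk.download('punkt')):
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize('(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for ')
['(see e.g.', '[5]), real-time i.e.', 'reasoning etc.', 'should be mentioned in ABC et al.', 'for']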
My requirement is to prevent the tokenizer from breaking at certain abbreviations like e.g., i.e., etc., and et al. Is there any way to handle this using nltk?
Update: Adding the abbreviations above to the PunktSentenceTokenizer's abbrev_types doesn't help at all; I still get the same result.
Here's the code snippet that I tried:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# register the abbreviations that should not end a sentence
punkt_param = PunktParameters()
abbreviation = ['et al.', 'i.e.', 'e.g.', 'etc.']
punkt_param.abbrev_types = set(abbreviation)

# build a tokenizer with those parameters and re-run the example
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize('(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for ')
Result:
['(see e.g.', '[5]), real-time i.e.', 'reasoning etc.', 'should be mentioned in ABC et al.', 'for']
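In case it helps with diagnosis: my reading of the Punkt source (which I may be misreading) is that abbrev_types holds each abbreviation lowercased and without the final period, and that matching happens one token at a time, so a variant worth trying would be:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
# assumption: entries go in without the trailing period; 'al' stands in for
# 'et al.' because Punkt would only ever see the single token 'al.'
punkt_param.abbrev_types = set(['i.e', 'e.g', 'etc', 'al'])
tokenizer = PunktSentenceTokenizer(punkt_param)
print(tokenizer.tokenize('(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for '))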