nltk.tokenize.sent_tokenize aggressively splits sentences at every period, but not every period marks the end of a sentence.
Here's one cooked-up sentence that is incorrectly broken into many pieces:
(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for
>>> ['(see e.g.', '[5]), real-time i.e.', 'reasoning etc.', 'should be mentioned in ABC et al.', 'for']
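For reference, the split above comes from the default tokenizer, roughly like this (assuming the punkt model has already been fetched with nltk.download('punkt')):
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize('(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for ')
['(see e.g.', '[5]), real-time i.e.', 'reasoning etc.', 'should be mentioned in ABC et al.', 'for']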
My requirement is to prevent the tokenizer from breaking at certain abbreviations like e.g., i.e., etc., and et al. Is there any way to handle this using nltk?
Update: Adding the abbreviations above to the PunktSentenceTokenizer's abbrev_types doesn't help at all; I still get the same result.
Here's the code snippet that I tried:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# register the abbreviations that should not end a sentence
punkt_param = PunktParameters()
abbreviation = ['et al.', 'i.e.', 'e.g.', 'etc.']
punkt_param.abbrev_types = set(abbreviation)

# build a tokenizer with those parameters and re-run the example
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize('(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for ')
Result:
['(see e.g.', '[5]), real-time i.e.', 'reasoning etc.', 'should be mentioned in ABC et al.', 'for']
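In case it helps with diagnosis: my reading of the Punkt source (which I may be misreading) is that abbrev_types holds each abbreviation lowercased and without the final period, and that matching happens one token at a time, so a variant worth trying would be:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
# assumption: entries go in without the trailing period; 'al' stands in for
# 'et al.' because Punkt would only ever see the single token 'al.'
punkt_param.abbrev_types = set(['i.e', 'e.g', 'etc', 'al'])
tokenizer = PunktSentenceTokenizer(punkt_param)
print(tokenizer.tokenize('(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for '))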