
nltk.tokenize.sent_tokenize splits text aggressively at every period, but not every period marks the end of a sentence.

Here's a made-up sentence that is incorrectly broken into several sentences:

(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for

>>> ['(see e.g.', '[5]), real-time i.e.', 'reasoning etc.', 'should be mentioned in ABC et al.', 'for']

My requirement is to prevent the tokenizer from breaking at certain abbreviations such as e.g., i.e., etc., and et al. Is there any way to handle this using nltk?

Update: Adding the abbreviations above to the PunktSentenceTokenizer abbreviation set doesn't help at all. I still get the same result.

Here's the code snippet that I tried:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
abbreviation = ['et al.', 'i.e.', 'e.g.', 'etc.']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize('(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for ')

Result:
['(see e.g.', '[5]), real-time i.e.', 'reasoning etc.', 'should be mentioned in ABC et al.', 'for']
kkgarg
  • See [How can I split a text into sentences?](https://stackoverflow.com/questions/4576077/). I doubt you can get good results with just nltk. You may have better luck with `spacy`: it returns `['(see e.g. [5]), real-time i.e. reasoning etc.', 'should be mentioned in ABC et al.', 'for']`, see [this answer](https://stackoverflow.com/a/66009264/3832970). – Wiktor Stribiżew Mar 15 '21 at 21:16
  • Check the Punkt tokenizer and use its abbreviation list: https://stackoverflow.com/questions/34805790/how-to-avoid-nltks-sentence-tokenizer-splitting-on-abbreviations – amiasato Mar 15 '21 at 21:19
  • @amiasato, Punkt tokenizer abbreviation doesn't work at all. It returns the same answer as mine. – kkgarg Mar 15 '21 at 21:33
  • @WiktorStribiżew, the first link suggests `nltk`, whereas `spacy` fails at `et al.`, which is one of the cases I most need handled. Can you please reopen this? – kkgarg Mar 15 '21 at 21:36
  • In order for the question to get into reopen queue, you need to update it with the code you tried. – Wiktor Stribiżew Mar 15 '21 at 21:45
  • Updated with the code snippet I tried. – kkgarg Mar 15 '21 at 21:50
  • Remove the ending dots and don't use spaces: `['al', 'i.e', 'e.g', 'etc']` works fine. – amiasato Mar 15 '21 at 21:58
  • BTW, I have one more workaround... Use regex to identify the abbreviations and then replace them using `re.sub` – kkgarg Mar 15 '21 at 22:02
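
The fix suggested in the comments (abbreviations without the trailing dot and without spaces) can be sketched as follows. This is a minimal sketch, not a definitive solution: it assumes an untrained PunktSentenceTokenizer built only from a hand-written abbreviation set, and `build_tokenizer` is a hypothetical helper name introduced here for illustration. The likely reason the original attempt failed is that Punkt strips the final period from each token and lowercases it before looking it up in `abbrev_types`, so entries such as `'e.g.'` or the two-word `'et al.'` never match.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Punkt compares each period-final token, lowercased and with its final
# period removed, against abbrev_types. Entries therefore need to be
# lowercase single tokens without the trailing dot: "et al." becomes "al".
ABBREVIATIONS = {"al", "i.e", "e.g", "etc"}


def build_tokenizer(abbreviations=ABBREVIATIONS):
    """Hypothetical helper: untrained Punkt tokenizer with custom abbreviations."""
    params = PunktParameters()
    params.abbrev_types = set(abbreviations)
    return PunktSentenceTokenizer(params)


text = "(see e.g. [5]), real-time i.e. reasoning etc. should be mentioned in ABC et al. for "
sentences = build_tokenizer().tokenize(text)
print(sentences)
```

With this abbreviation set, the example text should no longer be split after e.g., i.e., etc. or et al. Note that a tokenizer built this way knows only the abbreviations passed in, so other abbreviations in real text (titles, units, and so on) would still trigger a split unless they are added to the set.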

0 Answers