I would like to use the PunktSentenceTokenizer to split German texts into sentences. Since the pretrained model stumbles over some abbreviations (e.g. "z. B."), I would like to add those abbreviations to the tokenizer's configuration.
I cannot find a way to specify both the language (i.e. use the pretrained model) and a custom abbreviation list. Here are the two code samples, each of which works on its own but not in combination:
Default German tokenizer:
import nltk

nltk.sent_tokenize('Das ist z. B. ein Vogel.', language='german')
Custom tokenizer with abbreviation list, but without German model:
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_parameters = PunktParameters()
abbreviations = ["z. B."]
punkt_parameters.abbrev_types = set(abbreviations)
tokenizer = PunktSentenceTokenizer(punkt_parameters)
split_sentences = tokenizer.tokenize('Das ist z. B. ein Vogel.')
I cannot find any documented option to combine the two. Is there a way to achieve this, or is it impossible (e.g. because the pretrained model is immutable)?