I would like to use the PunktSentenceTokenizer to split German texts into sentences. Since the pretrained model stumbles over some abbreviations (e.g. "z. B."), I would like to add those abbreviations to the tokenizer's configuration.
I cannot find a way to specify both the language (i.e. use the pretrained model) and a custom abbreviation list. Here are the two code samples, each of which works on its own but not in combination:
Default German tokenizer:
import nltk

nltk.sent_tokenize('Das ist z. B. ein Vogel.', language='german')
Custom tokenizer with abbreviation list, but without German model:
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_parameters = PunktParameters()
abbreviations = ["z. B."]
punkt_parameters.abbrev_types = set(abbreviations)
tokenizer = PunktSentenceTokenizer(punkt_parameters)
split_sentences = tokenizer.tokenize('Das ist z. B. ein Vogel.')
I cannot find any documented option to combine the two. Is there a way to achieve this, or is it impossible (e.g. because the pretrained model is immutable)?