The problem
Sadly, you can't make the tokenizer keep the blank lines, not with the way it is written.
Starting from the Punkt source (nltk/tokenize/punkt.py) and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition
if match.group('next_tok'):
that is designed to ensure the tokenizer skips whitespace until the next possible sentence-starting token occurs. Looking for the regex this refers to, we end up at _period_context_fmt, where we see that the next_tok named group is preceded by \s+, so blank lines will never be captured.
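To see the problem in action, here is a minimal sketch (the sample string is my own, a shortened variant of the one used below) showing that the stock tokenizer silently drops the blank lines:
import nltk.tokenize.punkt as pkt

# Stock tokenizer with the default PunktLanguageVars
default_tknzr = pkt.PunktSentenceTokenizer()

s = "That was a very loud beep.\n\n Mark?\n\n Are you there?"
print(default_tknzr.tokenize(s))
# ['That was a very loud beep.', 'Mark?', 'Are you there?']
# The \n\n sequences between sentences are gone from the output.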
The solution
Break it down, change the part that you don't like, reassemble your custom solution.
Now this regex lives in the PunktLanguageVars class, which is itself used to initialize the PunktSentenceTokenizer class. So we just have to derive a custom class from PunktLanguageVars and fix the regex to our liking.
The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing _period_context_fmt, going from this:
_period_context_fmt = r"""
\S* # some word material
%(SentEndChars)s # a potential sentence ending
(?=(?P<after_tok>
%(NonWord)s # either other punctuation
|
\s+(?P<next_tok>\S+) # or whitespace and some other token
))"""
to this:
_period_context_fmt = r"""
\S* # some word material
%(SentEndChars)s # a potential sentence ending
\s* # <-- THIS is what I changed
(?=(?P<after_tok>
%(NonWord)s # either other punctuation
|
(?P<next_tok>\S+) # <-- Normally you would have \s+ here
))"""
Now a tokenizer using this regex instead of the old one will include zero or more \s characters after the end of a sentence.
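If you want to check the effect in isolation, PunktLanguageVars exposes the compiled pattern through its period_context_re() method, so you can compare matches directly (the sample string is mine):
import nltk.tokenize.punkt as pkt

# Search a string containing blank lines with the stock pattern
stock = pkt.PunktLanguageVars().period_context_re()
print(repr(stock.search("beep.\n\nMark").group()))
# 'beep.' -- the \n\n only sits inside the lookahead, so it is never consumed.
# With the CustomLanguageVars class below, the same search returns 'beep.\n\n'.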
The whole script
import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                          # <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)        # <-- Normally you would have \s+ here
        ))"""

custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

print(custom_tknzr.tokenize(s))
This outputs:
['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
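A nice side effect: since the custom tokenizer no longer skips any characters, concatenating the pieces reconstructs the input. For the example above, appended to the script, this check passes:
# No characters were dropped, so the sentences join back into the original string
assert ''.join(custom_tknzr.tokenize(s)) == s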