7

I'm using NLTK's Punkt sentence tokenizer to split a file into a list of sentences, and would like to preserve the empty lines within the file:

from nltk import data
tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print(sentences)

I would like this to print:

['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n']

But the content that's actually printed shows that the trailing empty lines have been removed from the first and third sentences:

['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n']

Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option in the case of the Punkt tokenizer. It's very possible I'm missing something simple. Is there a way to retain these trailing empty lines using the Punkt sentence tokenizer? I'd be grateful for any insights others can offer!

duhaime
  • regardless of NLTK use, you can just pre-split the text on newlines (multiple newlines) and then use NLTK on the resulting chunks – Vsevolod Dyomkin Oct 15 '15 at 07:23
  • @VsevolodDyomkin Interesting idea; in that case, how would one treat sentences that were spread over multiple lines? – duhaime Oct 16 '15 at 01:17
  • for this case it just doesn't work :( – Vsevolod Dyomkin Oct 16 '15 at 10:18
  • Do you specifically need to keep the line breaks, or are you just interested in blank lines denoting paragraph boundaries? (Because if so, there's a simpler solution). – alexis Oct 16 '15 at 21:01
  • Well, because a poem might need a variable number of linebreaks after a stanza (there could be one, two, ... n line breaks), we'd want to preserve the amount of whitespace represented in the poem. That said, I'd be curious to see what you're thinking @alexis... – duhaime Oct 16 '15 at 22:56
  • Answered. Keeping the paragraph breaks is only slightly harder than discarding them. – alexis Oct 16 '15 at 23:52

4 Answers

13

The problem

Sadly, you can't make the tokenizer keep the blank lines, not with the way it is written.

Starting from PunktSentenceTokenizer.tokenize() and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition

if match.group('next_tok'):

that is designed to make the tokenizer skip whitespace until the next possible sentence-starting token. Tracking down the regex this refers to, we end up at _period_context_fmt, where we see that the next_tok named group is preceded by \s+; that whitespace is matched inside a lookahead and never captured, so the blank lines are dropped.
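If you want to inspect that pattern at runtime, here is a quick sketch (note that _period_context_fmt and period_context_re() are internal names from NLTK's source, so they may change between versions):

import nltk.tokenize.punkt as pkt

lang_vars = pkt.PunktLanguageVars()
print(pkt.PunktLanguageVars._period_context_fmt)  # the raw format string shown below
print(lang_vars.period_context_re().pattern)      # the compiled pattern with placeholders filled in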

The solution

Break it down, change the part that you don't like, reassemble your custom solution.

Now this regex is in the PunktLanguageVars class, itself used to initialize the PunktSentenceTokenizer class. We just have to derive a custom class from PunktLanguageVars and fix the regex the way we want it to be.

The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing the _period_context_fmt, going from this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""

to this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    \s*                       #  <-- THIS is what I changed
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
    ))"""

Now a tokenizer using this regex instead of the older one will include zero or more \s characters after the end of a sentence.

The whole script

import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                       #  <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
        ))"""

custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

print(custom_tknzr.tokenize(s))

This outputs:

['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
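Note that each sentence now carries all the whitespace up to the next token, including the single space that used to separate it from the next sentence. If those trailing spaces are unwanted (an assumption about your use case), they can be trimmed without touching the newlines:

sentences = [sent.rstrip(' ') for sent in custom_tknzr.tokenize(s)]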
HugoMailhot
  • @duhaime, I changed my solution script to something non-redundant. Since all we want is to redefine the regex, there is no need to redefine the method using it as well. Cheers! – HugoMailhot Oct 16 '15 at 01:06
  • this is absolutely perfect. Your snippet taught me quite a bit about inheritance in NLTK. Thank you! – duhaime Oct 16 '15 at 01:11
  • @HugoMailhot sorry to bother you after these years but I'm still facing the same problem even after following your solution, it didn't work for me! Also here http://www.nltk.org/api/nltk.tokenize.html?highlight=split%20sentence#module-nltk.tokenize.punkt they mention that newlines are retained after tokenizing text! – SlimenTN May 11 '21 at 07:13
  • @SlimenTN Could you open a question with your specific problem, link to this question here, and explain how this solution doesn't solve your problem? It would be helpful if you could describe your case, what you expect, and what you're getting instead. Feel free to link to it in this comment thread so I can find it easily. – HugoMailhot May 11 '21 at 13:40
  • @HugoMailhot thanks for your response, it actually worked :) it was my bad I did something wrong. thanks again :) – SlimenTN May 11 '21 at 13:53
  • Glad to hear it! – HugoMailhot May 11 '21 at 15:03
1

Split the input text (s in the question) into paragraphs, splitting on a capturing regexp (which returns the captured separators as well):

paras = re.split(r"(\n\s*\n)", s)

You can then apply nltk.sent_tokenize() to the individual paragraphs, and process the results by paragraph or flatten the list, whichever best suits your further use.

sents_by_para = [ nltk.sent_tokenize(p) for p in paras ]
flat = [ sent for par in sents_by_para for sent in par ]

(It seems that sent_tokenize() doesn't mangle whitespace-only strings, so there's no need to check and exclude them from processing.)
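A quick way to check that claim (the results in the comments are what I'd expect from Punkt's span logic; worth verifying on your NLTK version):

import nltk

print(nltk.sent_tokenize('\n\n'))  # ['\n\n']: whitespace-only input comes back intact
print(nltk.sent_tokenize(''))      # []: empty strings from the split simply vanish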

If you specifically want to have the whitespace attached to the previous sentence, you can easily stick it back on:

collapsed = []
for s in flat:
    if s.isspace() and len(collapsed) > 0:
        collapsed[-1] += s
    else:
        collapsed.append(s)
alexis
1

In the end, I combined insights from both @alexis and @HugoMailhot so that I could preserve linebreaks in cases where a single paragraph has multiple sentences and/or linebreaks:

import re
import sys
import codecs

import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                       #  <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
        ))"""

custom_tokenizer = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

def sentence_split(s):
    '''Read in a string and return a list of sentences with linebreaks intact'''
    paras = re.split(r"(\n\s*\n)", s)
    sents_by_para = [custom_tokenizer.tokenize(p) for p in paras]
    flat = [sent for par in sents_by_para for sent in par]

    collapsed = []
    for s in flat:
        if s.isspace() and len(collapsed) > 0:
            collapsed[-1] += s
        else:
            collapsed.append(s)

    return collapsed

if __name__ == "__main__":
    s = codecs.open(sys.argv[1], 'r', 'utf-8').read()
    sentences = sentence_split(s)
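
For a quick test without a file argument, the function can also be called directly on the sample string from the question (hypothetical usage, in place of the __main__ block):

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
print(sentence_split(s))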
duhaime
0

I would go with itertools.groupby, see Python: How to loop through blocks of lines:

alvas@ubi:~$ echo """This is a foo bar sentence,
that is also a foo bar sentence.

But I don't like foobars.
Yes you do like bars with foos, no?


I'm not sure whether you like bar bar!
Neither do I like black sheep.""" > test.in



alvas@ubi:~$ python
>>> from nltk import sent_tokenize
>>> import itertools
>>> with open('test.in', 'r') as fin:
...     for key, group in itertools.groupby(fin, lambda x: x!='\n'):
...             if key:
...                     print(list(group))
... 
['This is a foo bar sentence,\n', 'that is also a foo bar sentence.\n']
["But I don't like foobars.\n", 'Yes you do like bars with foos, no?\n']
["I'm not sure whether you like bar bar!\n", 'Neither do I like black sheep.\n']

And after that, if you want to run sent_tokenize or other Punkt models within each group:

>>> with open('test.in', 'r') as fin:
...     for key, group in itertools.groupby(fin, lambda x: x!='\n'):
...             if key:
...                     paragraph = " ".join(line.strip() for line in group)
...                     print(sent_tokenize(paragraph))
... 
['This is a foo bar sentence, that is also a foo bar sentence.']
["But I don't like foobars.", 'Yes you do like bars with foos, no?']
["I'm not sure whether you like bar bar!", 'Neither do I like black sheep.']

(Note: a more computationally efficient method would be to use mmap, see https://stackoverflow.com/a/3915398/610569. But for the sizes I work with (~20 million tokens), itertools.groupby was sufficient.)

alvas
  • Thanks @alvas, but your sentence tokenized output doesn't appear to retain the line breaks :/ – duhaime Oct 15 '15 at 23:03
  • My solution sort of changes the breaks into groups to match for empty lines. In the end, I think `\n\n` vs `\n\n\n` would be the same, and unless it's different, retaining the breaks might not be worth the effort =) @HugoMailhot's answer hacking the Punkt tokenizer would be a better solution if `\n\n` and `[\n].*` make a difference in your text. – alvas Oct 16 '15 at 10:44
  • Thanks @alvas! I'm working with poetry and need to be concerned about displaying the poetry properly, so I need to keep track of all the `\n` in the file. Thanks again for following up on this! – duhaime Oct 16 '15 at 19:17
  • Ah, now i understand why you might need the `[\n].*`. – alvas Oct 17 '15 at 08:21