Function to return sentences under a given character count

Question

Let us suppose I have the following paragraph:

"This is the first sentence. This is the second sentence? This is the third
 sentence!"

I need to create a function that returns only as many whole sentences as fit within a given character count. If even the first sentence does not fit, it should return the first sentence truncated to that many characters.

For example:

>>> reduce_paragraph(100)
"This is the first sentence. This is the second sentence? This is the third
 sentence!"

>>> reduce_paragraph(80)
"This is the first sentence. This is the second sentence?"

>>> reduce_paragraph(50)
"This is the first sentence."

>>> reduce_paragraph(5)
"This "

I started off with something like this, but I can't seem to figure out how to finish it:

endsentence = ".?!"
sentences = itertools.groupby(text, lambda x: any(x.endswith(punct) for punct in endsentence))
for number,(truth, sentence) in enumerate(sentences):
    if truth:
        first_sentence = previous+''.join(sentence).replace('\n',' ')
    previous = ''.join(sentence)

What should happen to "Hello Mr. Smith."? Should the dot after `Mr.` be interpreted as the end of a sentence? Why not use an existing library that can parse text into sentences instead of rolling your own? — Mark Byers, Aug 19 '12 at 22:23
@MarkByers, I would be in heaven if people still realised that Mr. was an abbreviation but unfortunately I don't think it's that common any more. — Ben, Aug 19 '12 at 22:25
I would be glad to use an existing library if there is one for doing this. — David542, Aug 19 '12 at 22:26
you should look at http://nltk.org for a better way to break up sentences — John La Rooy, Aug 19 '12 at 22:26
@David542: you accepted an `nltk.tokenize` answer to a [previous question](http://stackoverflow.com/questions/9474395/how-to-break-up-a-paragraph-by-sentences-in-python): didn't it work for you? — DSM, Aug 19 '12 at 22:41
@DSM - that's what I decided to use for this one. I'll post my answer below. — David542, Aug 20 '12 at 00:01

BigHandsome · Answer 1 · 2012-08-19T22:45:50.510

6

Processing sentences is very difficult to do, due to the syntactical constructs of the English language. As someone has already pointed out, issues like abbreviation will cause unending headaches even for the best regexer.

You should consider the Natural Language Toolkit (NLTK), specifically its punkt module. It is a sentence tokenizer and will do the heavy lifting for you.
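
To illustrate the abbreviation problem, here is a minimal sketch using only the standard library. The helper name `naive_split` is made up for this example and is not part of NLTK; it splits after every ".", "?" or "!" and therefore wrongly breaks "Hello Mr. Smith." in two.

```python
import re

def naive_split(text):
    """Split after every '.', '?' or '!' (hypothetical helper, not an NLTK function)."""
    # Each match is a run of non-terminators followed by terminators,
    # or a trailing run of text that has no terminator at all.
    return [s.strip() for s in re.findall(r"[^.?!]+[.?!]+|[^.?!]+$", text)]

print(naive_split("Hello Mr. Smith. How are you?"))
# ['Hello Mr.', 'Smith.', 'How are you?']
```

A trained sentence tokenizer is meant to avoid exactly this kind of false split after "Mr.".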

edited Aug 19 '12 at 22:45

answered Aug 19 '12 at 22:40


score 2 · Accepted Answer · edited May 23 '17 at 10:34

Here's how you could use the punkt module mentioned by @BigHandsome to truncate the paragraph:

from nltk.tokenize.punkt import PunktSentenceTokenizer

def truncate_paragraph(text, maxnchars,
                       tokenize=PunktSentenceTokenizer().span_tokenize):
    """Truncate the text to at most maxnchars number of characters.

    The result contains only full sentences unless maxnchars is less
    than the first sentence length.
    """
    sentence_boundaries = tokenize(text)
    last = None
    for start_unused, end in sentence_boundaries:
        if end > maxnchars:
            break
        last = end
    return text[:last] if last is not None else text[:maxnchars]

Example

text = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
for limit in [100, 80, 50, 5]:
    print(truncate_paragraph(text, limit))

Output

This is the first sentence. This is the second sentence? This is the third
 sentence!
This is the first sentence. This is the second sentence?
This is the first sentence.
This
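
Because the truncation loop only needs `(start, end)` spans, it can be driven by any callable that produces them. The sketch below is a stand-in for environments without NLTK, under stated assumptions: `regex_span_tokenize` and `truncate_by_spans` are hypothetical names, and the regex treats every ".", "?" or "!" as a sentence end, so it has none of Punkt's abbreviation handling.

```python
import re

def regex_span_tokenize(text):
    """Yield (start, end) spans of naive sentences (hypothetical stand-in for Punkt)."""
    for match in re.finditer(r"[^.?!]+[.?!]+", text):
        yield match.start(), match.end()

def truncate_by_spans(text, maxnchars, tokenize=regex_span_tokenize):
    """Keep whole sentences within maxnchars; otherwise cut into the first sentence."""
    last = None
    for _start, end in tokenize(text):
        if end > maxnchars:
            break
        last = end  # this sentence still fits completely
    return text[:last] if last is not None else text[:maxnchars]

text = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
print(repr(truncate_by_spans(text, 80)))
# 'This is the first sentence. This is the second sentence?'
print(repr(truncate_by_spans(text, 5)))
# 'This '
```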

cmh · Answer 3 · 2012-08-19T23:40:26.007

If we ignore the natural language issues (i.e. we want an algorithm that returns complete chunks delimited by ".?!" whose total length is at most k), then the following elementary approach will work:

def sentences_upto(paragraph, k):
    sentences = []
    current_sentence = ""
    stop_chars = ".?!"
    # Only look at the first k characters; a sentence cut off by the
    # limit is never appended.
    for c in paragraph[:k]:
        current_sentence += c
        if c in stop_chars:
            sentences.append(current_sentence)
            current_sentence = ""
    return sentences

Your itertools solution can be completed like this:

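import itertools

def sentences_upto_2(paragraph, size):
    stop_chars = ".?!"
    # groupby alternates between runs of ordinary characters (key False)
    # and runs of sentence-ending punctuation (key True).
    groups = itertools.groupby(paragraph, lambda c: c in stop_chars)
    pending = ""
    for is_stop, chars in groups:
        chunk = "".join(chars)
        size -= len(chunk)
        if size < 0:
            return
        if is_stop:
            # Punctuation closes the sentence, so yield it with its text.
            yield pending + chunk
            pending = ""
        else:
            pending = chunk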
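
To see what `itertools.groupby` produces when it walks a string character by character, here is a small sketch. The sample string is made up, and the key function `c in stop_chars` is used as a shorter equivalent of the `endswith` test for single characters.

```python
import itertools

stop_chars = ".?!"
sample = "Hi there. Really? Yes!"

# The key is True for sentence-ending punctuation and False for everything else,
# so consecutive characters with the same key are collected into one group.
groups = [(is_stop, "".join(chars))
          for is_stop, chars in itertools.groupby(sample, lambda c: c in stop_chars)]
print(groups)
# [(False, 'Hi there'), (True, '.'), (False, ' Really'), (True, '?'), (False, ' Yes'), (True, '!')]
```

The punctuation arrives in its own group, separate from the sentence text, so any solution built on this grouping has to stitch the two back together.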

Running the above example into the `sentences_upto` function results in an `AttributeError`. — David542, Aug 19 '12 at 23:04

score 0 · Answer 4 · answered Aug 20 '12 at 00:02

You can break down this problem into simpler steps:

1. Given a paragraph, split it into sentences.
2. Figure out how many sentences we can join together while staying under the character limit.
3. If we can fit at least one sentence, then join those sentences together.
4. If the first sentence was too long, take the first sentence and truncate it.

Sample code (not tested):

    import re

    def reduce_paragraph(para, max_len):
        # Split into a list of sentences.
        # A sentence is a sequence of characters ending with ".", "?", or "!".
        # Splitting on this zero-width match requires Python 3.7 or later.
        sentences = re.split(r"(?<=[.?!])", para)

        # Figure out how many sentences we can have and stay within max_len
        num_sentences = 0
        total_len = 0
        for s in sentences:
            total_len += len(s)
            if total_len > max_len:
                break
            num_sentences += 1

        if num_sentences > 0:
            # We can fit at least one sentence, so return whole sentences
            return ''.join(sentences[:num_sentences])
        else:
            # Return a truncated first sentence
            return sentences[0][:max_len]
