
I have a function that should return a list of tuples containing, for each sentence, (a) the number of words and (b) the mean word length. I can get (a). For (b) I can get the total number of characters per sentence, but not the mean.

I've reviewed a few posts (this, that and another) but can't wrap my head around this last piece.

I've included a couple of failed attempts, commented out.

import statistics

def sentence_num_and_mean(text):
    """ Output list of, per sentence, number of words and mean length of words """
    # Replace ! and ? with .
    for ch in ['!', '?']:
        if ch in text:
            text = text.replace(ch, '.')

    # Number of words per sentence
    num_words_per_sent =  [len(element) for element in (element.split() for element in text.split("."))]

    # Mean length of words per sentence

    # This gets sum of characters per sentence, so on the right track
    mean_len_words_per_sent = [len(w) for w in text.split('.')]

    # This gives me "TypeError: unsupported operand type(s) for /: 'int' and 'list'" error
    # when trying to get the denominator for the mean
    # A couple efforts
    #mean_len_words_per_sent = sum(num_words_per_sent) / [len(w) for w in text.split('.')]
    #mean_len_words_per_sent = [(num_words_per_sent)/statistics.mean([len(w) for w in text.split()])]

    # Return list zipped together
    return list(zip(num_words_per_sent, mean_len_words_per_sent))

Driver program:

split_test = "First sentence ends with a period. Next one ends with a question mark? Another period. Then exclamation! Blah blah blah"
func_test = sentence_num_and_mean(split_test)
print(split_test)
print(func_test)

which prints

First sentence ends with a period. Next one ends with a question mark? Another period. Then exclamation! Blah blah blah
[(6, 33), (7, 35), (2, 15), (2, 17), (3, 15)]

For one, I need to strip out spaces and periods. Ignoring that for now, if I did the simple math right, the output should be:

[(6, 5.5), (7, 5), (2, 7.5), (2, 8.5), (3, 5)]
md2614
  • You might find it easier to attack this a bit at a time. For example, write a function that takes the raw text and passes back a list of sentences. Then write another function that takes a sentence and passes back a list of words in that sentence. Then think about functions that can take a list of words and return the word count and mean length. You should then have all the bits you need. – Matthew Strawbridge Mar 16 '20 at 19:15
  • When you say mean number of characters, do you mean including spaces, or just letters? Because your question implies only letters, but your expected output counts spaces... – Ed Ward Mar 16 '20 at 19:19
  • Well, my example includes spaces and punctuation, but ultimately I would remove them then calculate the mean. – md2614 Mar 16 '20 at 19:20
  • What should be done about "Blah blah blah" it doesn't end in any punctuation but will be a sentence since it is at the end of the split - does it matter? – DaveStSomeWhere Mar 16 '20 at 19:37
  • I have a possible answer, but you've asked a bunch of questions at once. I split on sentences, then split on words, and generate the (count, avg_len) for that sentence. Perhaps this could be refined to a question about generating the report for one sentence? – Kenny Ostrom Mar 16 '20 at 19:54
  • Right, I have a few ideas combined. The answers treated "Blah blah blah" like a sentence, which was fine for this purpose. On second thought, it could be ignored since there is no period at the end, but that's not needed at this time. – md2614 Mar 17 '20 at 00:14
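
The step-by-step decomposition suggested in the first comment could be sketched roughly as follows. The helper names (split_sentences, split_words, word_stats) are made up for illustration, and skipping empty fragments is an assumption about how a trailing terminator should be handled.

```python
def split_sentences(text):
    """Return the non-empty sentences in text, treating '.', '!' and '?' as terminators."""
    for ch in "!?":
        text = text.replace(ch, ".")
    # Assumption: fragments containing only whitespace are not sentences.
    return [s.strip() for s in text.split(".") if s.strip()]


def split_words(sentence):
    """Return the whitespace-separated words of one sentence."""
    return sentence.split()


def word_stats(words):
    """Return (word count, mean word length) for a non-empty list of words."""
    return len(words), sum(map(len, words)) / len(words)


def sentence_num_and_mean(text):
    """Combine the helpers: one (count, mean length) tuple per sentence."""
    return [word_stats(split_words(s)) for s in split_sentences(text)]


print(sentence_num_and_mean("Another period. Blah blah blah"))
```

Each helper can be tested on its own, which makes it easier to see which step produces an unexpected value.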

5 Answers


Better variable names may help you clarify how to express the ideas. text.split('.') gives you what? A list of sentences (str). If you have a single sentence in a variable called sentence then sentence.split() gives you a list of words (str). With those in mind, this is pretty easy to write.

mean_len_words_per_sent = [statistics.mean(len(word) for word in sentence.split()) for sentence in text.split('.')]
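
To make the intermediate values concrete, here is a small sketch that prints what each split returns before taking the mean. The sample text is shortened from the question and deliberately has no trailing terminator; the variable first_mean is only there for illustration.

```python
import statistics

text = "Another period. Then exclamation"

sentences = text.split('.')    # list of sentences (str)
words = sentences[0].split()   # list of words (str) in the first sentence

# Mean word length for one sentence, then for every sentence.
first_mean = statistics.mean(len(word) for word in words)
mean_len_words_per_sent = [
    statistics.mean(len(word) for word in sentence.split())
    for sentence in sentences
]

print(sentences)
print(words)
print(first_mean)
print(mean_len_words_per_sent)
```

Note that statistics.mean raises StatisticsError on empty input, so text ending in a period (which leaves an empty final fragment after the split) would need that fragment filtered out first.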

Adam Hoelscher

It's pretty clear that what you're ending up with is the correct number of words per sentence and the number of characters per sentence you expected (before removing whitespace and punctuation). So all you need is the latter divided by the former.

num_words_per_sent =  [len(element) for element in (element.split() for element in text.split("."))]

len_words_per_sent = [len(w) for w in text.split('.')]

return [(num, len_words / num) for num, len_words in zip(num_words_per_sent, len_words_per_sent)]
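
As a quick check, the lines above can be wrapped in a function and run on the sample text from the question. This is only a sketch: the function name words_and_char_ratio is hypothetical, and the "if num" filter is an added guard against dividing by zero for empty fragments, not part of the lines above.

```python
def words_and_char_ratio(text):
    """Hypothetical wrapper: (word count, characters / words) per sentence."""
    for ch in "!?":
        text = text.replace(ch, ".")
    sentences = text.split(".")
    num_words_per_sent = [len(s.split()) for s in sentences]
    # Raw length of each fragment, spaces included.
    len_words_per_sent = [len(s) for s in sentences]
    # Added guard: skip fragments with no words to avoid dividing by zero.
    return [
        (num, len_words / num)
        for num, len_words in zip(num_words_per_sent, len_words_per_sent)
        if num
    ]


split_test = "First sentence ends with a period. Next one ends with a question mark? Another period. Then exclamation! Blah blah blah"
print(words_and_char_ratio(split_test))
```

On that sample the ratios come out as 5.5, 5.0, 7.5, 8.5 and 5.0, matching the figures in the question, since the character counts still include spaces.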

  • That's not actually what they're getting. They have the number of characters per sentence, including whitespace. – Adam Hoelscher Mar 16 '20 at 19:40
  • Of course. I phrased that badly. I just meant the number of characters they were expecting. I thought the question was only how to get the answer they were expecting given the code they had, not to also solve the separate problem of removing the whitespace, which they might have wanted to give a try themself after they knew how to get the mean. Will edit my answer so that no one is confused. Thanks. – Adam Chambers Mar 16 '20 at 20:09

The list mean_len_words_per_sent should probably be num_characters_per_sent as it is currently used.

You can then iterate through the two lists you created and divide the number of characters per sentence by the number of words in that sentence.

mean_len_words_per_sent = [num_chars / num_word for num_chars, num_word in zip(num_characters_per_sent, num_words_per_sent)]
Querenker

If you only want letters, this should work:

def sentence_num_and_mean(text):
    # Replace ! and ? with .
    for ch in ['!', '?']:
        if ch in text:
            text = text.replace(ch, '.')

    output = []
    sentences = text.split(".")
    for sentence in sentences:
        words = [x for x in sentence.split(" ") if x]
        word_count = len(words)
        word_length = sum(map(len, words))
        word_mean = word_length / word_count
        output.append((word_count, word_mean))

    return output


split_test = "First sentence ends with a period. Next one ends with a question mark? Another period. Then exclamation! Blah blah blah"
func_test = sentence_num_and_mean(split_test)
print(split_test)
print(func_test)

Output:

First sentence ends with a period. Next one ends with a question mark? Another period. Then exclamation! Blah blah blah
[(6, 4.666666666666667), (7, 4.0), (2, 6.5), (2, 7.5), (3, 4.0)]
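
One edge case worth noting: if the text ends with a terminator, text.split(".") leaves an empty final fragment, so word_count is 0 and the division fails. A minimal sketch of one way to guard against that, assuming empty fragments should simply be skipped:

```python
def sentence_num_and_mean(text):
    """Return (word count, mean word length) per sentence, skipping empty fragments."""
    # Replace ! and ? with .
    for ch in "!?":
        text = text.replace(ch, ".")

    output = []
    for sentence in text.split("."):
        words = sentence.split()
        if not words:
            # Empty fragment, e.g. the one after a final period.
            continue
        output.append((len(words), sum(map(len, words)) / len(words)))
    return output


print(sentence_num_and_mean("Another period. Then exclamation!"))
```

Whether skipping is the right behaviour depends on the use case; returning (0, 0.0) for such fragments would be another option.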
Ed Ward

You can use statistics.mean for computing the average word length. Here you can use map(len, sentence.split()) in order to compute the length of each word.

import statistics

def sentence_num_and_mean(text):
    punctuation = '?!'
    text = text.translate(str.maketrans(dict.fromkeys(punctuation, '.')))
    sentences = text.split('.')
    num_words_per_sent = [len(s.strip().split()) for s in sentences]
    mean_len_words_per_sent = [statistics.mean(map(len, s.strip().split())) for s in sentences]
    return list(zip(num_words_per_sent, mean_len_words_per_sent))
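
The translate step may be less familiar than a chain of replace calls, so here is a small sketch of what it does in isolation. The sample string and the names table and converted are made up for illustration.

```python
punctuation = '?!'

# dict.fromkeys maps every character in punctuation to '.';
# str.maketrans turns that dict into a translation table keyed by code point.
table = str.maketrans(dict.fromkeys(punctuation, '.'))

converted = "Question mark? Then exclamation! Blah".translate(table)
print(converted)
```

All listed characters are replaced in a single pass over the string, rather than one pass per character.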
a_guest