0

I am trying to write a function that counts the number of words and the mean word length in one or more sentences. I can't work out how to split the string into a list of sentences, assuming each sentence ends with a period.

  • Question marks and exclamation marks should be replaced by periods so that they are also recognized as sentence boundaries.
  • For example: "Haven't you eaten 8 oranges today? I don't know if you did." would be: ["Haven't you eaten 8 oranges today", "I don't know if you did"]
  • The mean word length for this example would be 44/12 ≈ 3.67
def word_length_list(text):
    text = text.replace('--',' ')

    for p in string.punctuation + "‘’”“":
        text = text.replace(p,'')

    text = text.lower()
    words = text.split(".")
    word_length = []
    print(words)

    for i in words:
        count = 0
        for j in i:
            count = count + 1
        word_length.append(count)
    
    return(word_length)

testing1 = word_length_list("Haven't you eaten 8 oranges today? I don't know if you did.")
print(sum(testing1)/len(testing1))

smci
  • You're supposed to replace question marks with a period, not nothing! `text = text.replace(p,'')` is wrong – smci Oct 23 '20 at 02:46
  • Does this answer your question? [How to split a string into an array of characters in Python?](https://stackoverflow.com/questions/4978787/how-to-split-a-string-into-an-array-of-characters-in-python) – whoami Oct 23 '20 at 02:47

2 Answers

1

One option is to use re.split:

import re

inp = "Haven't you eaten 8 oranges today? I don't know if you did."
# Split on whitespace that immediately follows ?, . or !
sentences = re.split(r'(?<=[?.!])\s+', inp)
print(sentences)

This prints:

["Haven't you eaten 8 oranges today?", "I don't know if you did."]

We could also use re.findall:

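inp = "Haven't you eaten 8 oranges today? I don't know if you did."
# .*? leaves the space after "?" at the start of the next match, so strip each piece
sentences = [s.strip() for s in re.findall(r'.*?[?!.]', inp)]
print(sentences)  # prints same as above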

Note that in both cases we are assuming that a period (.) only appears as a full stop, and not as part of an abbreviation. If a period can have multiple contexts, then it could be tricky to tease apart sentences. For example:

Jon L. Skeet earned more points than anyone.  Gordon Linoff also earned a lot of points.

It is not clear here whether a period marks the end of a sentence or is part of an abbreviation.
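To make the ambiguity concrete, here is a minimal sketch that applies the same re.split pattern to that text. The input string is just the example above, and the split after "L." is the unwanted one:

```python
import re

text = "Jon L. Skeet earned more points than anyone.  Gordon Linoff also earned a lot of points."

# Same lookbehind pattern as above: split on whitespace that follows ., ? or !
sentences = re.split(r'(?<=[?.!])\s+', text)
print(sentences)
# ['Jon L.', 'Skeet earned more points than anyone.', 'Gordon Linoff also earned a lot of points.']
```

The pattern has no way to tell the initial "L." from a sentence end, so three pieces come back instead of two.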

Tim Biegeleisen
  • How would you add this to the code I presented? I am getting an error when I try to use it. thanks! Also the period represents the end of a sentence – Ahmad Latif Oct 23 '20 at 02:50
  • Under `def word_length_list(text):` just replace `inp` in my code snippet with `text` and it should work. – Tim Biegeleisen Oct 23 '20 at 02:51
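For reference, one way the pieces might fit together is sketched below. This is not the asker's original function: `split_sentences` and `word_lengths` are hypothetical helper names, and the sketch assumes that a period, question mark, or exclamation mark always ends a sentence. The key difference from the code in the question is that the text is split into sentences before the punctuation is removed.

```python
import re
import string

def split_sentences(text):
    """Split on whitespace that follows ., ? or ! and drop the trailing mark."""
    parts = re.split(r'(?<=[?.!])\s+', text.strip())
    return [p.rstrip('.?!') for p in parts if p]

def word_lengths(text):
    """Return the length of every word, ignoring punctuation inside words."""
    # Translation table that deletes ASCII punctuation and curly quotes
    strip_table = str.maketrans('', '', string.punctuation + "‘’”“")
    lengths = []
    for sentence in split_sentences(text):
        for word in sentence.replace('--', ' ').split():
            cleaned = word.translate(strip_table)
            if cleaned:  # skip tokens that were punctuation only
                lengths.append(len(cleaned))
    return lengths

text = "Haven't you eaten 8 oranges today? I don't know if you did."
print(split_sentences(text))
# ["Haven't you eaten 8 oranges today", "I don't know if you did"]

lengths = word_lengths(text)
print(len(lengths), round(sum(lengths) / len(lengths), 2))
# 12 3.67
```

Splitting first and stripping punctuation afterwards keeps the sentence boundaries available for the split, which is the step the original function lost.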
0

An example of splitting with a regex:

import re
s = "Hello! How are you?"
# Raw string with a plain character class: ., ? and ! need no escaping inside [...]
print([x for x in re.split(r"[.?!]+", s.strip()) if x != ''])
CryptoFool
Wasif
  • You'll get a slightly better result if you move the `.strip()` to the `re.split()`. - I assume you want to skip both truly blank lines and also lines that contain only whitespace. – CryptoFool Oct 23 '20 at 02:52