0

I'm looking to count the number of words per sentence, calculate the mean words per sentence, and put that info into a CSV file. Here's what I have so far. I probably just need to know how to count the number of words before a period. I might be able to figure it out from there.

#Read the data in the text file as a string
with open("PrideAndPrejudice.txt") as pride_file:
    pnp = pride_file.read()

#Change '!' and '?' to '.'
for ch in ['!','?']:
    if ch in pnp:
        pnp = pnp.replace(ch,".")

#Remove period after Dr., Mr., Mrs. (choosing not to include etc. as that often ends a sentence, although it can also be in the middle)
pnp = pnp.replace("Dr.","Dr")
pnp = pnp.replace("Mr.","Mr")
pnp = pnp.replace("Mrs.","Mrs")
  • Have you looked at this question? http://stackoverflow.com/questions/19410018/how-to-count-the-number-of-words-in-a-sentence – Nikolai Oct 18 '16 at 15:37
  • Once you've got a list of sentences this is trivial, but identifying sentences is not quite trivial as you can't just split on `.`: http://stackoverflow.com/questions/4576077/python-split-text-on-sentences – Chris_Rands Oct 18 '16 at 15:41

3 Answers

1

To split a string into a list of strings on some character:

pnp = pnp.split('.')

Then we can split each of those sentences into a list of strings (words)

pnp = [sentence.split() for sentence in pnp]

Then we get the number of words in each sentence

pnp = [len(sentence) for sentence in pnp]

Then we can use statistics.mean to calculate the mean:

statistics.mean(pnp)

To use statistics you must put import statistics at the top of your file. If you don't recognize the ways I'm reassigning pnp, look up list comprehensions.
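Putting the steps above together, a minimal sketch might look like the following. The sample text and the names `sentences`, `word_counts` and `mean_words` are placeholders, not from the original post. The `if s.strip()` filter is an added assumption: `split('.')` leaves an empty string after the final period, which would otherwise count as a zero-word sentence and pull the mean down.

```python
import statistics

# Hypothetical sample text standing in for the cleaned-up pnp string
text = "It is a truth universally acknowledged. Mr Bennet made no answer. Do you not want to know."

# Split on periods; drop empty or whitespace-only pieces (e.g. after the last '.')
sentences = [s for s in text.split('.') if s.strip()]

# Count whitespace-separated words in each sentence
word_counts = [len(s.split()) for s in sentences]

# Mean number of words per sentence
mean_words = statistics.mean(word_counts)
print(word_counts, mean_words)
```

Using separate names for each step, rather than reassigning `pnp`, keeps the original text available if a later step needs it.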

Patrick Haugh
0

You might be interested in the split() function for strings. It seems like you're editing your text to make sure all sentences end in a period and every period ends a sentence.

Thus,

pnp.split('.')

is going to give you a list of all sentences. Once you have that list, for each sentence in the list,

sentence.split() # i.e., split according to whitespace by default

will give you a list of words in the sentence.

Is that enough of a start?
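To see what the two `split` calls return, here is a small sketch on a made-up two-sentence string (the sample text and variable names are illustrative only). Note the empty string at the end of the first result, which you would probably want to skip before counting.

```python
# Hypothetical two-sentence sample, not taken from the novel
sample = "She is tolerable. He danced only once."

# Splitting on '.' yields one piece per sentence plus a trailing ''
pieces = sample.split('.')
print(pieces)

# With no argument, split() breaks on any run of whitespace
words = pieces[0].split()
print(words, len(words))
```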

Jared N
0

You can try the code below.

numbers_per_sentence = [len(sentence.split()) for sentence in pnp.split(".")]
mean = sum(numbers_per_sentence) / len(numbers_per_sentence)

However, for real natural language processing I would probably recommend a more robust solution such as NLTK. The text manipulation you perform (replacing "?" and "!", removing periods after "Dr.", "Mr." and "Mrs.") is probably not enough to be 100% sure that a period is always a sentence separator (and that there are no other sentence separators in your text), even if it happens to be true for Pride and Prejudice.
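As a standard-library middle ground (this is not NLTK, and not what the one-liner above does), a regular expression can split after `.`, `!` or `?` when followed by whitespace, so the punctuation does not have to be rewritten first. This is only a rough sketch: the pattern, the sample text and the file name `sentence_lengths.csv` mentioned in the comment are all assumptions, and abbreviations such as "Mr." would still trigger a split unless handled separately, as the question does. It also writes the per-sentence counts and the mean out as CSV, which is what the question ultimately asks for.

```python
import csv
import io
import re
import statistics

# Hypothetical sample text, not a verbatim passage
text = "Is he married or single? Oh! Single, my dear, to be sure. A single man of large fortune."

# Split after ., ! or ? when followed by whitespace (a heuristic, not a full tokenizer)
sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s]
word_counts = [len(s.split()) for s in sentences]
mean_words = statistics.mean(word_counts)

# Write to an in-memory buffer here; for a real file, use
# open("sentence_lengths.csv", "w", newline="") instead (file name is an assumption)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sentence_number", "word_count"])
for number, count in enumerate(word_counts, start=1):
    writer.writerow([number, count])
writer.writerow(["mean", mean_words])
print(buffer.getvalue())
```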

Dawid