Python Create List of Words Per Sentence and Calculate Mean and Place in CSV File

Question

I'm looking to count the number of words per sentence, calculate the mean words per sentence, and put that info into a CSV file. Here's what I have so far. I probably just need to know how to count the number of words before a period. I might be able to figure it out from there.

#Read the data in the text file as a string
with open("PrideAndPrejudice.txt") as pride_file:
    pnp = pride_file.read()

#Change '!' and '?' to '.'
for ch in ['!','?']:
    if ch in pnp:
        pnp = pnp.replace(ch,".")

#Remove period after Dr., Mr., Mrs. (choosing not to include etc. as that often ends a sentence, although it can also be in the middle)
pnp = pnp.replace("Dr.","Dr")
pnp = pnp.replace("Mr.","Mr")
pnp = pnp.replace("Mrs.","Mrs")

Have you looked at this question? http://stackoverflow.com/questions/19410018/how-to-count-the-number-of-words-in-a-sentence — Nikolai, Oct 18 '16 at 15:37
Once you've got a list of sentences this is trivial, but identifying sentences is not quite trivial as you can't just split on `.`: http://stackoverflow.com/questions/4576077/python-split-text-on-sentences — Chris_Rands, Oct 18 '16 at 15:41

score 1 · Accepted Answer · answered Oct 18 '16 at 15:40

To split a string into a list of strings on some character:

pnp = pnp.split('.')

Then we can split each of those sentences into a list of strings (words):

pnp = [sentence.split() for sentence in pnp]

Then we get the number of words in each sentence:

pnp = [len(sentence) for sentence in pnp]

Then we can use statistics.mean to calculate the mean:

statistics.mean(pnp)

To use statistics you must put import statistics at the top of your file. If you don't recognize the ways I'm reassigning pnp, look up list comprehensions.
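One detail worth checking: pnp.split('.') returns an empty string after the final period (and between consecutive periods such as "..."), and each of those becomes a zero-word "sentence" that pulls the mean down. A minimal sketch of the same steps with those fragments skipped follows; the function name words_per_sentence and the sample text are made up for illustration.

```python
import statistics


def words_per_sentence(text):
    """Return the word count of each non-empty sentence in text."""
    counts = []
    for sentence in text.split("."):
        words = sentence.split()
        # split(".") yields an empty string after the final period;
        # skip fragments with no words so they do not drag the mean down.
        if words:
            counts.append(len(words))
    return counts


sample = "It is a truth universally acknowledged. Mr Bennet made no answer."
counts = words_per_sentence(sample)
print(counts)                   # [6, 5]
print(statistics.mean(counts))  # 5.5
```

Note that statistics.mean raises StatisticsError on an empty list, so a text with no sentences needs its own check.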

score 0 · Answer 2 · answered Oct 18 '16 at 15:39

You might be interested in the split() function for strings. It seems like you're editing your text to make sure all sentences end in a period and every period ends a sentence.

Thus,

pnp.split('.')

is going to give you a list of all sentences. Once you have that list, for each sentence in the list,

sentence.split() # i.e., split according to whitespace by default

will give you a list of words in the sentence.

Is that enough of a start?
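The question also asks for the result to go into a CSV file. Once there is a list of word counts, one possible layout is a row per sentence followed by a row for the mean. This is only a sketch: the function name write_sentence_stats, the column names, the output filename, and the layout itself are assumptions, not something the question specifies.

```python
import csv
import statistics


def write_sentence_stats(counts, path):
    """Write one row per sentence, then a final row holding the mean."""
    # newline="" is what the csv module expects when opening files for writing.
    with open(path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["sentence", "words"])
        for number, count in enumerate(counts, start=1):
            writer.writerow([number, count])
        writer.writerow(["mean", statistics.mean(counts)])


# Hypothetical word counts and output filename, for illustration only.
write_sentence_stats([6, 5, 10], "sentence_stats.csv")
```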

score 0 · Answer 3 · answered Oct 18 '16 at 16:01

You can try the code below.

numbers_per_sentence = [len(sentence.split()) for sentence in pnp.split(".")]
mean = sum(numbers_per_sentence)/len(numbers_per_sentence)

However, for real natural language processing I would probably recommend a more robust solution such as NLTK. The text manipulation you perform (replacing "?" and "!", removing periods after "Dr.", "Mr." and "Mrs.") is probably not enough to be 100% sure that a period is always a sentence separator (and that there are no other sentence separators in your text, even if it happens to be true for Pride and Prejudice).
