In Python, I am trying to read in a file of text and search through it character by character. When I find a capital letter, I want to count the characters that follow until I reach a '?', '!', or '.'. Basically, I am reading in large text files and trying to count the sentences and their total characters so I can compute the average sentence length. (I know there will be some bugs with things such as "Mr." or "E.g.", but I can live with them. The data set is so large that the error will be negligible.)

import sys

for line in sys.stdin:
  for char in line:
    if char.isupper():
      # read each char until you see a ?, !, or . and keep track
      # of the number of characters in the sentence.
      pass

2 Answers

You might want to use the nltk module for tokenizing sentences instead of reinventing the wheel. It handles many corner cases, such as parentheses and other unusual sentence structures.

It has a sentence tokenizer, nltk.sent_tokenize. Note that you'll have to download the English model with nltk.download() before using it.

Here is how you would solve your problem using nltk:

import sys
import nltk

sentences = nltk.sent_tokenize(sys.stdin.read())

print(sum(len(s) for s in sentences) / float(len(sentences)))
Donald Miner

This solution works if you want to go line by line from stdin like your current code. It counts across line breaks using a two-state machine: at any point you are either inside a sentence or not.

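import sys

in_a_sentence = False  # state: are we currently inside a sentence?
count = 0              # characters seen so far in the current sentence
lengths = []

for line in sys.stdin:
    for char in line:
        if char.isupper():
            # A capital opens a sentence (no effect if one is already open).
            in_a_sentence = True
        elif in_a_sentence and char in '.?!':
            # A terminator closes the sentence; +1 counts the terminator itself.
            # Stray punctuation outside a sentence is ignored.
            lengths.append(count + 1)
            in_a_sentence = False
            count = 0

        if in_a_sentence:
            count += 1

print(lengths)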

output:

mbp:scratch geo$ python ./count.py
This is a test of the counter. This test includes
line breaks. See? Pretty awesome,
huh!
^D[30, 31, 4, 20]
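
Since the question mentions very large files, here is a rough sketch of the same state machine that keeps only running totals instead of a list of every length. The function name running_average and the sample input are made up for illustration; the sentence rule is the same approximate one as above (capital letter to the next '.', '?', or '!').

```python
def running_average(lines):
    """Average sentence length over an iterable of lines, storing only totals."""
    in_a_sentence = False
    count = 0        # characters in the current sentence so far
    total = 0        # characters across all completed sentences
    sentences = 0    # number of completed sentences
    for line in lines:
        for char in line:
            if not in_a_sentence and char.isupper():
                in_a_sentence = True
            if in_a_sentence:
                count += 1
                if char in '.?!':
                    # Terminator closes the sentence; it is already counted.
                    total += count
                    sentences += 1
                    in_a_sentence = False
                    count = 0
    # Avoid dividing by zero when no complete sentence was seen.
    return total / float(sentences) if sentences else 0.0

sample = ["This is a test of the counter. This test includes\n",
          "line breaks. See? Pretty awesome,\n",
          "huh!\n"]
print(running_average(sample))
```

Passing sys.stdin instead of the sample list should work the same way, since a file object also yields lines one at a time.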

But if you can read the whole input into one variable at once, you could do something more like:

import re
import sys

data = sys.stdin.read()
# A capital letter, then anything up to and including the next '.', '?', or '!'.
lengths = [len(x) for x in re.findall(r'[A-Z][^.?!]*[.?!]', data)]

print(lengths)

That'll give you the same results.
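
To get from the lengths list to the average the question asks for, something along these lines should do. This is only a sketch: the helper name average_sentence_length and the sample text are made up here, and returning 0.0 when no sentence is found is an assumption about what you'd want in that case.

```python
import re

def average_sentence_length(text):
    """Return the mean length of the rough 'sentences' in text, or 0.0 if none."""
    lengths = [len(s) for s in re.findall(r'[A-Z][^.?!]*[.?!]', text)]
    if not lengths:
        return 0.0  # assumption: report 0.0 rather than raising on empty input
    return sum(lengths) / float(len(lengths))

sample = ("This is a test of the counter. This test includes\n"
          "line breaks. See? Pretty awesome,\n"
          "huh!")
print(average_sentence_length(sample))
```

For the sample above the lengths are [30, 31, 4, 20], so it prints 21.25.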

geoelectric