You can strip the punctuation from each word and avoid reading the whole file into memory by iterating over the file object line by line:
import string
punc = string.punctuation
# fname is an open file object; iterating over it yields one line at a time
return ' '.join(word.strip(punc) for line in fname for word in line.split())
If you also want to remove the ' from Nature's, you will need translate, because strip only removes characters from the ends of a word:
from string import punctuation
# use ord of characters you want to replace as keys and what you want to replace them with as values
tbl = {ord(k):"" for k in punctuation}
return ' '.join(line.translate(tbl) for line in fname)
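To make the difference between the two approaches concrete, here is a minimal sketch comparing them on a single word. The sample word is made up for illustration, and the table is built with str.maketrans, which is an alternative to writing the ord dictionary by hand rather than something the snippets above rely on:

```python
from string import punctuation

# str.maketrans("", "", chars) builds a table that deletes every character in chars
tbl = str.maketrans("", "", punctuation)

word = "Nature's,"
stripped = word.strip(punctuation)   # only leading/trailing punctuation is removed
translated = word.translate(tbl)     # punctuation is removed everywhere in the word

print(stripped)     # Nature's
print(translated)   # Natures
```

So strip counts Nature's as an 8-character word, while translate counts it as 7.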
To get the frequency, use a Counter dict:
from collections import Counter
freq = Counter(len(word.translate(tbl)) for line in fname for word in line.split())
Or, if you only want to strip punctuation from the ends of each word:
freq = Counter(len(word.strip(punc)) for line in fname for word in line.split())
Using the lines in your question above as an example:
lines = """When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires
that they should declare the causes which impel them to the separation."""
from collections import Counter
freq = Counter(len(word.strip(punctuation)) for line in lines.splitlines() for word in line.split())
print(freq.most_common())
This outputs (length, frequency) tuples, ordered from the most common word length down to the least common. The first element of each tuple is the word length and the second is how many times it occurred:
[(3, 15), (2, 12), (4, 9), (5, 9), (6, 9), (7, 7), (8, 5), (9, 3), (1, 1), (10, 1)]
If you want to output the frequencies in order of word length, starting from one-letter words, without sorting:
mx = max(freq)  # the largest key, i.e. the longest word length seen
for i in range(1, mx + 1):
    v = freq[i]
    if v:
        print("length {} words appeared {} time/s.".format(i, v))
Output:
length 1 words appeared 1 time/s.
length 2 words appeared 12 time/s.
length 3 words appeared 15 time/s.
length 4 words appeared 9 time/s.
length 5 words appeared 9 time/s.
length 6 words appeared 9 time/s.
length 7 words appeared 7 time/s.
length 8 words appeared 5 time/s.
length 9 words appeared 3 time/s.
length 10 words appeared 1 time/s.
Unlike a normal dict, a Counter does not raise a KeyError for a missing key; it returns 0. So if v will only be True for word lengths that actually appeared in the file.
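A quick sketch of that behaviour, using a made-up list of words rather than the file from the question:

```python
from collections import Counter

freq = Counter(len(w) for w in ["a", "to", "of", "the"])
plain = dict(freq)  # an ordinary dict with the same contents

missing_from_counter = freq[7]  # no word of length 7, so the Counter gives 0

try:
    plain[7]
    raised = False
except KeyError:
    raised = True  # the plain dict raises instead

# Looking up a missing key does not add it to the Counter
still_absent = 7 not in freq

print(missing_from_counter, raised, still_absent)
```

Because the lookup does not insert the key, looping over every length from 1 up to the maximum leaves the Counter unchanged.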
If you also want to print the cleaned data, putting all the logic in functions:
import string
import sys
from collections import Counter

def clean_text(fname):
    punc = string.punctuation
    return [word.strip(punc) for line in fname for word in line.split()]

def get_freq(cleaned):
    return Counter(len(word) for word in cleaned)

def freq_output(d):
    mx = max(d, default=0)  # the largest key, i.e. the longest word length seen
    for i in range(1, mx + 1):
        v = d[i]
        if v:
            print("length {} words appeared {} time/s.".format(i, v))

try:
    path = sys.argv[1]
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

# Open the file once and close it as soon as the words have been collected
with open(path, 'r') as fname:
    formatted_text = clean_text(fname)
print(" ".join(formatted_text))
print()
freq = get_freq(formatted_text)
freq_output(freq)
Run on the snippet from your question, this outputs:
~$ python test.py test.txt
When in the Course of human events it becomes necessary for one people
to dissolve the political bands which have connected them with another
and to assume among the powers of the earth the separate and equal station
to which the Laws of Nature and of Nature's God entitle them a decent
respect to the opinions of mankind requires that they should declare
the causes which impel them to the separation
length 1 words appeared 1 time/s.
length 2 words appeared 12 time/s.
length 3 words appeared 15 time/s.
length 4 words appeared 9 time/s.
length 5 words appeared 9 time/s.
length 6 words appeared 9 time/s.
length 7 words appeared 7 time/s.
length 8 words appeared 5 time/s.
length 9 words appeared 3 time/s.
length 10 words appeared 1 time/s.
If you only care about the frequency output, do it all in one pass:
import sys
from collections import Counter
from string import punctuation

def freq_output(fname):
    tbl = {ord(k): "" for k in punctuation}
    # Keep only one of the two assignments to d: a file object can be iterated just once
    d = Counter(len(word.strip(punctuation)) for line in fname for word in line.split())
    # d = Counter(len(word.translate(tbl)) for line in fname for word in line.split())
    mx = max(d, default=0)  # the largest key, i.e. the longest word length seen
    for i in range(1, mx + 1):
        v = d[i]
        if v:
            print("length {} words appeared {} time/s.".format(i, v))

try:
    path = sys.argv[1]
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

with open(path, 'r') as fname:
    freq_output(fname)
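Keep only one of the two assignments to d, whichever approach is correct for your data. Running both on the same file object would leave the second Counter empty, because the first pass has already consumed the file.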
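As a variation, here is a sketch that returns the frequencies instead of printing them, which makes the logic easier to reuse and test. The function name length_frequencies and the sample lines are hypothetical, and it sorts the Counter's items rather than looping over a range, which is a different technique from the loop above. It also skips tokens that consist only of punctuation:

```python
from collections import Counter
from string import punctuation


def length_frequencies(lines):
    """Return (length, count) pairs sorted by word length.

    lines can be any iterable of strings, including an open file object.
    """
    cleaned = (word.strip(punctuation) for line in lines for word in line.split())
    # "if word" drops tokens that were nothing but punctuation, such as "--"
    freq = Counter(len(word) for word in cleaned if word)
    return sorted(freq.items())


sample = [
    "When in the Course of human events,",
    "it becomes necessary -- for one people",
]
result = length_frequencies(sample)
for length, count in result:
    print("length {} words appeared {} time/s.".format(length, count))
```

With a file, the call would be length_frequencies(f) inside a with open(...) as f block, so the file is still read one line at a time.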