I have an input file containing one of Shakespeare's sonnets (sonnet.txt). I need to write a short program that counts the number of unique words in the sonnet. The code has to remove punctuation and ignore case.
Contents of sonnet.txt:
How heavy do I journey on the way,
When what I seek, my weary travel's end,
Doth teach that ease and that repose to say,
Thus far the miles are measured from thy friend!
The beast that bears me, tired with my woe,
Plods dully on, to bear that weight in me,
As if by some instinct the wretch did know
His rider loved not speed being made from thee.
The bloody spur cannot provoke him on,
That sometimes anger thrusts into his hide,
Which heavily he answers with a groan,
More sharp to me than spurring to his side;
For that same groan doth put this in my mind,
My grief lies onward, and my joy behind.
I am using the set() function and storing the results in a variable unique_words. The end goal would be to count the length of that set by using len(unique_words).
However, my code drops every word that is followed by a punctuation mark (e.g. ',', ';', '!'). I have tried using the filter function to remove non-alphabetic characters, but I'm still losing the words that end in punctuation.
Is there a different string method I can combine with filter() to get the desired output?
Thank you in advance for your help.
unique_words = set()
sonnet = open("sonnet.txt", "r")
for line in sonnet:
    line = [word.lower() for word in line.split()]
    line = [word for word in filter(str.isalpha, line)]
    unique_words.update(line)
sonnet.close()
print("{} unique words".format(len(unique_words)))
The result of the first comprehension is
['how', 'heavy', 'do', 'i', 'journey', 'on', 'the', 'way,']
But after the second comprehension, this is the output I get:
['how', 'heavy', 'do', 'i', 'journey', 'on', 'the']
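The two outputs above suggest what is going on: str.isalpha() returns True only when every character in a string is a letter, so filtering the list of words throws away 'way,' as a whole instead of removing just the comma. One possible fix, as a minimal sketch, is to apply filter() to the characters of each word and join the survivors back together. The helper name unique_words_in and the two-line sample are assumptions made for illustration, not part of the original code.

```python
def unique_words_in(lines):
    """Return the set of distinct words, ignoring case and punctuation.

    Hypothetical helper for illustration; `lines` is any iterable of strings,
    such as an open file object.
    """
    unique_words = set()
    for line in lines:
        for word in line.lower().split():
            # filter() iterates over the characters of the word here,
            # so only the non-alphabetic characters are dropped.
            cleaned = "".join(filter(str.isalpha, word))
            if cleaned:  # skip tokens that contained no letters at all
                unique_words.add(cleaned)
    return unique_words


# First two lines of the sonnet, used as sample input.
sample_lines = [
    "How heavy do I journey on the way,",
    "When what I seek, my weary travel's end,",
]
words = unique_words_in(sample_lines)
print("{} unique words".format(len(words)))
```

To run this against the file, the same function could be called inside a `with open("sonnet.txt") as sonnet:` block, passing `sonnet` as the argument. Note that this approach also removes apostrophes, so "travel's" is counted as "travels"; if that matters, word.strip(string.punctuation) would trim only leading and trailing punctuation.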