Count unique words in txt file

Question

I have an input file with one of Shakespeare's sonnets (sonnet.txt). I need to write short code to count the number of unique words in the sonnet. My code has to remove punctuation and ignore lower / upper case.

Contents of sonnet.txt

How heavy do I journey on the way,
When what I seek, my weary travel's end,
Doth teach that ease and that repose to say,
Thus far the miles are measured from thy friend!
The beast that bears me, tired with my woe,
Plods dully on, to bear that weight in me,
As if by some instinct the wretch did know
His rider loved not speed being made from thee.
The bloody spur cannot provoke him on,
That sometimes anger thrusts into his hide,
Which heavily he answers with a groan,
More sharp to me than spurring to his side;
For that same groan doth put this in my mind,
My grief lies onward, and my joy behind.

I am using the set() function and storing the results in a variable unique_words. The end goal would be to count the length of that set by using len(unique_words).

However, my code is removing words followed by a punctuation mark (e.g. ',' ';' '!'). I have tried to use the filter function to remove non-alphabetic characters, but I'm still losing words followed by punctuation marks.

Is there a different string method I can combine with filter() to get the desired output?

Thank you in advance for your help.

unique_words = set()

sonnet = open("sonnet.txt", "r")

for line in sonnet:
    line = [word.lower() for word in line.split()]
    line = [word for word in filter(str.isalpha, line)]
    unique_words.update(line)

sonnet.close()

print("{} unique words".format(len(unique_words)))

The result of the first comprehension is

['how', 'heavy', 'do', 'i', 'journey', 'on', 'the', 'way,']

But when I iterate the second time this is the output I get:

['how', 'heavy', 'do', 'i', 'journey', 'on', 'the']

Your code does exactly what it says on the tin: you're using `filter`, which ... well, filters the result to exclude elements that are not `.isalpha`. So, it filters out everything including spaces - the result is a set of characters (not actually what you said it was, not sure how you got those results). — Grismar, Sep 11 '19 at 03:10
Try using the replace method on the line of text to replace apostrophes, periods, etc. with no space (eg. ""). Then you lowercase all string characters and get the words into your list. — jun, Sep 11 '19 at 03:11
Hi @jun thank you for the suggestion! I used replace with all the characters I wanted to get rid of and it worked :) — bravocharliemike, Sep 11 '19 at 04:40
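
The replace approach suggested in the comments above could look roughly like the sketch below. The `PUNCTUATION` string and the `count_unique_words` helper are illustrative names, not part of the original question, and the list of characters is an assumption to extend as needed.

```python
# Hand-picked punctuation marks to strip (an assumed list, not exhaustive)
PUNCTUATION = ",.;:!?'"

def count_unique_words(text):
    """Return the number of distinct words, ignoring case and punctuation."""
    for mark in PUNCTUATION:
        # Replace each mark with the empty string so "way," becomes "way"
        text = text.replace(mark, "")
    return len(set(text.lower().split()))

sample = "My grief lies onward, and my joy behind."
print(count_unique_words(sample))  # 7
```

Because no module is imported, this variant should also fit an assignment that forbids `import string`.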

Manoj Liwera · Answer 1 · 2019-09-11T03:56:48.157

str.isalpha returns True only if every character in the string is alphabetic.

input - 'Mike' output - True
input - 'charlie mike' output - False
input - 'charlie!,' output - False
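
A quick way to see the effect on the question's data is the small check below. The word list is copied from the first line of the sonnet, and the variable names are only illustrative.

```python
words = ["how", "heavy", "the", "way,"]

# str.isalpha is applied to each whole word, not to each character
print([w.isalpha() for w in words])  # [True, True, True, False]

# so filter() drops "way," entirely instead of trimming the comma
kept = list(filter(str.isalpha, words))
print(kept)  # ['how', 'heavy', 'the']
```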

In your case, applying isalpha to "way," returns False, so filter drops the whole word. It is better to remove the punctuation up front using string.punctuation; then there is no need for filter.

import string
unique_words = set()

sonnet = open("sonnet.txt", "r")

for line in sonnet:
    line ="".join([c for c in line if c not in string.punctuation])
    line = [word.lower() for word in line.split()]
    unique_words.update(line)

sonnet.close()

print("{} unique words".format(len(unique_words)))

If you need "My" and "my" to count as separate words in the unique word list, don't use word.lower().

lenik · Answer 2 · 2019-09-11T03:57:30.180

1

I'd rather do that differently:

import re
from collections import Counter

# Read the whole sonnet into one string
with open("sonnet.txt") as f:
    text = f.read()

words = re.findall(r'\w+', text)
counter = Counter(words)
print(len(counter))   # prints 95

If I convert all words to lower case using:

words = [w.lower() for w in words]

before counting, the result is 90.
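
One thing to be aware of with `\w+` is that it stops at apostrophes, so "travel's" is counted as two tokens. The sketch below shows this on one line of the sonnet, together with one possible alternative pattern; the alternative is an assumption about what should count as a word, not part of the answer above.

```python
import re

line = "When what I seek, my weary travel's end,"

# \w+ stops at the apostrophe, so "travel's" becomes "travel" and "s"
plain = re.findall(r"\w+", line.lower())
print(plain)

# Assumed alternative: runs of letters that may contain internal apostrophes
tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", line.lower())
print(tokens)
```

Whether "travel's" should be one word or two depends on the assignment, so the pattern is worth choosing deliberately.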

edited Sep 11 '19 at 03:57

answered Sep 11 '19 at 03:51


score 0 · Answer 3 · answered Sep 11 '19 at 03:21

Staying as close as possible to your example, but fixing the problem with it:

unique_words = set()

sonnet = open("sonnet.txt", "r")

for line in sonnet:
    words = ''.join(filter(lambda x: x.isalpha() or x.isspace(), line)).split()
    unique_words.update(words)

sonnet.close()

print("{} unique words".format(len(unique_words)))

Instead of checking whole words with .isalpha(), the filter runs over the characters of the line and keeps both letters and whitespace. The two tests are combined in a single lambda function so that filter can be used as you intended. The resulting filter iterator is turned back into a string with ''.join(...), and that string is then split on the whitespace that was kept.

The result is called words instead of overwriting the loop variable line, for clarity, and the words are added to the result set.

The output:

94 unique words
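
The count above is case-sensitive, since the code never lowercases the words. A minimal sketch of the same character filter with lowercasing added is shown here; the `words_in` helper is a hypothetical name, and the `.lower()` call is an addition that is not in the answer's code.

```python
def words_in(line):
    """Keep only letters and whitespace, lowercase, then split into words."""
    kept = filter(lambda ch: ch.isalpha() or ch.isspace(), line)
    return "".join(kept).lower().split()

line = "When what I seek, my weary travel's end,"
result = words_in(line)
print(result)
# ['when', 'what', 'i', 'seek', 'my', 'weary', 'travels', 'end']
```

Note that the apostrophe is removed, so "travel's" becomes "travels" rather than being split in two.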

Hi @Grismar thanks for showing me the lambda function. I didn't know about it, but now it's been added to my toolbox :) — bravocharliemike, Sep 11 '19 at 04:41

Amit Singh · Answer 4 · 2019-09-11T04:26:26.883

0

import string

l = []
with open("sonnet.txt","r") as f:
     s = f.read().strip()
     l = l + s.translate(str.maketrans('', '', string.punctuation)).split()

print(len(set(l)))

The punctuation removal is taken from this post. I'm treating words that differ only in case as different words. To ignore case, simply change this line:

s = f.read().strip() to s = f.read().strip().lower()

edited Sep 11 '19 at 04:26

answered Sep 11 '19 at 04:18


Hi Amit, I'd import string if my assignment allowed me :( – bravocharliemike Sep 11 '19 at 04:43
Hi ```string.punctuation``` is just a string ```!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~``` – Amit Singh Sep 11 '19 at 05:41
@bravocharliemike If you cannot use string.punctuation you can just replace it by the corresponding string value ```'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'``` – Amit Singh Sep 11 '19 at 05:42
Or you can add your own custom string for all the punctuation marks you want to ignore. – Amit Singh Sep 11 '19 at 05:42
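
Following the comments above, a version without `import string` might look like the sketch below. `PUNCTUATION` is assumed to be a literal copy of the value of `string.punctuation`, and `unique_words` is a hypothetical helper name; lowercasing is included because the question asks to ignore case.

```python
# Assumed literal copy of the value of string.punctuation
PUNCTUATION = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

def unique_words(text):
    """Return the set of lowercase words with punctuation stripped."""
    # The third argument to str.maketrans lists characters to delete
    table = str.maketrans("", "", PUNCTUATION)
    return set(text.lower().translate(table).split())

sample = "The beast that bears me, tired with my woe,"
print(len(unique_words(sample)))  # 9
```

For the real task, the text would come from reading sonnet.txt, and the string can be trimmed to only the marks you want to ignore.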
