How can you use Python to count the unique words (without special characters/ cases interfering) in a text document

Question

I am new to Python and need help writing a text content analyzer that reports 7 things about a text file:

Total word count
Total count of unique words (without case and special characters interfering)
The number of sentences
Average words in a sentence
Commonly used phrases (a phrase of 3 or more words used more than 3 times)
A list of words used, in order of descending frequency (without case and special characters interfering)
The ability to accept input from STDIN, or from a file specified on the command line

So far I have this Python program to print total word count:

with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
    words = p.split()
    wordCount = len(words)
    print("The total word count is:", wordCount)

I also have this Python program, which prints the unique words and their frequencies. The output is not in order, and it treats words such as dog, dog., "dog, and dog, as different words:

with open("/Users/name/Desktop/20words.txt", "r") as file:
    wordcount = {}
    for word in file.read().split():
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

for k, v in wordcount.items():
    print(k, v)

Thank you for any help you can give!

I'm afraid this is a bit too broad. You are basically asking us to write the entire program for you. What part do you have problems with specifically? You might try regular expressions for detecting words and sentences, or even take a look at some NLP toolkit. — tobias_k, Jun 23 '15 at 12:54
Currently, I am having difficulty coming up with a Python program that can read a txt file and output the total amount of unique words. I can only figure out how to print all the unique words and their occurrences. (I don't expect anyone to write the entire program, but to help with any part of it...) I am having difficulties with all questions except #1. I pretty much have #6, I am just trying to find how to output the unique words in descending order by frequency. — Crystal, Jun 23 '15 at 23:16

score 1 · Answer 1 · answered Jun 23 '15 at 12:57

If you know what characters you want to avoid, you can use str.strip to remove these characters from the extremities.

word = word.strip().strip("'").strip('"')...

This removes these characters from both ends of the word, but not from the middle. It probably isn't as efficient as using some NLP library, but it can get the job done.

str.strip Docs
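
Since str.strip accepts a whole set of characters in one call, the chained calls can be collapsed. Here is a minimal sketch, assuming that everything in string.punctuation counts as a "special character" for your purposes. The helper name normalize and the sample string are made up for illustration.

```python
import string

def normalize(word):
    """Lowercase a word and strip punctuation from both ends only."""
    # string.punctuation is an assumption about what "special" means here.
    return word.strip(string.punctuation).lower()

# Made-up sample text, not the contents of the question's file.
sample = 'The dog saw a "dog, and that dog. barked at the DOG!'

# Tokens that were pure punctuation normalize to "" and are dropped.
words = [w for w in (normalize(t) for t in sample.split()) if w]
unique_words = set(words)

print("Total words:", len(words))
print("Unique words:", len(unique_words))
```

Because strip only touches the ends, an apostrophe or hyphen inside a word (don't, upper-case) is kept, which may or may not be what you want.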

ssundarraj

Thank you I added this! Would you know of a way to now sort the words according to # of occurrences of each unique word? – Crystal Jun 23 '15 at 23:26
You can create a defaultdict with the keys being the words and the values being the count of each word. Then you can sort the dict (http://stackoverflow.com/a/3177911/2441165). Hope this helps. – ssundarraj Jun 24 '15 at 05:39
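
The defaultdict-then-sort suggestion in the comment above could look roughly like this. It is only a sketch: the word list is made up, and breaking ties alphabetically is my own assumption, since the question does not say how ties should be ordered.

```python
from collections import defaultdict

# Made-up word list; in practice this would be the cleaned-up words from the file.
words = ["dog", "cat", "dog", "bird", "cat", "dog"]

wordcount = defaultdict(int)  # missing keys start at 0
for word in words:
    wordcount[word] += 1

# Sort (word, count) pairs by count, highest first; ties go alphabetically.
by_frequency = sorted(wordcount.items(), key=lambda item: (-item[1], item[0]))

for word, count in by_frequency:
    print(word, count)
```

A plain dict has no useful order for this purpose, so the sorted list of pairs is what you iterate over when printing.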

tobias_k · Accepted Answer · 2015-06-24T07:47:08.053

Certainly the most difficult part is identifying the sentences. You could use a regular expression for this, but there might still be some ambiguity, e.g. with names and titles, that have a dot followed by an upper case letter. For words, too, you can use a simple regex, instead of using split. The exact expression to use depends on what qualifies as a "word". Finally, you can use collections.Counter for counting all of those instead of doing this manually. Use str.lower to convert either the text as a whole or the individual words to lowercase.

This should help you get started:

import re
import collections

text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""

# A sentence: an upper-case letter up to the first ., ! or ? that is followed
# by whitespace plus an upper-case letter, or by the end of the text.
sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
for s, n in sentences.most_common():
    print(s, n)

# A word: any run of letters, digits or underscores, compared in lowercase.
word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for w, n in words.most_common():
    print(w, n)
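
From the same two patterns you can also get the plain numbers asked for in the question (total words, unique words, number of sentences, average words per sentence). A rough sketch with a made-up sample text; the variable names are mine, and the sentence count inherits the ambiguity mentioned above:

```python
import re
import collections

# Made-up sample text; the two patterns are the ones used in the answer.
text = "The dog barked. The cat ran! Did the dog chase the cat? Yes."

sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
word = re.compile(r"\w+")

sentences = sentence.findall(text)
words = word.findall(text.lower())
counts = collections.Counter(words)

total_words = len(words)
unique_words = len(counts)        # one Counter key per distinct word
sentence_count = len(sentences)
# Guard against an empty text so we never divide by zero.
average = total_words / sentence_count if sentence_count else 0.0

print("Total words:", total_words)
print("Unique words:", unique_words)
print("Sentences:", sentence_count)
print("Average words per sentence:", average)
```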

For "more power", you could use some natural language toolkit, but this might be a bit much for this task.
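
Staying with the standard library, the "commonly used phrases" item from the question can be approached with Counter as well. This is a sketch under some assumptions of mine: a phrase is a run of exactly three consecutive words, sentence boundaries are ignored, and "over 3 times" means at least 4 occurrences. The function name common_phrases and its parameters are hypothetical.

```python
import re
import collections

def common_phrases(text, size=3, min_count=4):
    """Return (phrase, count) pairs for word n-grams seen at least min_count times."""
    words = re.findall(r"\w+", text.lower())
    # Slide a window of `size` words over the list; zip stops at the shortest slice.
    ngrams = zip(*(words[i:] for i in range(size)))
    counts = collections.Counter(" ".join(gram) for gram in ngrams)
    return [(phrase, n) for phrase, n in counts.most_common() if n >= min_count]

# Made-up sample text with one deliberately repeated sentence.
sample = "The quick brown fox. " * 4 + "A quick brown dog."
phrases = common_phrases(sample)

for phrase, n in phrases:
    print(phrase, n)
```

Longer phrases can be found by calling the function again with a larger size. Overlapping windows mean that one repeated sentence shows up as several phrases.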

Hi Tobias_k, I am having an error saying that: "name 're' is not defined" — Crystal, Jun 23 '15 at 23:23
@xgrioux Sorry, I thought that was clear: You have to import the `re` and `collections` modules. See my edit. — tobias_k, Jun 24 '15 at 07:47
