I'm doing this week's 'easy' Daily Programmer Challenge on Reddit. The description is at the link, but essentially the challenge is to read a text file from a URL and do a word count. Needless to say, the resulting output is a fairly large dictionary object. I have a few questions, mostly regarding accessing or sorting keys according to their value.
First, I developed the code according to what I currently understand about OOP and good Python style. I wanted it to be as robust as possible, but I also wanted to use as few imported modules as possible. My goal is to become a good programmer, so I believe it's important to develop a strong foundation and figure out how to do things myself whenever possible. That being said, the code:
from urllib2 import urlopen  # Python 2; in Python 3 urlopen lives in urllib.request


class Word(object):
    def __init__(self):
        self.word_count = {}

    def alpha_only(self, word):
        """Converts word to lowercase and removes any non-alphabetic characters."""
        x = ''
        for letter in word:
            s = letter.lower()
            if s in 'abcdefghijklmnopqrstuvwxyz':
                x += s
        if len(x) > 0:
            return x
        # Falls through and returns None when no letters remain.

    def count(self, line):
        """Takes a line from the file and builds a list of lowercased words containing only alphabetic chars.
        Adds each word to word_count if not already present, if present increases the count by 1."""
        words = [self.alpha_only(x) for x in line.split(' ') if self.alpha_only(x) is not None]
        for word in words:
            if word in self.word_count:
                self.word_count[word] += 1
            else:
                self.word_count[word] = 1


class File(object):
    def __init__(self, book):
        self.book = urlopen(book)
        self.word = Word()

    def strip_line(self, line):
        """Strips newlines, tabs, and return characters from beginning and end of line. If the remaining
        string is longer than 1 character, passes it along to the count method of the word object."""
        s = line.strip('\n\r\t')
        # Compare the length, not the string itself, against 1.
        if len(s) > 1:
            self.word.count(s)

    def process_book(self):
        """Main processing loop, will not begin processing until the first line after the line containing "START".
        After processing it will close the file."""
        begin = False
        for line in self.book:
            if begin:
                self.strip_line(line)
            elif 'START' in line:
                begin = True
        self.book.close()


book = File('http://www.gutenberg.org/cache/epub/47498/pg47498.txt')
book.process_book()
count = book.word.word_count
So now I have a fairly accurate and robust word count that probably doesn't have any duplicates or blank entries, but is nevertheless a dict object containing over 3k key/value pairs. I can't iterate over it using `for k,v in count` or it gives me the exception `ValueError: too many values to unpack`, which rules out using list comprehension or mapping to a function to perform any kind of sorting.
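The `ValueError` is what happens when a dict is iterated directly: the loop yields keys only, and Python then tries to unpack each key string into `k, v`. A minimal sketch with a made-up three-word dict (`sample_counts` is hypothetical, not data from the book); it should behave the same on Python 2 and 3:

```python
# Hypothetical stand-in for book.word.word_count.
sample_counts = {'the': 3, 'whale': 2, 'sea': 1}

# Iterating over a dict yields its keys only.
keys_seen = sorted(word for word in sample_counts)

# Unpacking a key string into two names fails unless the key happens
# to be exactly two characters long.
try:
    for k, v in sample_counts:
        pass
    unpack_failed = False
except ValueError:
    unpack_failed = True

# items() yields (key, value) pairs, which unpack cleanly.
pairs = sorted((k, v) for k, v in sample_counts.items())
```

So the dict itself is not the obstacle; the loop only needs to ask for `items()` to get both halves of each entry.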
I was reading this HowTo on Sorting and playing with it a few minutes ago and noticed that `for x in count.items()` lets me iterate through a list of key/value pairs without throwing a ValueError exception, so I removed the line `count = book.word.word_count` and added the following:
s_count = sorted(book.word.word_count.items(), key=lambda count: count[1], reverse=True)
# Delete the original dict, it is no longer needed
del book.word.word_count
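The same kind of sort can be tried on a small dict without downloading anything. This sketch uses invented sample data (`sample_counts`) and adds a secondary alphabetical key so that words with equal counts come out in a predictable order; that tie-break is an addition for illustration, not part of the code above.

```python
# Hypothetical sample data standing in for the real word counts.
sample_counts = {'whale': 2, 'the': 3, 'sea': 1, 'ship': 2}

# Sort by count descending, then by word ascending for ties.
# Negating the count is used instead of reverse=True, because
# reverse=True would also reverse the alphabetical tie-break.
s_count = sorted(sample_counts.items(), key=lambda pair: (-pair[1], pair[0]))

# The most frequent word is the first pair in the sorted list.
top_word, top_count = s_count[0]
```

Naming the lambda parameter `pair` rather than `count` also avoids reusing a name that means something else elsewhere in the script.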
Now I finally have a sorted list of words, `s_count`. PHEW! So, my questions are:
Is a dict even the best data type to perform the original counting? Would a list of tuples like that returned by `count.items()` have been preferable? But that would probably slow it down, right?
This seems kind of 'clunky', as I'm building a dict, converting it to a list containing tuples, then sorting the list and returning a new list. However, it is my understanding that dictionaries allow me to perform the fastest lookups, so am I missing something here?
I read briefly about hashing. While I think I understand that the point is that hashing will save space in memory and allow me to perform faster lookups and comparisons, wouldn't the trade-off be that the program becomes more computationally expensive (higher CPU load) because it would then be calculating hashes for each word? Is hashing relevant here?
Any feedback on naming conventions (which I am terrible at), or any other suggestions about basically anything (including style), would be greatly appreciated.
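On the dict-versus-list and hashing questions, one way to see the difference is to write both counters side by side. The sketch below is only an illustration under assumed names (`count_with_dict`, `count_with_pairs`, and the sample word list are all made up). The dict version does one hash-based lookup per word, which takes constant time on average, while the list-of-pairs version has to scan the list for every word. Computing a hash costs work proportional to the word's length, which is typically far less than scanning thousands of pairs.

```python
def count_with_dict(words):
    """Count words using a dict: one hash-based lookup per word."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_with_pairs(words):
    """Count words using a list of [word, count] pairs: a linear scan per word."""
    pairs = []
    for word in words:
        for pair in pairs:
            if pair[0] == word:
                pair[1] += 1
                break
        else:
            # No existing pair matched, so start a new one.
            pairs.append([word, 1])
    return pairs


# Hypothetical sample input, not text from the book.
sample_words = ['the', 'whale', 'the', 'sea', 'whale', 'the']
dict_counts = count_with_dict(sample_words)
pair_counts = count_with_pairs(sample_words)

# Both approaches agree on the totals; only the lookup cost differs.
same_totals = dict_counts == dict((w, n) for w, n in pair_counts)
```

If one extra import is acceptable, the standard library's `collections.Counter` wraps the dict approach and provides `most_common()` for the sorted view.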