Performing Counts, Sorting/mapping Large Dicts

Question

I'm doing this week's 'easy' Daily Programmer Challenge on Reddit. The description is at the link, but essentially the challenge is to read a text file from a url and do a word count. Needless to say the resulting output is a fairly large dictionary object. I have a few questions, mostly regarding accessing or sorting keys according to their value.

First, I developed the code according to what I currently understand about OOP and good Python style. I wanted it to be as robust as possible but I also wanted to use the least amount of imported modules. My goal is to become a good programmer, thus I believe it's important to develop a strong foundation and figure out how to do things myself whenever possible. That being said, the code:

from urllib2 import urlopen

class Word(object):

    def __init__(self):
        self.word_count = {}    

    def alpha_only(self, word):
        """Converts word to lowercase and removes any non-alphabetic characters."""
        x = ''
        for letter in word:
            s = letter.lower()
            if s in 'abcdefghijklmnopqrstuvwxyz':
                x += s
        if len(x) > 0:
            return x    

    def count(self, line):
        """Takes a line from the file and builds a list of lowercased words containing only alphabetic chars.
            Adds each word to word_count if not already present, if present increases the count by 1."""
        words = [self.alpha_only(x) for x in line.split(' ') if self.alpha_only(x) != None]
        for word in words:
            if word in self.word_count:
                self.word_count[word] += 1
            elif word != None:
                self.word_count[word] = 1

class File(object):

    def __init__(self,book):
        self.book = urlopen(book)               
        self.word = Word()

    def strip_line(self,line):
        """Strips newlines, tabs, and return characters from beginning and end of line. If remaining string > 1,
            splits up the line and passes it along to the count method of the word object."""
        s = line.strip('\n\r\t')
        if len(s) > 1:
            self.word.count(s)

    def process_book(self):
        """Main processing loop, will not begin processing until the first line after the line containing "START".
            After processing it will close the file."""
        begin = False
        for line in self.book:
            if begin == True:
                self.strip_line(line)
            elif 'START' in line:
                begin = True
        self.book.close()

book = File('http://www.gutenberg.org/cache/epub/47498/pg47498.txt')

book.process_book()

count = book.word.word_count

So now I have a fairly accurate and robust word count that probably doesn't have any duplicates or blank entries, but is nevertheless a dict object containing over 3k key/value pairs. I can't iterate over it using for k,v in count or it gives me the exception ValueError: too many values to unpack, which rules out using list comprehension or mapping to a function to perform any kind of sorting.

I was reading this HowTo on Sorting and playing with it a few minutes ago and noticed that for x in count.items() lets me iterate through a list of key/value pairs without throwing a ValueError exception, so I removed the line count = book.word.word_count and added the following:

s_count = sorted(book.word.word_count.items(), key=lambda count: count[1], reverse=True)

# Delete the original dict, it is no longer needed
del book.word.word_count

Now I finally have a sorted list of words, s_count. PHEW! So, my questions are:

1. Is a dict even the best data type to perform the original counting? Would a list of tuples like that returned by count.items() have been preferable? But that would probably slow it down, right?
2. This seems kind of 'clunky', as I'm building a dict, converting it to a list containing tuples, then sorting the list and returning a new list. However, it is my understanding that dictionaries allow me to perform the fastest lookups, so am I missing something here?
3. I read briefly about hashing. While I think I understand that the point is that hashing will save space in memory and allow me to perform faster look-ups and comparisons, wouldn't the trade-off be that the program becomes more computationally expensive (higher CPU load) because it would then be calculating hashes for each word? Is hashing relevant here?
4. Any feedback on naming conventions (which I am terrible at), or any other suggestions about basically anything (including style), would be greatly appreciated.

A dictionary is a nice way to store word:count values. You can iterate over `count` by doing `for k,v in count.iteritems()`. — 101, Dec 07 '14 at 05:24
[CodeReview](http://codereview.stackexchange.com/) is the best place for this type of question. In the meantime, [`collections.Counter`](https://docs.python.org/2.7/library/collections.html#collections.Counter) might make things easier for you. — wwii, Dec 07 '14 at 05:56

score 2 · Accepted Answer · edited Apr 13 '17 at 12:40

Are you sure that for k,v in count: gives the exception ValueError: too many values to unpack? I expect it to give ValueError: need more than 1 value to unpack.

When you iterate over a dict (e.g. in a for loop) you get only the keys, not the values. If you want key, value pairs you need to use the dict's iteritems() method, as mentioned in the comments (or, in Python 3, the items() method).

Of course, you can always do something like:

for k in count:
    print k, count[k]
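
To see what happens when you try to unpack keys directly, here is a small sketch. The sample data and the helper name `try_unpack` are made up for illustration, and the code is written so that it also runs on Python 3. Because each key is a string, `k, v = key` only succeeds for two-character words. Longer words raise "too many values to unpack" and one-letter words raise the "need more than 1 value" variant, so which message you see first depends on the keys in the dict.

```python
# Hypothetical sample data standing in for the real word count.
count = {'the': 120, 'of': 64, 'a': 51}

def try_unpack(key):
    """Try to unpack a key into two names, as `for k, v in count` would."""
    try:
        k, v = key
        return 'ok'
    except ValueError:
        return 'ValueError'

# Only a two-character key unpacks cleanly (into its two letters).
unpack_results = {key: try_unpack(key) for key in count}

# items() yields (key, value) tuples, which unpack as intended.
pairs = sorted((k, v) for k, v in count.items())
```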

...

I think that most of your questions are more suited to Code Review than to Stack Overflow. But since you've asked so nicely here, I'll mention a few points. :)

It's rather inefficient to build up a string char by char, so your alpha_only() method would be better if it collected chars in a list then used the str.join() method to join them into a single string. The usual Python idiom would do that using a list comprehension.
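
A minimal sketch of that idea, assuming you keep the same behaviour as your original (lowercase, a-z only, `None` when nothing is left). It is written here as a standalone function rather than a method, purely to keep the example short:

```python
def alpha_only(word):
    """Return word lowercased with non a-z characters removed, or None if nothing is left."""
    # Collect the wanted characters in a list, then join them once at the end.
    letters = [c for c in word.lower() if c in 'abcdefghijklmnopqrstuvwxyz']
    # An empty string is falsy, so `or None` mirrors the original's None result.
    return ''.join(letters) or None
```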

The list comprehension in your count() method calls alpha_only() twice for each word, which is inefficient.

You could make your strip() call simpler by using the default argument, as that strips all white space (and you don't need to preserve space chars in this application). Similarly, using split() with its default arg will split on any runs of blank space, which is probably better in this application, since giving an arg of a single space means that you'll get some empty strings in the list returned by split if there are any runs of multiple spaces within a line.
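
The difference is easy to see on a made-up line (the sample string below is hypothetical, not taken from the book):

```python
line = '  The  quick\tbrown fox \r\n'

# Splitting on a single space keeps an empty string for every extra space
# and leaves the tab embedded inside one of the pieces.
split_on_space = line.strip('\n\r\t').split(' ')

# With no argument, split() splits on any run of whitespace and ignores
# leading and trailing whitespace, so the strip() call is not even needed.
split_default = line.split()
```

Here `split_on_space` contains several empty strings plus the piece `'quick\tbrown'`, while `split_default` is just the four words.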

...

You mention hashing in your question, and whether it's useful for this application. Yes, it is. Python dictionaries actually use hashing of their keys, so you don't need to worry about the details. And yes, a dictionary is a good data structure to use for this task. There are fancier forms of dictionary that make things a bit simpler, but to use them does require importing a (standard) module. But using a dictionary (of some flavour or another) to hold data and then generating a list of tuples from it for final sorting is a fairly common practice in Python. And there's no need to specifically delete the dictionary when you've finished with it if the program's about to terminate anyway.
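
As a rough illustration of both approaches, here is a sketch using a made-up word list. The plain dict version needs no imports. The second version uses `collections.Counter` from the standard library (the class suggested in the comments), which is one of the "fancier" dictionaries I have in mind. It is written so that it also runs on Python 3.

```python
from collections import Counter

# Hypothetical sample words standing in for the cleaned words of a book.
words = ['the', 'cat', 'and', 'the', 'hat', 'and', 'the']

# Plain dict: get() supplies 0 for words that have not been seen yet.
word_count = {}
for word in words:
    word_count[word] = word_count.get(word, 0) + 1

# Sort the (word, count) pairs by count, highest first.
by_count = sorted(word_count.items(), key=lambda pair: pair[1], reverse=True)

# Counter does the tallying itself and can return the most common entries.
counter = Counter(words)
top_two = counter.most_common(2)
```

Either way the lookups and updates go through the dict's hashing, and the sort happens only once at the end.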

...

As for the duplicated call of alpha_only(), whenever you find yourself doing that sort of thing it's a sign that a list comprehension isn't really suitable for the task and that you should just use a normal for loop so that you can save the result of the function call rather than having to recalculate it. Eg,

words = []
for word in line.split():
    word = self.alpha_only(word)
    if word is not None:
        words.append(word)

Thank you! This is all very helpful information and I appreciate you taking the time to respond. Is there a way I can repost in Code Review, or would it not be worth it at this point? — JtheDude, Dec 07 '14 at 06:30
In regards to my list comprehension I was reluctant to put the second call to `alpha_only()` in there, but otherwise it returns `None` for any list items that are numerical or symbols, which I was trying to avoid. I suppose that is less inefficient than calling it twice, and my `for` loop won't add any `None` keys to the dict so I will take it out, and rewrite the code with your suggestions. Then I'll post the updated code to Code Review :-) — JtheDude, Dec 07 '14 at 06:35
@JtheDude: That sounds like a good plan. FWIW, you can read about [migration](http://meta.stackexchange.com/questions/10249/what-is-migration-and-how-does-it-work) of questions to other communities. As for the duplicated call of `alpha_only()`, I'll respond in my answer. — PM 2Ring, Dec 07 '14 at 06:37
Since I'm catching list items with a value of `None` in my `count()` method with `...elif word != None` anyway, would it be redundant to also catch them with a `for` loop? I updated `alpha_only()` to use a list comprehension that looks like `x = ''.join([x.lower() for x in word if x.isalpha()])`, which still returns `None` for some list items but gets it all done in one fell swoop. I just nested it in an `if` statement that makes sure the incoming word is not actually a number. — JtheDude, Dec 07 '14 at 07:09
@JtheDude: Yeah, there's nothing wrong with putting `None`s into the list comprehension and catching them further down the track, since you're processing data line by line, so the list isn't going to be very big anyway. But as a general strategy it's good to avoid collecting useless data. :) BTW, it's better to use `if word is not None` than `if word != None`. There's some discussion about this [here](http://stackoverflow.com/questions/3289601/null-object-in-python). — PM 2Ring, Dec 07 '14 at 07:18
I was trying to be mindful of memory usage so I didn't want to dump a whole entire ebook into memory with `read()` or `readlines()`. Am I correct in this line of reasoning? Or is the whole thing already sitting in memory as soon as I pass it to the `open()` function? I did read the documentation on `open()` and I don't recall if there was anything that indicated one way or the other. Edit: Found the answer at http://stackoverflow.com/questions/2239888/does-the-python-open-function-save-its-content-in-memory-or-in-a-temp-file — JtheDude, Dec 07 '14 at 07:27
@JtheDude: Well, the OS does perform some file buffering, so if the file is small enough it _may_ be all sitting in RAM once it's opened. However, as far as Python's concerned file data doesn't get read into memory until you explicitly read it. So it's more memory-efficient to process a text file line by line as you are doing, rather than slurping the whole thing in with `read()` or `readlines()`. OTOH, unless you're processing huge ebooks, they're probably small enough to read the whole thing in if you want to, and it _might_ be slightly faster. — PM 2Ring, Dec 07 '14 at 07:35
