6

there's heapq in python, for general usage. i want recording topN(0~20) for 10e7 records.

if use heapq, should use '-' to translate max to min; and recording a min number of bottom, to call heapq.heappushpop()

should i use heapq or self implement a heap(maybe buggy or less efficient)?

#update

import heapq
class TopN(object):
    """
    v format: (num, value)

    after looking into http://hg.python.org/cpython/file/2.7/Lib/heapq.py, 
    i find heappushpop already optimize, no need bottom value

    feed() can be optimize further, if needed:
        using func object instead of compare len(self.h) each time
    """
    def __init__(self, N):
        self.N = N
        self.h = []        

    def feed(self, v):  
        if len(self.h) < self.N:
            heapq.heappush(self.h, v)
        else:
            heapq.heappushpop(self.h, v)

    def result(self):
        self.h.sort(reverse=True)
        return self.h

def t_topn():
    topn = TopN(10)
    for i in xrange(5):
        topn.feed((i, str(i)))
    res = topn.result()    
    assert sorted(res, reverse=True) == res 

def t_topn_random():
    import random
    topn = TopN(10)
    for i in xrange(100):
        x = random.randint(0, 1e4)
        topn.feed((x, str(x)))
    res = topn.result()    
    assert sorted(res, reverse=True) == res 

if __name__ == '__main__':
    t_topn()
    t_topn_random()
whi
  • 2,685
  • 6
  • 33
  • 40
  • 1
    Well, what makes you think you _shouldn't_ use `heapq`? If it can do what you want, and it doesn't have any problems, and it's already built and tested, why is there even a question? – abarnert Jan 07 '13 at 04:01
  • If you want the top 20, then you can use `heapq.heappushpop(heap, x)` as it is, since it will push `x` onto the heap, then pop the *smallest* item from the heap. – unutbu Jan 07 '13 at 04:21
  • @unutbu: It sounds like he wants the opposite. Maybe these are golf scores? – abarnert Jan 08 '13 at 02:20
  • @abarnert: Well, his code seems to indicate he really is only interested in the largest values. – unutbu Jan 08 '13 at 02:37
  • @whi: Why are you pushing `(i, str(i))` to the heap? Pushing `i` would suffice. Then, at the end, you can convert your top N values to `str`s if you like. This will save you about 10**7 calls to `str` assuming `N << 10**7`. – unutbu Jan 08 '13 at 02:40
  • @unutbu: this is a test, application meaning: (count:key), i can also reverse it, to make a dictionary for all data {key:count} – whi Jan 08 '13 at 11:11
  • @abarnert: yes, i use it to record a fixed largest-N heap – whi Jan 08 '13 at 11:13

1 Answers1

19

The only problem with heapq is that it doesn't provide a key function like everything else in the stdlib does. (If you're curious why, Raymond Hettinger explains in this email. He's right that heapq couldn't provide the same interface as other sort functions—but the reasons don't affect your use case, where key would just be lambda x: -x.)

The usual workaround is to decorate-heap-undecorate. That is, put a modified version of your values into the heap that sorts by key. Normally, this means one of the following:

  • Storing key(x) instead of x, and then accessing unkey(value) instead of value (assuming key is reversible).
  • Storing (key(x), x) instead of x, and then accessing value[1]. (This can break stability, but heapq doesn't promise stability anyway.)
  • Writing a wrapper class that implements a custom __le__ method, then storing Wrapper(x) instead of x and accessing value.value instead of value.

In your case, the key function is reversible. So, just store -x, and access -value. That's about as trivial as decoration gets.

Still, regardless of how simple it is, you should probably write a wrapper, or you will screw it up at some point. For example, you could write a maxheap that wraps the minheap in heapq like this:

import heapq
def heapify(x):
    for i in range(len(x)):
        x[i] = -x[i]
    heapq.heapify(x)
def heappush(heap, item):
    heapq.heappush(heap, -item)
def heappop(heap):
    return -heapq.heappop(heap)

… and so on for any other functions you need. It may be a bit of a pain, but it's a lot less work than implementing the whole thing from scratch.

While you're at it, you may want to wrap the heap in an object-oriented API so you can do heap.push(x) instead of heapq.heappush(heap, x), etc.

import heapq
class MaxHeap(object):
    def __init__(self, x):
        self.heap = [-e for e in x]
        heapq.heapify(self.heap)
    def push(self, value):
        heapq.heappush(self.heap, -value)
    def pop(self):
        return -heapq.heappop(self.heap)

If you take a quick look around the recipes at ActiveState or the modules on PyPI, you should find that others have already done most of the work for you.

Alternatively, you could copy and paste the heapq source (it's pure Python) as maxheapq.py and just replace the cmp_lt function with its opposite. (Of course if you're doing that, it's probably just as easy, and certainly a lot clearer, to modify cmp_lt to take a key argument in the first place, and modify all the other functions to pass the key through—bearing in mind that it won't be as generally applicable anymore, because it can't make the usual guarantee that key is only called once.)

If you really want to live dangerously (you shouldn't), you could even monkeypatch it:

import heapq
def cmp_gt(x, y):
    return y < x if hasattr(y, '__lt__') else not (x <= y)
heapq.cmp_lt = cmp_gt

But you don't want to do that in real code.

spookylukey
  • 6,380
  • 1
  • 31
  • 34
abarnert
  • 354,177
  • 51
  • 601
  • 671
  • @ScottDavidTesler: If you look at [the documentation for the module](http://docs.python.org/3/library/heapq.html), there's a link to the source code right there at the top. Just follow it. – abarnert Nov 19 '13 at 20:48
  • @ScottDavidTesler: Or just `import heapq; print(heapq.__file__)` and you'll usually get the path to a local copy on your system. (This won't work if your Python installation puts the whole stdlib into a zipfile, or keeps .pyc files only, or various other tricks that are common on embedded systems.) – abarnert Nov 19 '13 at 20:50
  • @ScottDavidTesler: Finally, if you go to the Python website and click "Source" under "Download", or go to the first Google hit for "Python source", or just about any way of looking for it, you will either get [this page](http://www.python.org/getit/source/) with a link to the repo, or [the repo itself](http://hg.python.org/cpython/file/). From there, all Python modules (like `heapq.py`) are under Lib, and all C extension modules (like heapq's hidden accelerator code in `_heapmodule.c`) are under Modules. – abarnert Nov 19 '13 at 20:52