17

I'm relatively new to python (using v3.x syntax) and would appreciate notes regarding complexity and performance of heapq vs. sorted.

I've already implemented a heapq based solution for a greedy 'find the best job schedule' algorithm. But then I've learned about the possibility of using 'sorted' together with operator.itemgetter() and reverse=True.

Sadly, I could not find any explanation on expected complexity and/or performance of 'sorted' vs. heapq.

ofer.sheffer
  • 5,417
  • 7
  • 25
  • 26
  • It's not clear what exactly you're trying to do. If you're looking for the largest thing in a list, `max(L)` will be faster than creating a heap, and much faster than calling `sorted`. – Dan R Jul 10 '14 at 03:09
  • @DanRoche, apologies if this is a silly question, but is there a way to delete/pop from a collection with max(L) ? – ofer.sheffer Jul 10 '14 at 03:57
  • if `L` is a list than you can delete the largest element via `L.remove(max(L))`. That is somewhat wasteful as it makes two passes when only one is really necessary, but it should still be faster than building a heap. – Dan R Jul 10 '14 at 05:23

3 Answers3

17

If you use binary heap to pop all elements in order, the thing you do is basically heapsort. It is slower than sort algorightm in sorted function apart from it's implementation is pure python.

The heapq is faster than sorted in case if you need to add elements on the fly i.e. additions and insertions could come in unspecified order. Adding new element preserving inner order in any heap is faster than resorting array after each insertion.

The sorted is faster if you will need to retrieve all elements in order later.

The only problem where they can compete - if you need some portion of smallest (or largest) elements from collection. Although there are special algorigthms for that case, whether heapq or sorted will be faster here depends on the size of the initial array and portion you'll need to extract.

Odomontois
  • 15,918
  • 2
  • 36
  • 71
  • In this case, both are perfectly optimized (I assume) and everything you say is true. But if I'd like to test the complexity and performance to make sure which one is faster than the other, how would I do that? – ofer.sheffer Jul 10 '14 at 18:09
  • 1
    [Rules of Optimization](http://c2.com/cgi/wiki?RulesOfOptimization) 1. Don't 2. Don't… yet. 3. Profile first. My quick and dirty profiling of a) creating a list of 10,000 random numbers and calling `sorted` on it and b) creating 10,000 more numbers and using `heapq.heappush` to build the list yielded a 28% difference in time. That sounds impressive until you look at the magnitude: about 230 nanoseconds per element (which algorithm? I find it hard to find a case where that choice dominates.). – msw Jul 11 '14 at 00:04
3

The nlargest() and nsmallest() functions of heapq are most appropriate if you are trying to find a relatively small number of items. If you want to find simply single smallest or largest number , min() and max() are most suitable, because it's faster and uses sorted and then slicing. If you are looking for the N smallest or largest items and N is small compared to the overall size of the collection, these functions provide superior performance. Although it's not necessary to use heapq in your code, it's just an interesting topic and a worthwhile subject of study.

vikas0713
  • 566
  • 7
  • 9
2

heapq is implemented as a binary heap, The key things to note about binary heaps, and by extension, heapq:

  1. Searching is not supported
  2. Insertions are constant time on average
  3. Deletions are O(log n) time on average

Additional binary heap info described here: http://en.wikipedia.org/wiki/Binary_heap

While heapq is a data structure which has the properties of a binary heap, using sorted is a different concept. sorted returns a sorted list, so that's essentially a result, whereas the heapq is a data structure you are continually working with, which could, optionally, be sorted via sorted.

Additonal sorted info here: https://docs.python.org/3.4/library/functions.html#sorted

What specifically are you trying to accomplish?

Response to OP's comment:

Why do you think you need a heapq specifically? A binary heap is a specialized data structure, and depending on your requirements, it's quite likely not necessary.

You seem to be extremely concerned about performance, but it's not clear why. If something is a "bad performer", but its aggregate time is not significant, then it really doesn't matter in the bigger picture. In the aggregate case, a dict or a list would perform generally perform fine. Why do you specifically think a heapq is needed?

I wonder if this is a don't-let-the-perfect-be-the-enemy-of-the-good type of situation.

Writing Python using C extensions is a niche use case reserved for cases where performance is truly a significant issue. (i.e. it may be better to use, say, an XML parser that is a C extension than something that is pure Python if you're dealing with large files and if performance is your main concern).

Regarding In complex keep playing with structure case: could it be faster to sort with sorted and add elements via .append():

I'm still not clear what the use case is here. As I mentioned above, sorted and heapq are really two different concepts.

What is the use case for which you are so concerned about performance? (Absent other factors not yet specified, I think you may be overly emphasizing the importance of best-case performance in your code here.)

khampson
  • 14,700
  • 4
  • 41
  • 43
  • 3
    Insertions are constant on average; in general, they are O(log n). (Using amortized analysis, they are also constant, as *n* insertions will take O(n) time total.) – chepner Jul 10 '14 at 04:05
  • True, I edited my response to reflect the avg; I omitted that inadvertently. – khampson Jul 10 '14 at 04:25
  • @ken-hampson, my class has many different assignments. If it's the super-basic case: arrange once to pop smallest in order. Can I assume 'sorted' is the best choice? What is the implementation? Plus there's the 'written in C'/'pure python' performance related concept which is quite new to me. In complex keep playing with structure case: could it be faster to sort with sorted and add elements via .append() ?? -- hope this clears up my concerns. – ofer.sheffer Jul 10 '14 at 04:35
  • Added add'l info to answer w/ questions. – khampson Jul 10 '14 at 04:58
  • 2
    @chepner, your statement about the amortized complexity of *n* insertions is not correct, at least if you are using amortized in the usual sense of "worst-case time averaged over all operations". In particular, if the elements are inserted in reverse sorted order, the amortized cost is really Ω(log n) per insertion. You may be confusing with the O(n) cost to build the whole heap at once. – Dan R Jul 10 '14 at 05:26
  • @KenHampson, thank you for the input. I'm currently trying to make myself aware of as many options as I can so I can get the best results that don't take too long to implement. In that sense, I understand what you're saying about perfect as the enemy. Also: see my note above to Odomontois. – ofer.sheffer Jul 10 '14 at 18:15