6

Sometimes it makes sense to have a key-ordered dictionary. In C++, this is often implemented with a Red-black tree. But any self-balancing binary search tree will do (fwiw, Knuth is particularly clear on this subject). The best solution I've been able to come up with so far is to take R. McGraw's AVL-tree type and create a wrapper class that basically implements the STL map interface (also counting on the handy ordering of pairs (two element tuples) in Python). Such a tuple basically corresponds to std::map::value_type.

Yes, there's Python's bisect module, and while that is logarithmic at insertion time in the same way that self-balancing binary trees are logarithmic at insertion time (right?), frankly I just want an object. Called OrderedDict or something (and no, the Python 3.1 OrderedDict doesn't qualify -- that's for 'insertion-time' ordering -- and frankly what insertion-time ordering has to do with ordering is not quite obvious).

Note, a key-ordered dictionary is highly useful in many industries (in finance for instance, it's ordinary to keep track of price books of data, which are basically ordered dictionaries of price -> quantity, aggregated order information, etc.).

If anyone has any other ideas, that's great. All I know is I just got five million times smarter by Alex Martelli's 'answers' here. So I thought I'd ask.

7 Answers7

2

I got exactly the same need, and Alex Martelli's answer has totally convinced me: best is keeping a dictionary and a list of partially sorted keys, then sort when needed. This is efficient because of the very particular behaviour of python's sort algorithm (AKA Timsort). Key-ordered dict in Python

I tested his implementation and mine, and his was best (because he does not insert in the middle of the list)

(I strongly advise you to read the paper linked in AM's comment about the timsort, which is a pearl).

Community
  • 1
  • 1
LeMiz
  • 5,554
  • 5
  • 28
  • 23
  • Thanks. I finally read the full post. Very helpful. With respect to Alex Martelli and everyone's comments, I do think I might try to py-avl wrapper approach (as mentioned first in your post). It's a somewhat unusual case: but I basically need to find a sub-sequence (something like lower_bound and upper_bound in std::map). Even though timsort seems great (and particularly good for avoiding comparisons, while being more aggressive with swapping), something like this is I think more naturally resolved with a tree. O(log n) insert/delete is also nice.. –  Sep 29 '09 at 15:47
  • Final comment, and then I'll let it rest for a while. But so as to clarify what I was saying in the above. It's not so much that I want to find subsequences with lower_bound and upper_bound in log(n) time (bisect can do that too). It's that I may subsequently want to insert an element somewhere in the subsequence (e.g., a sorted a collection of widgets in a layout, sorted by distance away from a reference widget). If you have an iterator to a tree (or a linked list for that matter -- but lower_bound/upper_bound in a linked list is of course O(n)), you're done. –  Sep 29 '09 at 16:11
  • fwiw, here's one solution using that py-avl (with a little patch: http://pastebin.com/d80a3296): http://pastebin.com/f66ca441c (note, this solution is probably incorrect in all ways that are important. the iter_range method is an attempt to emulate lower_bound / upper_bound, though this is the first time i've explored custom iterators. so like i say, it's probably wrong.) –  Sep 29 '09 at 20:38
  • (in my defense though, assuming this isn't an entirely wrong-headed idea, a more correct emulation of lower_bound / upper_bound would probably involve some serious changes to py-avl. like comparison of iterators, etc....) –  Sep 29 '09 at 20:40
2

Lists are a miserable substitute for a tree.

Insertions need to move the whole list around to make space; deletions need to move the list back down. Adding or deleting stuff in batch is fine when it's possible, but it's very often not, or takes unnatural contortions to arrange it. A fundamental attribute of a tree is that insertions and deletions are O(log n); no amount of handwaving will turn O(n) into O(log n).

Inserting an item into a tree when you already know where it's going to go is O(1). Equivalently, deleting an item from a tree based on its node is also O(1). std::map supports both of these. These are both O(n) with a list.

Another fundamental property of a tree is that iterating over a range of values is O(1) per iteration. Combining list and dict loses this, because each iteration needs to do a dict lookup. (The list-of-tuples approach doesn't have this problem.)

Trees are among the most basic of data types. Python's lack of a tree container type is a wart. Maybe there's a third-party library implementing one (eg. the one linked by Mr. "Unknown", which I havn't tried so I can't vouch for), but there's no standard Python type for it.

Glenn Maynard
  • 55,829
  • 10
  • 121
  • 131
  • 1
    I must admit I find the lack of a (standardised) hash table in C++ more troublesome in my day to day work than the lack of a tree-based container in Python. But then I've never come across a use case like this where the structure wasn't small enough to just sort it. – Kylotan Sep 29 '09 at 09:42
  • 1
    O(1) insertion of item x: `thelist.append(x)`; O(1) deletion from index i: `thelist[i] = thelist[-1]; del thelist[-1]`. These perturb ordering, but for slight perturbations thelist.sort() is O(N) (thanks, timsort!-) and that only needs to be performed when actually needed (so for most access patterns you get good amortized performance). So, like e.g. Python's lack of tail recursion optimization, the lack of tree containers is very rarely a problem in practice (informal scan of Google's C++ sources: std::hashmap _many_ orders of magnitude more frequent than std::map!-). – Alex Martelli Sep 29 '09 at 15:16
  • They are for solving different kind of problems. In fact, you mean the vector/growable array implementation of a list, not the linked list. Also, a tree is just a list of lists, and you already have those in Python. – fortran Sep 29 '09 at 16:16
  • 1
    Which means insertion and deletion is O(n), exactly as I said, because resorting the list is intrinsic to the modification. Yes, fortran, of course you can implement a tree yourself; that's completely irrelevant. – Glenn Maynard Sep 29 '09 at 20:56
1

You might be looking for a SortedCollection: http://code.activestate.com/recipes/577197-sortedcollection/

Asad Awan
  • 11
  • 1
1

The Python sortedcontainers module provides a SortedDict data type for exactly these purposes. It uses a modified B-tree type data structure and is written in pure-Python. The module has 100% test coverage and hours of stress. Though pure-Python, it's faster that C-implementations and has a performance comparison to back it up.

Because it's pure-Python, installation is a breeze with pip:

pip install sortedcontainers

Then you simply:

from sortedcontainers import SortedDict
help(SortedDict)

The performance comparison includes a pretty comprehensive list of alternative implementations. It's worth a look.

GrantJ
  • 8,162
  • 3
  • 52
  • 46
1

I stumbled on this question needing an OrderedMap myself, and found to my horror that the accepted answer is complete garbage. So I rolled my own, in case anyone finds it useful:

from bisect import *

class OrderedMap:
    """Much less efficient than a dict, 
    but keys don't need to be hashable."""
    __default_arg = object()
    def __init__(self, keyvalues_iter = None):
        self.clear()
        if keyvalues_iter is not None:
            self.update(keyvalues_iter)
    def clear(self):
        self.__keys = []
        self.__values = []
    def __index(self, key):
        if self.__keys:
            index = bisect(self.__keys, key)-1
            if self.__keys[index] == key:
                return index
        raise KeyError(key)
    def __len__(self):
        return len(self.__keys)
    def __contains__(self, key):
        try:
            self.__index(key)
            return True
        except KeyError:
            return False
    def __getitem__(self, key):
        index = self.__index(key)
        return self.__values[index]
    def __setitem__(self, key, value):
        try:
            index = self.__index(key)
            # exists
            self.__values[index] = value
        except KeyError:
            # new
            index = bisect(self.__keys, key)
            self.__keys.insert(index, key)
            self.__values.insert(index, value)
    def __delitem__(self, key):
        index = self.__index(key)
        self.__keys.pop(index)
        self.__values.pop(index)
    def __iter__(self):
        return iter(self.__keys)
    def get(self, key, default=__default_arg):
        try:
            return self[key]
        except KeyError:
            if default != OrderedMap.__default_arg:
                return default
            raise
    def setdefault(self, key, default = None):
        try:
            return self[key]
        except KeyError:
            if default != OrderedMap.__default_arg:
                self[key] = default
                return default
            raise
    def items(self):
        return zip(self.__keys, self.__values)
    def iteritems(self):
        return iter((self.__keys[x], self.__values[x])
                    for x in xrange(len(self)))
    def keys(self):
        return self.__keys[:]
    def iterkeys(self):
        return iter(self.__keys)
    def values(self):
        return self.__values[:]
    def itervalues(self):
        return iter(self.__values)
    def update(self, other):
        for k, v in other.iteritems():
            self[k] = v
    def __repr__(self):
        s = ", ".join("%s: %s" % (repr(self.__keys[x]),
                                  repr(self.__values[x]))
                      for x in xrange(len(self)))
        return "OrderedMap{%s}" % (s,)
porgarmingduod
  • 7,668
  • 10
  • 50
  • 83
  • Not too sure about the usefulness of having a `clear()` method since instances are otherwise immutable once created -- unlike in the accepted answer. – martineau Nov 01 '11 at 20:11
0

For a list that stays sorted, you can try module heapq.

bobflux
  • 11,123
  • 3
  • 27
  • 27
  • 1
    not fully sorted, a heap just imposes ordering between levels of a tree (father with children) but not in the same level (siblings), so for example in a valid heap from lower to higher, h[0] – fortran Oct 01 '09 at 08:47
-1

As you said, you can roll your own implementation with bisect:

class OrderedDict:
  def __init__(self, keyvalues_iter):
    self.__srtlst__ = sorted(keyvalues_iter)
  def __len__(self):
    return len(self.__srtlst__)
  def __contains__(self, key):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      return True
    else:
      return False    
  def __getitem__(self, key):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      return self.__srtlst__[index][1]
    else:
      raise KeyError(key)
  def __setitem__(sekf, key, value):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      self.__srtlst__[index][1] = value
    else:
      self.__srtlst__[index]=(key, value)
  def __delitem__(sekf, key, value):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      del __srtlst__[index]
    else:
      raise KeyError(key)
   def __iter__(self):
     return (v for k,v in self.__srtlst__)
   def clear(self):
     self.__srtlst__ = []
   def get(self, key, default=None):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      return self.__srtlst__[index][1]
    else:
      return default
   def items(self):
     return self.__srtlst__[:]
  def iteritems(self):
    return iter(self.__srtlst__)
  def iterkeys(self):
    return (k for k,v in self.__srtlst__)
  def itervalues(self):
    return (v for k,v in self.__srtlst__)
  def keys(self):
    return [k for k,v in self.__srtlst__]
  def values(self):
    return [v for k,v in self.__srtlst__]
  def setdefault(self, key, default):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      return self.__srtlst__[index][1]
    else:
      self.__srtlst__[index]=(key, default)
      return default
  def update(self, other):
    #a more efficient implementation could be done merging the sorted lists
    for k, v in other.iteritems():
      self[k] = v  

hmmm... it seems that I already did that for you, d'oh!

fortran
  • 74,053
  • 25
  • 135
  • 175
  • The only disadvantage of this implementation, afaik, is that insertion and deletion is potentially O(n) (even though you can find the index in O(log n), if you have to move elements to make way for a new element or remove an existing element, you may need to shift the underlying sorted list in O(n) time). slightly modified version available here: http://pastebin.com/f1e721bdb (but still no insertion method) –  Sep 29 '09 at 15:31
  • yes, of course growing the list can take time, but it should be asymptotically amortized... about the shifting, some tricks could be done to alleviate it (like for example marking empty positions and having an smaller auxiliary list for the costly insertions at the beginning and merge and compact from time to time), but in the end you should optimize for your needs: iterating (list wins), mutating (tree wins) or direct access by key (hashmap wins). – fortran Sep 29 '09 at 16:10
  • 4
    The "only disadvantage" is O(n) modifications instead of O(log n) or O(1)? That's not a minor disadvantage, that's losing a fundamental benefit of a tree. If all you happen to need at the moment is what you can get from this, great, but this is no std::map. – Glenn Maynard Sep 29 '09 at 21:04
  • As I said, everything is about trade-offs, it's the user who has to measure the impact of the different factors and choose the most appropriate structure. If the op is going to iterate sequentially through the map much more than mutating it or accessing by key, then this is a good option. – fortran Mar 19 '10 at 14:20