8

Suppose I have a dictionary

{1:5, 2:5, 4:5}

Is there a data structure such that if I add the key-value pair 3:5, to have it entered in the dictionary so that the keys are in sorted order? i.e.

{1:5, 2:5, 3:5, 4:5}

I am aware of the collections.OrderedDict(), but this only keeps the keys in the order in which they were added (which isn't sufficient for me currently).

I don't want to have to use a normal dictionary dic = {}, then have to use sorted(dic)[0] to grab the smallest key. I'd rather have sorted_dict[0] type function.
The reason for this is that if I use a normal dictionary, I will have to call the sort multiple times, as I am continuously adding pairs to my dictionary.

EDIT: I should have mentioned, it's not only the smallest and largest keys I care about, I will also need to print this dictionary at regular intervals as well...

Tom Jones
  • 111
  • 5
  • with my level of skills, my own class would be even slower than sorted(dict)!! :( – Tom Jones Mar 19 '13 at 05:17
  • 2
    What is the use case for which you need this (the actual problem you are trying to solve). – Burhan Khalid Mar 19 '13 at 05:20
  • a financial market application, the keys will be price, the values will be the volume at that price. as new data comes in i have to update my dictionary, and i want to find out the best and worst price and their volumes multiple times in my script. – Tom Jones Mar 19 '13 at 05:24
  • Perhaps all you need to do is keep track of min and max keys. It seems like you don't need all of the keys sorted. – Steven Rumbalski Mar 19 '13 at 05:30
  • sorry, I should have mentioned, I also want to print this dictionary in order at certain times as well... and need to print them in order – Tom Jones Mar 19 '13 at 05:32
  • Dictionaries are not "sorted" as far as I know. grab the smallest key as min(dic.keys()) – RedBaron Mar 19 '13 at 05:35
  • 2
    @RedBaron: That's going to be O(N) each time he needs the smallest key, and O(N) each time he needs the largest, and of course O(N log N) each time he needs to print the whole thing. He explicitly says in the question that this is not acceptable. – abarnert Mar 19 '13 at 05:49
  • Yes I get that, but maybe he can keep the sorted list of keys in a separate place. Then list[0] gives him the minimum and list[-1] gives him maximum - O(1). It increases the space complexity but is much simpler to implement as he only needs to append the new key to this list and sort it (He has to sort the entire collection anyway everytime a new element is added) – RedBaron Mar 19 '13 at 05:52
  • possible duplicate of [Key-ordered dict in python](http://stackoverflow.com/questions/1319763/key-ordered-dict-in-python) – wim Mar 19 '13 at 05:55
  • 3
    specifically, Alex Martelli has a [nice answer](http://stackoverflow.com/a/1320202/674039) to this question – wim Mar 19 '13 at 05:59
  • @wim: Martelli's answer is for infrequent-writes/frequent-reads. I believe the OP's use case is the exact opposite. On top of that, it's focused on maintaining the O(1) lookups, which the OP doesn't need. (He only cares about access to the first and last values, and iterations.) So, getting O(log N) inserts instead of O(N) (and guaranteed O(N) iteration instead of likely amortized O(N)) probably trumps O(1) instead of O(log N) random access. (And that also means the question isn't a dup.) – abarnert Mar 19 '13 at 06:12
  • @RedBaron: The whole point is that he _doesn't_ need to sort the entire collection every time a new element is added. That's what balanced binary search trees, B+Trees, skiplists, etc. are for; you only need to re-sort O(log N) elements instead of the whole thing. – abarnert Mar 19 '13 at 06:13
  • OK. how do you "undo" a close vote on this site? – wim Mar 19 '13 at 06:17

3 Answers3

5

If you plan to add and remove keys from the dictionary continuously, you really want something that uses an appropriate data structure for the problem—not a hash table (or a hash table plus a list, as with the SortedOrderedDict-type recipes), but a balanced tree (or an equivalent, like a skip list).

If you look around at PyPI, you'll find a number of options. My recommendation would be blist. Even though its data structure may not be quite as optimal as some of the others (because a B+Tree is much broader than a binary tree), it's probably good enough for almost any use case you will run into it. And it has a complete and well-tested interface, including well-tested performance guarantees. And it's used quite a bit by other serious projects.

If you're dealing with one of the rare cases where the tree performance really is critical, you should probably look at the various red-black tree, splay tree, skiplist, etc. implementations. I've used bintrees before, which has a great interface (e.g., you can access the keys and values by index, and even slice the tree, as well as treating it like a dict, and the author has thought through and avoided all the potential ambiguities), but I haven't seriously performance tested it.

Or, if your keys and values really are all smallish integers, you might want to consider using Cython to wrap a C++ map<int, int> in a Pythonic interface. (It's not quite possible to provide a complete interface on top of C++ map, but you often don't need that anyway.) Or, alternatively, modify one of the implementations like bintrees.FastRBTree to store and compare long instead of PyObject*.

On the other hand, if you're just going to create the dictionary all at once and then use it, there's a much simpler answer. Sort it, and stick it in an OrderedDict. Then you don't need anything outside the stdlib.

sorted_dict = collections.OrderedDict(sorted(d.iteritems()))

From a comment on another answer, you say "i don't have permissions to install new modules..."

First, make sure that's really true. You probably do have permission to install modules in a user site-packages directory. Or, if virtualenv is installed and/or you're using 3.3 with built-in venv, even better, you probably have permission to create a venv and install modules into that.

But if so, what you need to do is copy the files from blist/bintrees/whatever into your project.

The problem you might run into is that most of these packages contain C extension modules, which means you have to be able to build them (well, build_ext -i them). If your system doesn't have the Python dev files and a compiler tool chain installed, you can't do that. In that case, you're looking for the best pure-Python solution. bintrees comes with a pure-Python implementation that's identical to the normal C-extension implementation, except slower. It's still O(log N) of course, just the constant factor is a lot higher. If N is big enough, it's still a huge win; if not, it may not be.

If any part of this sounds reasonable, but you need help setting up a per-user site-packages or virtual env, or copying a module into your project in-place, or building extensions in-place, etc., you should probably search for existing questions, and ask a new one if you can't find one (if for no other reason than because the kinds of people who are experts at installation issues aren't necessarily experts at data structures, and may not even be reading this question).

abarnert
  • 354,177
  • 51
  • 601
  • 671
  • 2
    yeah, I need to add and remove items in the dictionary continuously, so I'd rather not have to use this sort function every single time (n log n). Will look into your blist solution – Tom Jones Mar 19 '13 at 05:34
  • PS, if you use `blist` for a serious project, email Daniel the author; he likes to keep tabs on how it's being used it so he can push for its including in the stdlib one day (which would obviously make your life easier, as you wouldn't have to install it from PyPI anymore). – abarnert Mar 19 '13 at 05:43
  • Wasn't aware of the `blist` package. After checking it out, definitely a +1. – root Mar 19 '13 at 06:02
3

Try this recipe — http://code.activestate.com/recipes/576998-sorted-dictionary/

It keeps keys sorted using stdlib bisect module.

Leonid Shvechikov
  • 3,927
  • 2
  • 17
  • 14
  • 1
    This works, but it's not going to be as efficient as using the right data structure for the job. – abarnert Mar 19 '13 at 05:41
  • also had a look at the link on this page: http://stutzbachenterprises.com/blist/sorteddict.html. Good, though I'm at work, and I can't use this module there... – Tom Jones Mar 19 '13 at 05:49
  • @XinLiang: That's the same `blist` I recommended. Why can't you use this module? (If you mean you can't use the recipe from ActiveState because of the ambiguous licensing… I've run into that problem before with some employers' legal departments, even though I'm pretty sure that the explicit MIT license overrides the AS license. But anyway, `blist` and `bintrees` are BSD-licensed and MIT-licensed, respectively, so that's not a problem. – abarnert Mar 19 '13 at 05:50
  • i don't have permissions to install new modules... Also, fwiw, I'm quite new to python (3 weeks experience) – Tom Jones Mar 19 '13 at 05:56
  • @XinLiang: I've edited my answer to deal with that. But I think we're getting off-topic on comments for this answer. – abarnert Mar 19 '13 at 06:20
  • Try to copy-paste this recipe and measure speed, it should be rather fast. Otherwise use abarnert's comprehensive instructions :) – Leonid Shvechikov Mar 19 '13 at 06:20
1

More than a year late to the party but I wanted to suggest the sortedcontainers module. Like blist and bintrees, it provides a SortedDict data type that maintains the keys in sorted order. Unlike those modules it's written in pure-Python and is actually faster. SortedDict also supports indexing. Looking up the min/max actually happens in O(1) time.

Since it's pure-Python, installation with pip should be a breeze:

pip install sortedcontainers

Then you can simply import the SortedDict

In [1]: from sortedcontainers import SortedDict

In [2]: d = SortedDict({1:5, 2:5, 4:5})

In [3]: d
Out[3]: SortedDict({1: 5, 2: 5, 4: 5})

In [4]: d[3] = 5

In [5]: d
Out[5]: SortedDict({1: 5, 2: 5, 3: 5, 4: 5})

If you have difficulty installing things using pip or can't copy files that'll need compiling then you can just pull the sortedlist.py and sorteddict.py files out of the depot. All the code is open source on github.

The sortedcontainers module also provides a performance comparison with the most popular suggestions benchmarked against one another.

GrantJ
  • 8,162
  • 3
  • 52
  • 46