3

I am interested in a dict implementation for Python that provides an iterating interface to sorted values. I.e., a dict with a "sortedvalues()" function.

Naively one can do sorted(dict.values()) but that's not what I want. Every time items are inserted or deleted, one has to run a full sorting which isn't efficient.

Note that I am not asking about key-sorted dict either (for that question, there are excellent answers in Key-ordered dict in Python and Python 2.6 TreeMap/SortedDictionary?).

Community
  • 1
  • 1
Ling
  • 535
  • 4
  • 9

4 Answers4

3

One solution is to write a class that inherits from dict but also maintains a list of keys sorted by their value (sorted_keys), along with the list of corresponding (sorted) values (sorted_values).

You can then define a __setitem__() method that uses the bisect module in order to know quickly the position k where the new (key, value) pair should be inserted in the two lists. You can then insert the new key and the new value both in the dictionary itself, and in the two lists that you maintain, with sorted_values[k:k] = [new_value] and sorted_keys[k:k] = [new_key]; unfortunately, the time complexity of such an insertion is O(n) (so O(n^2) for the whole dictionary).

Another approach to the ordered element insertion would be to use the heapq module and insert (value, key) pairs in it. This works in O(log n) instead of the list-based approach of the previous paragraph.

Iterating over the dictionary can then simply done by iterating over the list of keys (sorted_keys) that you maintain.

This method saves you the time it would take to sort the keys each time you want to iterate over the dictionary (with sorted values), by basically shifting (and increasing, unfortunately) this time cost to the construction of the sorted lists of keys and values.

Eric O. Lebigot
  • 91,433
  • 48
  • 218
  • 260
  • Inserting in lists usually is `O(n)`, when they are array based. Which is very likely the case, as otherwise `list[123]` will be `O(n)` instead of `O(1)`. – Has QUIT--Anony-Mousse Jan 29 '12 at 16:37
  • @Anony-Mousse: Thanks! You're right (http://wiki.python.org/moin/TimeComplexity). – Eric O. Lebigot Jan 29 '12 at 16:43
  • 1
    One approach you could use is to simply sort the list each time an item is appended. Python's sort is very efficient when a list is *almost* in sorted order. It *might* even be faster than `bisect` since `bisect` is written in Python while `sort` is written in C. – kindall Jan 29 '12 at 16:44
  • However that operation can be quite well optimized by the CPU, and may beat a naive tree implementation. I was also thinking about recommending to use a sorted list with bisect. The problem however is to maintain the `dict` side. If you use a dict that maps to the list position of the key, you need to also do `O(n)` updates to this list on any insert or delete from the dictionary. And as I understood the question, it should still be a `dict`. – Has QUIT--Anony-Mousse Jan 29 '12 at 16:46
  • @Anony-Mousse: You can overcome the `O(n)` updates by keeping `(value, key)` pairs in a heap. – Eric O. Lebigot Jan 29 '12 at 16:51
  • The problem is when you get `a["abc"] = 1`, then "a["abc"] = 2`, you will then need to locate and remove the old value of `"abc"`. I.e. update and remove by key are the difficult operations. As long as it is insert only, it's nothing but maintaining a sorted set of the values. – Has QUIT--Anony-Mousse Jan 29 '12 at 16:55
  • @Anony-Mousse Yes, it has to be basically a dict since I want to update by key (with efficiency). – Ling Jan 29 '12 at 17:08
  • @EOL Probably a naive question: Can one iterate a heapq without destroying it (popping all the elements)? If one can't, I don't really see how a heapq would help (?) – Ling Jan 29 '12 at 17:10
  • @Ling: I had the same question too. :) An option is to copy the heap with `heap_for_sort = heap[:]` and then get the sorted elements with `heapq.heappop(heap_for_sort)`, so as to keep `heap` intact. – Eric O. Lebigot Jan 29 '12 at 22:45
2

The problem is that you need to sort or hash it by keys to get reasonable insert and lookup performance. A naive way of implementing it would be a value-sorted tree structure of entries, and a dict to lookup the tree position for a key. You need to get deep into updating the tree though, as this lookup dictionary needs to be kept correct. Essentially, as you would do for an updatable heap.

I figure there are too many options to make a resonable standard library option out of such a structure, while it is too rarely needed.

Update: a trick that might work for you is to use a dual structure:

  1. a regular dict storing the key-value pairs as usual

  2. any kind of sorted list, for example using bisect

Then you have to implement the common operations on both: a new value is inserted into both structures. The tricky part are the update and delete operations. You use the first structure to look up the old value, delete the old value from the second structure, then (when updating) reinsert as before.

If you need to know the keys too, store (value, key) pairs in your b list.

Update 2: Try this class:

import bisect
class dictvs(dict):
    def __init__(self):
        self._list = []

    def __setitem__(self, key, value):
        old = self.get(key)
        if old is None:
            bisect.insort(self._list, value)
            dict.__setitem__(self, key, value)
        else:
            oldpos = bisect.bisect_left(self._list, old)
            newpos = bisect.bisect_left(self._list, value)
            if newpos > oldpos:
                newpos -= 1
                for i in xrange(oldpos, newpos):
                    self._list[i] = self._list[i + 1]
            else:
                for i in xrange(oldpos, newpos, -1):
                    self._list[i] = self._list[i - 1]
            self._list[newpos] = value
            dict.__setitem__(self, key, value)

    def __delitem__(self, key):
        old = self.get(key)
        if old is not None:
            oldpos = bisect.bisect(self._list, old)
            del self._list[oldpos]
        dict.__delitem__(self, key)

    def values(self):
        return list(self._list)

It's not a complete dict yet I guess. I havn't tested deletions, and just a tiny update set. You should make a larger unit test for it, and compare the return of values() with that of sorted(dict.values(instance)) there. This is just to show how to update the sorted list with bisect

Ling
  • 535
  • 4
  • 9
Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
  • As mentioned in other comments (including yours), appending a new value to the value list and then re-sort should be very fast in Python. I am not sure how efficient it is to update one value (in the middle of the list) and then re-sort---if it's also fast, then this solution (a dict + a sorted value list) would be perfect, as deletion happens rarely in my usage. – Ling Jan 29 '12 at 17:02
  • 1
    Updates are the tricky part. It depends a lot on the characteristics of your data whether or not "delete and reinsert" is better or "update and resort". An optimized implementation would compute the remove *and* the reinsert position, then only move the contents of the list within this interval. But obviously, this is at most twice as fast as the naive delete-and-reinsert using bisect. Resorting also needs at least `n-1` steps to check if the list is still sorted. – Has QUIT--Anony-Mousse Jan 29 '12 at 17:11
  • @Anony-Mousse: can't one make the deletion-insertion in your code much faster by using slice assignments? something like `self._list[oldpos:newpos] = self._list[oldpos+1:newpos] + [value]` (I have not checked that the indices are correct, but the idea should be clear). – Eric O. Lebigot Jan 29 '12 at 22:53
  • This probably depends on the kind of updates you receive. My code is optimized for the case that the values don't heavily change, so `newpos-oldpos` is very small. Slicing might result in large memory management cost, due to array duplication. – Has QUIT--Anony-Mousse Jan 30 '12 at 08:27
2

Here is another, simpler idea:

  • You create a class that inherits from dict.
  • You use a cache: you only sort the keys when iterating over the dictionary, and you mark the dictionary as being sorted; insertions should simply append to the list of keys.

kindall mention in a comment that sorting lists that are almost sorted is fast, so this approach should be quite fast.

Eric O. Lebigot
  • 91,433
  • 48
  • 218
  • 260
  • How do you remove/update the cache? When you destroy it on any such operation, you are essentially using `sorted(dict.values())`. – Has QUIT--Anony-Mousse Jan 29 '12 at 17:01
  • The cache is only updated when iterating over the dictionary *and* if it is obsolete; after creating it, subsequent sorts should be fast, as the cache is already partially sorted (kindall's remark). There is no need to destroy the cache, as values are appended to it at each insertion; during an append, the cache should be marked as being dirty (with a flag). – Eric O. Lebigot Jan 29 '12 at 22:39
  • Insertions are trivial. I'm talking about removing objects and updating values in the dictionary. – Has QUIT--Anony-Mousse Jan 30 '12 at 08:26
  • Objects can be deleted immediately, but this is expensive, with the advantage of a cache update: the keys can be sorted again (usually quickly since they are mostly sorted, and in O(n log n) otherwise), the deleted key+value can be found in O(log n) with `bisect`, and the deletion of the key and value lists themselves is O(n). However, deletions are probably not nearly as common as insertions, in my experience, so depending on the situation, this may not be an issue at all. Updating values can also trigger a cache update (sort) and then a bisect in O(log n) + simple update in O(1). – Eric O. Lebigot Jan 30 '12 at 18:48
1

You can use a skip dict. It is a Python dictionary that is permanently sorted by value.

Insertion is slightly more expensive than a regular dictionary, but it is well worth the cost if you frequently need to iterate in order, or perform value-based queries such as:

  1. What's the highest / lowest item?
  2. Which items have a value between X and Y?
malthe
  • 1,237
  • 13
  • 25