Large sortable data structures? Dictionary or something else?

Question

I have a large python dictionary (65535 key:value pairs), where key is range(0, 65536) and the values are integers.

The solution I found to sorting this data structure is posted here: Sort a Python dictionary by value

That solution works, but is not necessarily very fast.

To further complicate the problem, there is a potential for me to have many (thousands) of these dictionaries that I must combine prior to sorting. I am currently combining these dictionaries by iterating over the pairs in one dictionary, doing a key lookup in the other dictionary, and adding/updating the entry as appropriate.

This makes my question two fold:

1)Is a dictionary the right data structure for this problem? Would a custom tree or something else make more sense?

2)If dictionary is the smart, reasonable choice, what is the ideal way to combine multiples of the dictionary and then sort it?

One solution may be for me to redesign my program's flow in order to decrease the number of dictionaries being maintained to one, though this is more of a last resort.

Thanks

See also http://stackoverflow.com/questions/1491037/mapping-stdmap-to-python — Mikel, Feb 03 '11 at 22:50
The step in which you merge the dictionaries seems achievable just with an `A.update(B)`. — piro, Feb 04 '11 at 00:14
So I realize the more correct way to ask the question is "I need a value sorted dictionary where both the keys and values are integers." It needs to be key:value based because because I want to order based on the values, but return the key for analysis. — dicato, Feb 04 '11 at 15:27

score 2 · Accepted Answer · answered Feb 04 '11 at 01:13

2

A dictionary populated with 65535 entries with keys from the range (0:65536) sounds suspiciously like an array. If you need sorted arrays, why are you using dictionaries?

Normally, in Python, you would use a list for this type of data. In your case, since the values are integers, you might also want to consider using the array module. You should also have a look at the heapq module since if your data can be represented in this way, there is a builtin merge function that could be used.

In any case, if you need to merge data structures and produce a sorted data structure as a result, it is best to to use a merge algorithm and one possibility for that is a mergesort algorithm.

answered Feb 04 '11 at 01:13

Michael Dillon

31,973
6
70
106

I think you are right. I've been so "pythonic" lately (or at least that is my excuse) that I completely overlooked the fact that lists are most likely more appropriate. In this scenario, I ALWAYS have 1-65525 as 'keys' (though the set does not need to be complete, but that's not as important). I will do a little test and then make this as a the answer if it works for me. – dicato Feb 04 '11 at 02:06
Michael, please see my comment above. I realized that I NEED the keys to be coupled to the values so I can return them based on value ordering. – dicato Feb 04 '11 at 15:28
1

OK, I see the problem. But if you put tuples in the list, i.e. instead of [3687,3398,22,7464,541,2007] you would do [(0,3687),(1,3398),(2,22),(3,7464),(4,541),(5,2007)] then you can sort the list based on the values and not lose the original list index/key. – Michael Dillon Feb 04 '11 at 22:30
A list of tuples is what I am going to go with. Thanks for the insight. If I have an issues I will report them back! – dicato Feb 06 '11 at 21:50
(Rather odd that this answer seems to have been accepted for a comment suggesting the exact same thing I explained in detail a day before.) – Glenn Maynard Feb 08 '11 at 03:07

score 0 · Answer 2 · answered Feb 03 '11 at 23:12

There's not enough information here to say which data structure you should use, because we don't know what else you're doing with it.

If you need to be able to quickly insert records into the data structure later one at a time, then you do need a tree-like data structure, which unfortunately doesn't have a standard implementation (or even a standard interface, for some operations) in Python.

If you only need to be able to do what you said--to sort existing data--then you can use lists. The sorting is quick, especially if parts of the data are already sorted, and you can use binary searching for fast lookups. However, inserting elements will be O(n) rather than the O(log n) you'll get with a tree.

Here's a simple example, converting the dicts to a list or tuples, sorting the combined results and using the bisect module to search for items.

Note that you can have duplicate keys, showing up in more than one dict. This is handled easily: they'll be sorted together naturally, and the bisection will give you a [start, end) range containing all of those keys.

If you want to add blocks of data later, append it to the end and re-sort the list; Python's sorting is good at that and it'll probably be much better than O(n log n).

This code assumes your keys are integers, as you said.

dataA = { 1: 'data1', 3: 'data3', 5: 'data5', 2: 'data2' }
dataB = { 2: 'more data2', 4: 'data4', 6: 'data6' }

combined_list = dataA.items() + dataB.items()
combined_list.sort()
print combined_list

import bisect
def get_range(data, value):
    lower_bound = bisect.bisect_left(data, (value, ))
    upper_bound = bisect.bisect_left(data, (value+1, ))
    return lower_bound, upper_bound

lower_bound, upper_bound = get_range(combined_list, 2)
print lower_bound, upper_bound
print combined_list[lower_bound:upper_bound]

Both the keys and values are integers, and I need to sort on the values (which still works in your solution). I am going to do a little comparison of your answer and Michael Dillion's. The problem I see with yours, is it does not merge the data properly. I want values to be added together when the keys are the same. — dicato, Feb 04 '11 at 02:11
@Locker537: No, this puts identical keys side-by-side, which makes merging them simple. — Glenn Maynard, Feb 08 '11 at 03:01
You are correct. I jumped to conclusions when I saw adding of .items. Thanks! — dicato, Feb 08 '11 at 03:35

score 0 · Answer 3 · answered Feb 03 '11 at 23:13

0

With that quantity of data, I would bite the bullet and use the built in sqlite module. Yes you give up some python flexibility and have to use SQL, but right now its sorting 65k values; next it will be find values that meet certain criteria. So instead of reinventing relational databases, just go the SQL route now.

answered Feb 03 '11 at 23:13

WombatPM

2,561
2
22
22

This is severe overdesign if you don't actually need it. – Glenn Maynard Feb 03 '11 at 23:15
But when you look at what he wants to do with lookups and updates with thousands of dictionaries, I think a relational approach will be more supportable option going forward. The next person who has to use his code will at least have a fighting chance of understanding what is going on. – WombatPM Feb 03 '11 at 23:21
I agree that there is too much overhead and overdesign with this approach. Unfortunately, I am also looking to use minimal extra modules. – dicato Feb 04 '11 at 02:04

Large sortable data structures? Dictionary or something else?

3 Answers3