2

I have a very large dictionary with entries of the form {(Tuple) : [int, int]}. For example, dict = {(1.0, 2.1):[2,3], (2.0, 3.1):[1,4],...} that cannot fit in memory.

I'm only interested in the top K values in this dictionary sorted by the first element in each key's value. If there a data structure that would allow me to keep only the largest K key-value pairs? As an example, I only want 3 values in my dictionary. I can put in the following key-value pairs; (1.0, 2.1):[2,3], (2.0, 3.1):[1,4], (3.1, 4.2):[8,0], (4.3, 4.1):[1,1] and my dictionary would be: (3.1, 4.2):[8,0], (1.0, 2.1):[2,3], (2.0, 3.1):[1,4] (in case of key-value pairs with the same first element, the second element will be checked and the largest key-value pair based on the second element will be kept)

Black
  • 4,483
  • 8
  • 38
  • 55
  • How did you create this dictionary? you want to do this at creating time or after creating the dictionary? – Mazdak Mar 04 '17 at 06:17
  • if you do not object to using `numpy` it has `partition` and `argpartition` which can find the top or bottom k in O(n). – Paul Panzer Mar 04 '17 at 06:19
  • Sorry, I should explain that I cannot keep my dictionary in memory. – Black Mar 04 '17 at 06:20
  • You'd just need the keys and arrays are more memory efficient than dicts. – Paul Panzer Mar 04 '17 at 06:24
  • @Kasramvd: I would like to do this at creating time. – Black Mar 04 '17 at 06:25
  • @PaulPanzer: Thanks for your suggestion. Could you please elaborate? I'm currently using a dictionary because there are some cases that keys will need to have their values updated. – Black Mar 04 '17 at 06:29
  • Hm, I'm not so sure anymore that efficient partitioning is your key problem. Anyway,if you could fit the keys in memory then you could either partition which would give you the top k keys directly (non sorted). And if your off-memory dict has efficient lookup you could just retrieve the matching values by key. If the latter is not possible then you could try and use `argpartition` which would give you the indices of the top k. You could then use these indices to recover both keys and values. It all depends also on your actual keys. If they are just two floats each they will map nicely to numpy – Paul Panzer Mar 04 '17 at 06:49
  • Why do I feel it's a bad idea to keep floating points as dictionary keys as they tend to lose precision.. – Abhishek J Mar 04 '17 at 06:59
  • Use [heapq.nlargest](https://docs.python.org/3/library/heapq.html#heapq.nlargest) with suitable `key` argument. You say that your dict does not fit in memory. How is it stored and how are you reading it? – Ilja Everilä Mar 04 '17 at 07:01
  • @IljaEverilä He's probably reading it from a very big file.. – Abhishek J Mar 04 '17 at 07:14
  • @IljaEverilä: I'm running a stochastic simulation that feeds/streams each key-value pair into a dictionary. These key-value pairs will sometimes have their values updated hence why I went with the `Dict` structure initially. – Black Mar 04 '17 at 07:20
  • Ah well, the updating part trips my suggestion. Thought that this was a single read. – Ilja Everilä Mar 04 '17 at 07:23
  • @abhishek-jebaraj Obviosly it could've been a file, but format matters when producing a [minimal, complete and verifiable example](http://stackoverflow.com/help/mcve). – Ilja Everilä Mar 04 '17 at 07:29

3 Answers3

0

If your data will not fit in memory, you need to be particularly mindful of how it's stored. Is it in a database, a flat file, a csv file, JSON, or what?

If it is in a "rectangular" file format, you might do well to simply use a standard *nix sorting utility, and then just read in the first k lines.

aghast
  • 14,785
  • 3
  • 24
  • 56
0
import heapq


class OnlyKDict(object):

    def __init__(self,K,key=lambda x:x):
        self.data = []
        self.dictionary = {}
        self.key=key         # Lambda function for the comparator
        self.K = K           # How many values to keep in dictionary

    def push(self,item):
        heapq.heappush(self.data,(self.key(item),item))
        self.dictionary[item[0]]=item[1]
        if len(self.data)>self.K:  #Size greater than k? pop minimum from heap and dict.
            item = self.pop()     #This ensure only k largest are there.
            self.dictionary.pop(item[0],None)

    def pop(self):
        return heapq.heappop(self.data)[1]

    def __getitem__(self,key):
        return self.dictionary[key]

    def __setitem__(self,key,value):
        if self.dictionary.has_key(key):
            self.dictionary[key] = value #If key present update value
        else:
            self.push((key,value))  ##Else push key and value as a tuple

h = OnlyKDict(8,lambda x:x[0][1] if x[0][1]==x[0][0] else x[0][0]) ##Compare 2nd value if both equal else compare 1st value only.

for i in xrange(10):
    h[(i,i)] = [i,i]

print h.dictionary

Output: {(5, 5): [5, 5], (6, 6): [6, 6], (4, 4): [4, 4], (7, 7): [7, 7], (9, 9): [9, 9], (8, 8): [8, 8], (2, 2): [2, 2], (3, 3): [3, 3]}

You can see how only the top 8 values are stored here.

Major stuff taken from heapq with custom compare predicate.

What we do is create our custom heap class which takes a key parameter where we specify on what value to sort.

The next is whenever this size is greater than 8 we pop the minimum item. This ensures we always have only the max 8 values.

Community
  • 1
  • 1
Abhishek J
  • 2,386
  • 2
  • 21
  • 22
  • Why not just use [`heapq.nlargest`](https://docs.python.org/3/library/heapq.html#heapq.nlargest) with `key=...`? – Ilja Everilä Mar 04 '17 at 06:54
  • No we only keep 8 values as that was the requirement.. Next he also wanted a dictionary to be returned. That's why the make_dict function.. – Abhishek J Mar 04 '17 at 06:56
  • Not just that he can't store all of the keys in memory as well.. That's why just the top 8.. He's mentioned that somewhere in the comments I guess.. – Abhishek J Mar 04 '17 at 07:03
  • The point of `nlargest` is exactly to take an (possibly huge, think file) iterable and to find the n largest items. Creating a dict of the results should be trivial. – Ilja Everilä Mar 04 '17 at 07:19
  • Could you elaborate on `nlargest`? From my perspective, the dictionary has to be entirely stored in memory then `nlargest` is called. – Black Mar 04 '17 at 07:23
  • @AbhishekJebaraj: I need to be able to update the values in the key-value pairs. It seems like I need to first stream in my tuples then make the dictionary. Would it be possible to update these key-value pairs before making the dictionary? – Black Mar 04 '17 at 07:32
  • But you would update only from the top k values then? – Abhishek J Mar 04 '17 at 07:33
  • It keeps a heap of n elements while reading an iterable, so it can process iterables larger than would fit in memory, but it's not useful in your case as you stream and update results. It's good for "find n largest items this one time from this huge iterable". – Ilja Everilä Mar 04 '17 at 07:35
  • Yep I'll update my answer for stream.. let him just answer if he'll update only top 8 values or any value.. – Abhishek J Mar 04 '17 at 07:36
  • @AbhishekJebaraj: Yes, I'll update only the 8 values. The reason for this is that after some threshold `K` the keys that aren't (probabilitistically) meaningful. – Black Mar 04 '17 at 07:41
  • Wow. Many thanks for this. I had to make a few changes to the code for my program but now works well – Black Mar 04 '17 at 09:38
0

Here is a customized OrderedDict which keeps the N largest keys for you :

from collections import OrderedDict
from operator import itemgetter


class LimitedSizeOrderedDict(OrderedDict):
    def __init__(self, *args, **kwds):
        self.maxlen = kwds.pop("maxlen", None)
        if args:
            try:
                top_n = sorted(*args, key=itemgetter(0, 0))[-self.maxlen:]
                self.min_key = top_n[0][0]
            except TypeError:
                raise Exception("keys should be in tuple format")
        else:
            self.min_key = (float("inf"), 0)
        super(LimitedSizeOrderedDict, self).__init__(top_n, **kwds)

    def __setitem__(self, key, value):
        if self._check_size():
            OrderedDict.__setitem__(self, key, value)
            if key[0] < self.min_key[0]:
                self.min_key = key
        elif key[0] > self.min_key[0]:
            self.pop(self.min_key)
            OrderedDict.__setitem__(self, key, value)
            self.min_key = min(self, key=itemgetter(0))

    def _check_size(self):
        if self.maxlen is not None:
            if len(self) < self.maxlen:
                return True
            return False
        return True

Demo:

In [2]: a = LimitedSizeOrderedDict([((7,2),3), ((2, 5), 3), ((6, 0), 1)], maxlen= 2)

In [3]: a
Out[3]: LimitedSizeOrderedDict([((6, 0), 1), ((7, 2), 3)])

In [4]: a[(12, 5)] = 10

In [5]: a
Out[5]: LimitedSizeOrderedDict([((7, 2), 3), ((12, 5), 10)])

In [6]: a[(10, 5)] = 9

In [7]: a
Out[7]: LimitedSizeOrderedDict([((12, 5), 10), ((10, 5), 9)])

In [8]: a[(0, 5)] = 9

In [9]: a
Out[9]: LimitedSizeOrderedDict([((12, 5), 10), ((10, 5), 9)])
Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • Does `top_n = sorted(args, itemgetter(0))[:self.maxlen]` mean I have to read in all my data? – Black Mar 04 '17 at 07:34
  • @Black No, it will return the top N items at the initialization time if you've passed any item to your dictionary at creation time. – Mazdak Mar 04 '17 at 08:40
  • @Black Checkout the update for a more comprehensive answer. – Mazdak Mar 04 '17 at 09:57