In-place dictionary inversion in Python

Question

I need to invert a dictionary of lists, I don't know how to explain it in English exactly, so here is some code that does what I want. It just takes too much memory.

def invert(oldDict):
    invertedDict = {}
    for key,valuelist in oldDict.iteritems():
        for value in valuelist:
            try:
                entry = invertedDict[value]
                if key not in entry:
                    entry.append(key)
            except KeyError:
                invertedDict[value] = [key]
    return invertedDict

The original is a dict of lists, and the result is a dict of lists. This "inverts" it.

test = {}
test[1] = [1999,2000,2001]
test[2] = [440,441]
test[3] = [440,2000]

print invert(test)

This gives:

{2000: [1, 3], 2001: [1], 440: [2, 3], 441: [2], 1999: [1]}

I need to know if this can be done in-place, because my current strategy is exceeding the amount of physical memory on my machine with the dictionary I am working with. Can you think of a way to do it with generators?

I didn't know of shelve, thanks. I guess that neither the old or the new dictionaries need to be fully loaded to operate on them? — Nathan, Aug 05 '10 at 19:12
shelve only works with string keys. You can work around it though — John La Rooy, Aug 05 '10 at 19:14

John La Rooy · Accepted Answer · 2010-08-05T19:28:20.010

5

This doesn't do it in place, but consumes oldDict by using popitem()

from collections import defaultdict
def invert(oldDict):
    invertedDict = defaultdict(list)
    while oldDict:
        key, valuelist = oldDict.popitem()
        for value in valuelist:
            invertedDict[value].append(key)
    return invertedDict

I have a feeling that dict's are never resized unless the size increases, so you may need to add+remove a dummy item periodically. See Shrinkage rate

from collections import defaultdict
def invert(oldDict):
    invertedDict = defaultdict(list)
    i=0
    while oldDict:
        key, valuelist = oldDict.popitem()
        for value in valuelist:
            invertedDict[value].append(key)
        i+=1
        if i%1000==0: # allow the dict to release memory from time to time
            oldDict[None]=None
            del oldDict[None]
    return invertedDict

edited Aug 05 '10 at 19:28

answered Aug 05 '10 at 19:09

John La Rooy

295,403
53
369
502

Yeah, that's exactly what I wanted to suggest to. Remove objects from the old dictionary and this way you should keep the memory usage pretty constant (at least when garbage collection will take place). – gruszczy Aug 05 '10 at 19:10
Thats a clever way to force the dict to resize. – Nathan Aug 05 '10 at 19:28
I have this method running now, and it's looking good so far. – Nathan Aug 05 '10 at 19:30
And thank you for noticing that 'if key not in entry:' was unnecessary, that's a plus. – Nathan Aug 05 '10 at 19:36

score 2 · Answer 2 · answered Aug 05 '10 at 23:40

It probably takes many millions of entries to run out of RAM on modern machine if the algorithm is correct. Assuming this, you have to use some persistent storage for the data to process only chunk at a time. Why not use simple database table with 2 columns to store the dict?

Then you can use either column as a key by selecting with order by on needed column and grouping values from other column with simple python code.

I think I'll use shelve in the future, but for now, gnibbler's trick actually worked. — Nathan, Aug 06 '10 at 12:37

score 1 · Answer 3 · answered Aug 05 '10 at 19:11

I actually don't see any way the memory usage of your current algorithm could be significantly improved on. You do use iterators rather than creating new lists/dicts outright, so the only significant memory usage comes from the original dictionary and the new inverted dictionary.

If you don't have enough RAM to run this algorithm with the dictionary you're actually using, all I can think of is to somehow avoid keeping both the original dict and the inverted dict in memory at the same time. One way to do that would be to remove items from the original dict as you add them to the inverted dict, which could be done like this:

def invert(old_dict):
    inverted = collections.defaultdict(list)
    while old_dict:
        k,v = old_dict.popitem()
        for vi in v:
            inverted[vi].append(k)
    return inverted

(notice that I also used defaultdict to simplify the code, but if you really need a pure dict, not a subclass, you could do something like what you had originally with the try/except)

If you want to keep both the original and inverted dictionaries available after the algorithm completes, all I can think of is to store them in disk files, and find some way to only load a piece at a time. I don't know of any standard Python module that is able to store a dict to disk and load only a piece of it at a time, so you might have to write your own code for that.

score 0 · Answer 4 · answered Aug 05 '10 at 20:17

I don't have a direct answer. Here is some of my thought.

I think what you want to do can be called Inverted index
I don't believe it can be done in-place, nor do I think it is the right strategy. You should look at disk based solution. Perhaps sort or organized your original data structure, write it out to one or more files, then read it back and merge them into your final data structure.

In-place dictionary inversion in Python

4 Answers4

Linked

Related