Efficient way to develop a dictionary for reverse lookup?

Question

Let's say I have a dictionary with the following contents:

old_dict = {'a':[0,1,2], 'b':[1,2,3]}

and I want to obtain a new dictionary where the keys are the values in the old dictionary, and the new values are the keys from the old dictionary, i.e.:

new_dict = {0:['a'], 1:['a','b'], 2:['a','b'], 3:['b']}

To perform this task, I'm currently using the following example code:

import numpy as np

# get all the keys for the new dictionary
new_keys = np.unique(np.hstack([old_dict[key] for key in old_dict]))

# initialize new dictionary
new_dict = {key: [] for key in new_keys}
# step through every new key
for new_key in new_keys:
    # step through every old key and check if the new key is in its list of values
    for old_key in old_dict:
        if new_key in old_dict[old_key]:
            new_dict[new_key].append(old_key)

In this example I'm showing 2 old keys and 4 new keys, but in my actual problem I have ~10,000 old keys and ~100,000 new keys. Is there a smarter way to perform this task, maybe with some tree-based algorithm? I used dictionaries because they make the problem easier for me to visualize, but dictionaries aren't necessary if there are better data types for this task.

In the meantime, I'm reading documentation on reverse lookup for dictionaries and experimenting with sindex from geopandas.

Andrej Kesely · Accepted Answer · 2023-03-22T20:21:17.350

You can try:

old_dict = {'a':[0,1,2], 'b':[1,2,3]}

new_dict = {}
for k, v in old_dict.items():
    for i in v:
        new_dict.setdefault(i, []).append(k)

print(new_dict)

Prints:

{0: ['a'], 1: ['a', 'b'], 2: ['a', 'b'], 3: ['b']}
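
The same single-pass idea can also be written with collections.defaultdict. This is only a sketch of an equivalent variant, not a different algorithm; the function name invert is made up for illustration.

```python
from collections import defaultdict

old_dict = {'a': [0, 1, 2], 'b': [1, 2, 3]}


def invert(mapping):
    """Build a value -> [keys] index in one pass over all (key, value) pairs.

    `invert` is a hypothetical helper name, not part of any library.
    """
    inverted = defaultdict(list)
    for key, values in mapping.items():
        for value in values:
            # a missing value gets an empty list automatically
            inverted[value].append(key)
    # convert back to a plain dict so later lookups don't create entries
    return dict(inverted)


new_dict = invert(old_dict)
print(new_dict)              # {0: ['a'], 1: ['a', 'b'], 2: ['a', 'b'], 3: ['b']}
print(new_dict.get(1, []))   # ['a', 'b']
print(new_dict.get(99, []))  # []
```

Note that if a value can appear more than once in the same list, this appends the key once per occurrence; defaultdict(set) with .add(key) would avoid the duplicates.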

Benchmark:

import numpy as np
from timeit import timeit

old_dict = {'a':[0,1,2], 'b':[1,2,3]}


def f1():
    new_dict = {}
    for k, v in old_dict.items():
        for i in v:
            new_dict.setdefault(i, []).append(k)
    return new_dict

def f2():
    # get all the keys for the new dictionary
    new_keys = np.unique(np.hstack([old_dict[key] for key in old_dict]))

    # initialize new dictionary
    new_dict = {key: [] for key in new_keys}
    # step through every new key
    for new_key in new_keys:
        # step through every old key and check if the new key is in its list of values
        for old_key in old_dict:
            if new_key in old_dict[old_key]:
                new_dict[new_key].append(old_key)
    return new_dict


t1 = timeit('f1()', number=1000, globals=globals())
t2 = timeit('f2()', number=1000, globals=globals())

print(t1)
print(t2)

Prints:

0.0005186359921935946
0.009738252992974594

With old_dict initialized as follows (the dict now has 10648 items):

from itertools import product
from random import randint

k = 'abcdefghijkloprstuvwyz'
old_dict = {''.join(c): list(range(randint(1, 3), randint(4, 10))) for c in product(k, k, k)}
print(len(old_dict))

Prints:

10648

Benchmark timings with this dict (f1, then f2):

3.126827526008128
19.222182962010265
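
To time a single pass over the large dict rather than many repeats, something like the following sketch could be used. It is pure Python (no numpy); the fixed seed and the names big_dict and invert_single_pass are assumptions added for illustration, and no timing is quoted because it depends on the machine.

```python
from itertools import product
from random import randint, seed
from timeit import timeit

seed(0)  # assumption: fixed seed so repeated runs build the same input

k = 'abcdefghijkloprstuvwyz'
# 22**3 = 10648 keys, each mapped to a short list of consecutive integers
big_dict = {''.join(c): list(range(randint(1, 3), randint(4, 10)))
            for c in product(k, k, k)}


def invert_single_pass(mapping):
    """Same logic as f1: visit every (key, value) pair exactly once."""
    new_dict = {}
    for key, values in mapping.items():
        for value in values:
            new_dict.setdefault(value, []).append(key)
    return new_dict


# number=1 measures one full inversion of the large dict
elapsed = timeit(lambda: invert_single_pass(big_dict), number=1)
print(len(big_dict), elapsed)
```

The work here grows with the total number of (key, value) pairs, whereas the nested-loop version checks every old list once per new key.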

The timing might be misleading, as the constant overhead might be different. For best comparison, it should be done with a larger input dict, rather than 1000 times with a small dict. — user_na, Mar 22 '23 at 20:17
