
Given three dictionaries as follows {key: index}:

d1 = {'a':0, 'b':1,'c':2, 'd':3, 'e':4, 'f':5, 'g':6, 'h':7, 'i':8, 'j':9, 'k':10, 'l':11, 'm':12, 'n':13, 'o':14}
d2 = {'k':0, 'l':1,'m':2, 'n':3, 'o':4, 'p':5, 'q':6, 'r':7, 's':8, 't':9, 'u':10, 'v':11, 'w':12}
d3 = {'a':0, 'b':1,'c':2, 'd':3, 't': 4, 'u': 5, 'v': 6, 'w': 7, 'x': 8, 'y': 9, 'z': 10}

I would like to merge them into one dictionary using their unique keys, reindexing the values.

Right now, I have the following working but inefficient solution for such small dicts:

%timeit -r 10 -n 10000 d_merged = {key: idx for idx, key in enumerate( sorted(list(set([k for d in [d1, d2, d3] for k in d.keys()]))) ) }
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25}

# The slowest run took 4.36 times longer than the fastest. This could mean that an intermediate result is being cached.
# 34.6 µs ± 16.1 µs per loop (mean ± std. dev. of 10 runs, 10000 loops each)

Is there a better, more efficient approach for large dictionaries (more than 500K elements each), ideally without explicit for loops?

jonrsharpe
Farid Alijani
  • What is the logic here? – mozway Aug 31 '23 at 08:58
  • Also how many dictionaries do you have in real-life? – mozway Aug 31 '23 at 09:09
  • What would the output look like if in one dictionary you have 'a':0 and in another 'a':1 ? How do you plan to "merge" that? – DarkKnight Aug 31 '23 at 09:10
  • @DarkKnight this is already the case, check for instance `t` or `u`, the values don't seem to matter here (this is also the first thing I checked :p). – mozway Aug 31 '23 at 09:11
  • @mozway If the values don't matter then I really don't understand the objective – DarkKnight Aug 31 '23 at 09:14
  • @mozway I have 720 dictionaries each of which contains 500K elements `{key: value}` – Farid Alijani Aug 31 '23 at 09:17
  • Do you know all the possible values of the keys in advance? Also, what are your current timings on the real data and what is the target? It shouldn't be too slow, 500K is not that much. – mozway Aug 31 '23 at 09:17
  • @mozway each dict is a vocabulary (Bag of Words) of the form `{'word': index}`, which I'd like to merge in order to create one total vocabulary, which is my `d_merged` in this question! For the total merged vocabulary, the indices must be reindexed to run `0 ... nWords-1`, i.e. `{"word_1": 0, ..., "word_n": n-1}` if zero-indexed – Farid Alijani Aug 31 '23 at 09:21
  • How many different keys are there overall, i.e., how large is the result dictionary? – Kelly Bundy Aug 31 '23 at 13:52
  • Are the input dicts sorted and must the output dict be sorted? – Kelly Bundy Aug 31 '23 at 13:54
  • @KellyBundy the keys are all sorted; each `dict` is a Bag of Words of the form `{'word': index}` – Farid Alijani Sep 01 '23 at 07:03

3 Answers


It looks like you just want the unique keys in sorted order, enumerated. Using `set().union` to merge the keys:

out = {k: i for i, k in enumerate(sorted(set().union(*(set(d) for d in [d1, d2, d3]))))}

This is ~20% faster than the original code.

Output:

{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7,
 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14,
 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21,
 'w': 22, 'x': 23, 'y': 24, 'z': 25}
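The same union can also be spelled with the `|` operator on `dict` key views (Python 3). A minor stylistic variant of the above, shown here with abridged dicts for brevity; not benchmarked:

```python
# Union via the | operator on dict key views; behaves the same as
# set().union over the keys, then sort and enumerate as before.
d1 = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
d2 = {'c': 0, 'd': 1, 'e': 2, 'f': 3}
d3 = {'a': 0, 'f': 1, 'g': 2}

out = {k: i for i, k in enumerate(sorted(d1.keys() | d2.keys() | d3.keys()))}
print(out)  # {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6}
```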

If your dictionary keys are already sorted, you could merge them with heapq.merge:

from heapq import merge

dicts = [d1, d2, d3]

out = {k: i for i, k in enumerate(dict.fromkeys(merge(*dicts)))}

# or, following suggestions from @KellyBundy
from itertools import count
out = dict(zip(dict.fromkeys(merge(*dicts)), count()))
mozway

Acknowledging I might be doing something wrong with my microbenchmark.

Using itertools.chain.from_iterable might be slightly faster than @mozway's approach:

from itertools import chain, count
import timeit

def original_method():
    return {key: idx for idx, key in enumerate(sorted(list(set([k for d in [d1, d2, d3] for k in d.keys()]))))}

def mozway_method():
    return {k: i for i, k in enumerate(sorted(set().union(*(set(d) for d in [d1, d2, d3]))))}

def mrschneemann_method():
    return {key: idx for idx, key in zip(count(), sorted(set(key for d in [d1, d2, d3] for key in d.keys())))}

def sashsinha_method():
    return {k: i for i, k in enumerate(sorted(set(chain.from_iterable([d.keys() for d in [d1, d2, d3]]))))}

# Sample dictionaries for timing
d1 = {'a':0, 'b':1, 'c':2, 'd':3, 'e':4, 'f':5, 'g':6, 'h':7, 'i':8, 'j':9, 'k':10, 'l':11, 'm':12, 'n':13, 'o':14}
d2 = {'k':0, 'l':1, 'm':2, 'n':3, 'o':4, 'p':5, 'q':6, 'r':7, 's':8, 't':9, 'u':10, 'v':11, 'w':12}
d3 = {'a':0, 'b':1, 'c':2, 'd':3, 't': 4, 'u': 5, 'v': 6, 'w': 7, 'x': 8, 'y': 9, 'z': 10}

# Time the methods
time_original = timeit.timeit(original_method, number=100000)
time_mozway = timeit.timeit(mozway_method, number=100000)
time_mrschneemann = timeit.timeit(mrschneemann_method, number=100000)
time_sashsinha = timeit.timeit(sashsinha_method, number=100000)
print(f'{time_original = }')
print(f'{time_mozway = }')
print(f'{time_mrschneemann = }')
print(f'{time_sashsinha = }')

Times on my M2 MacBook Air:

time_original = 0.545821041
time_mozway = 0.5003886249999999
time_mrschneemann = 0.6545290000000001
time_sashsinha = 0.480380166
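Since the question mentions dicts with 500K entries, here is a rough extension of the benchmark above to larger synthetic inputs. The sizes, seed, and key shapes are illustrative assumptions, not measurements from this answer:

```python
# Compare the two set-building styles on larger synthetic dicts.
import random
import string
import timeit
from itertools import chain

random.seed(0)
# Pool of random 8-letter "words"; dedup via a set, then sample from it.
pool = list({''.join(random.choices(string.ascii_lowercase, k=8))
             for _ in range(20_000)})
dicts = [{w: i for i, w in enumerate(random.sample(pool, 8_000))}
         for _ in range(3)]

def via_union():
    # set().union accepts any iterables, including dicts (iterated by key)
    return {k: i for i, k in enumerate(sorted(set().union(*dicts)))}

def via_chain():
    return {k: i for i, k in enumerate(sorted(set(chain.from_iterable(dicts))))}

assert via_union() == via_chain()
print(timeit.timeit(via_union, number=10))
print(timeit.timeit(via_chain, number=10))
```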

Somewhat related: Generator expressions vs. list comprehensions

Sash Sinha

If your input dicts are sorted:

from heapq import merge
from itertools import groupby, count
from operator import itemgetter

d1 = {'a':0, 'b':1,'c':2, 'd':3, 'e':4, 'f':5, 'g':6, 'h':7, 'i':8, 'j':9, 'k':10, 'l':11, 'm':12, 'n':13, 'o':14}
d2 = {'k':0, 'l':1,'m':2, 'n':3, 'o':4, 'p':5, 'q':6, 'r':7, 's':8, 't':9, 'u':10, 'v':11, 'w':12}
d3 = {'a':0, 'b':1,'c':2, 'd':3, 't': 4, 'u': 5, 'v': 6, 'w': 7, 'x': 8, 'y': 9, 'z': 10}
ds = d1, d2, d3

keys = map(itemgetter(0), groupby(merge(*ds)))
result = dict(zip(keys, count()))

print(result)

Attempt This Online!
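As a quick sanity check (my addition, not part of the answer): with the question's sorted inputs, the merge/groupby result should match the plain sorted-set construction. Values are set to 0 here since they don't matter:

```python
from heapq import merge
from itertools import groupby, count
from operator import itemgetter

# Same sorted key sequences as the question's d1, d2, d3.
d1 = dict.fromkeys('abcdefghijklmno', 0)
d2 = dict.fromkeys('klmnopqrstuvw', 0)
d3 = dict.fromkeys('abcdtuvwxyz', 0)
ds = d1, d2, d3

# groupby deduplicates consecutive equal keys from the sorted merge.
result = dict(zip(map(itemgetter(0), groupby(merge(*ds))), count()))
baseline = {k: i for i, k in enumerate(sorted(set().union(*ds)))}
assert result == baseline  # both map 'a'..'z' to 0..25
```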

Kelly Bundy
  • Thanks, `%timeit -r 10 -n 10000 result = dict(zip(map(itemgetter(0), groupby(merge(*ds))), count()))` returns `58.4 µs ± 22.9 µs per loop (mean ± std. dev. of 10 runs, 10000 loops each)` for this sample example :) – Farid Alijani Sep 01 '23 at 07:07
  • @FaridAlijani Why do you keep timing the irrelevant small example? That's misleading. – Kelly Bundy Sep 01 '23 at 09:35