
I have data arriving as dictionaries of lists. In fact, I read in a list of them...

data = [
    {
        'key1': [101, 102, 103],
        'key2': [201, 202, 203],
        'key3': [301, 302, 303],
    },
    {
        'key2': [204],
        'key3': [304, 305],
        'key4': [404, 405, 406],
    },
    {
        'key1': [107, 108],
        'key4': [407],
    },
]

Each dictionary can have different keys.

Each key associates to a list, of variable length.

What I'd like to do is to make a single dictionary, by concatenating the lists that share a key...

desired_result = {
    'key1': [101, 102, 103, 107, 108],
    'key2': [201, 202, 203, 204],
    'key3': [301, 302, 303, 304, 305],
    'key4': [404, 405, 406, 407],
}

Notes (a rough fake-data generator matching these sizes is sketched just after this list):

  • Order does not matter
  • There are hundreds of dictionaries
  • There are dozens of keys per dictionary
  • Totalling hundreds of keys in the result set
  • Each source list contains dozens of items
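
Since I don't have the real data to hand, here is a throwaway generator that produces input of roughly that shape; every size and value range below is an illustrative assumption, not the real-world profile:

import random

# Fake data at roughly the scale described in the notes above.
KEY_POOL = [f'key{i}' for i in range(300)]   # hundreds of distinct keys overall

fake_data = [
    {
        key: [random.randrange(1_000) for _ in range(random.randint(12, 48))]
        for key in random.sample(KEY_POOL, random.randint(12, 48))
    }
    for _ in range(400)                      # hundreds of dictionaries
]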

I can do this with comprehensions, but it feels very clunky, and it's actually very slow (looping through all the possible keys for every dictionary yields more 'misses' than 'hits')...

{
    key: [
        item
        for d in data
        if key in d
        for item in d[key]
    ]
    for key in set(
        key
        for d in data
        for key in d.keys()
    )
}

# timeit gives ~3.2 µs per loop, for this small data set
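
A marginally tidier spelling of the same idea, sketched here only for comparison (dict.get drops the membership test, but it still scans every dictionary once per key):

all_keys = {key for d in data for key in d}
result = {key: [item for d in data for item in d.get(key, [])] for key in all_keys}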

A shorter, easier-to-read/maintain option is just to loop through everything. But performance still sucks (possibly due to the large number of calls to extend(), forcing frequent reallocation of memory as the over-provisioned lists fill up?)...

from collections import defaultdict
result = defaultdict(list)
for d in data:
  for key, val in d.items():
    result[key].extend(val)

# timeit gives ~1.7 µs per loop, for this small data set

Is there a better way?

  • more 'pythonic'?
  • more concise?
  • more performant?

Alternatively, is there a more applicable data structure for this type of process?

  • I'm sort of making a hash map
  • Where each entry is guaranteed to have multiple collisions and so always be a list

Edit: Timing for small data set added

  • No timings for real world data, as I don't have access to it from here (err, ooops/sorry...)
MatBailie
  • "But performance still sucks..." Please give timing results and indicate how far away they are from your expectations. – Karl Knechtel Nov 20 '21 at 14:41
  • What are the actual performances of your two versions? – qouify Nov 20 '21 at 14:42
  • Can there be duplicates between result lists for the same key? Can there be duplicates between result lists for different keys? Should duplicates be preserved when merging? – Karl Knechtel Nov 20 '21 at 14:44
  • What to do in case a value appears in more than one list, or is that impossible? – guidot Nov 20 '21 at 14:45
  • @guidot - They're lists, not sets, `[1, 2] + [2, 3]` should yield `[1, 2, 2, 3]` – MatBailie Nov 20 '21 at 14:46
  • @KarlKnechtel - Timings added for small dataset, I don't have access to the real world (large) data set here *(will do on Monday, oops/sorry)*. – MatBailie Nov 20 '21 at 14:54
  • @qouify - Timings added for small dataset, I don't have access to the real world (large) data set here *(will do on Monday, oops/sorry)*. – MatBailie Nov 20 '21 at 14:54
  • Just a simple idea: keep your for loop but replace `extend` with `append` and then using a second loop, merge all the lists of your dictionary with `itertools.chain.from_iterable`. Maybe it's more efficient than doing many calls to `extend`. – qouify Nov 20 '21 at 14:59
  • @qouify - For the small sample set it takes longer, but when I get my hands on the real data I expect the profiling to be completely different. Nice idea, thanks. – MatBailie Nov 20 '21 at 15:12
  • So that's 0.0000017 seconds and that "sucks"? Do I understand that correctly? – Kelly Bundy Nov 20 '21 at 18:00
  • @KellyBundy No, you don't understand correctly. See my final sentence. I acknowledge that I don't have representative timings for real world data, as I'm at home, not at work, and exploring this in my leisure time, and made the mistake of not having access to a representative profile. – MatBailie Nov 20 '21 at 19:18
  • @MatBailie Then how do you know that it sucks? – Kelly Bundy Nov 20 '21 at 19:30
  • @KellyBundy Because I ***did*** run it at work, at the end of a Friday. And then, God forgive me, I went home without taking notes or making plans to think about this on a Saturday. Hell, I know I should be punished, maybe even have my SO account frozen for eternity, but here we are. – MatBailie Nov 20 '21 at 19:35
  • Do you remember roughly how much it sucked? – Kelly Bundy Nov 20 '21 at 19:37
  • @KellyBundy It took longer to combine the dictionaries than to parse a jagged csv-like text file (the source data) from disk. That's all I recall. – MatBailie Nov 20 '21 at 19:41
  • I would've guessed it's faster than such parsing, though I guess it depends on how you do that. Anyway, I just did a test with "large" data as you described (hundreds of dozens of dozens) and your combining code takes about 0.07 seconds. Do you do it very often, or why is that a problem? – Kelly Bundy Nov 20 '21 at 19:52
  • @kellybundy It's not a 'problem'. It's just my sense of intrigue prompted by a surprising observation (surely this can't Ever be slower than reading from disk?). My main agenda is to learn more, understand if I was doing something naively stupid, etc, not to solve a burning business issue. (I use python to orchestrate other microservices; it's rare that I care about raw throughput in python, but I was surprised enough to be interested in alternatives.) – MatBailie Nov 20 '21 at 19:57
  • Ok not "problem" but "sucks". I think it would be good to mention in the question that "sucks" means this comparison to parsing from disk. And of course ideally we'd have the parsing code and realistic data to test with :-). I'm not at all convinced it really is slower than the parsing. – Kelly Bundy Nov 20 '21 at 20:21
  • I tried merely *copying* my larger data with list/dict comprehensions (`[{k: [i for i in v] for k, v in d.items()} for d in data]`) and even that was slower (0.12 seconds vs the 0.07 seconds for combining). Really curious how you're parsing the data so that that's faster. – Kelly Bundy Nov 20 '21 at 21:14

3 Answers

Use `dict.setdefault` - it's the same loop as your `defaultdict` version, but with a plain dict and no import needed:
res = {}
for d in data:
    for k, v in d.items():
        # Adds `k` to `res` with an empty list, if `k` is not there yet.
        # Extends the list under `k` with the new portion of data contained in `v`.
        res.setdefault(k, []).extend(v)

Result:

{'key1': [101, 102, 103, 107, 108],
 'key2': [201, 202, 203, 204],
 'key3': [301, 302, 303, 304, 305],
 'key4': [404, 405, 406, 407]}

UPDATE: I would have liked to avoid comparing the performance of setdefault and defaultdict, but there was a discussion in the comments, so I ran some tests on this particular data, using Python 3.10.

TL;DR: setdefault is about 13% faster than defaultdict for this particular case.

The testing result:

defaultdict [1.633565943, 1.590108738, 1.6549000220000005, 1.622328843, 1.6121867709999993]
setdefault [1.4336988549999994, 1.4056579070000002, 1.4107502079999996, 1.408643755, 1.433823878]

The code of the test:

import timeit
from collections import defaultdict

data = [
    {
        'key1': [101, 102, 103],
        'key2': [201, 202, 203],
        'key3': [301, 302, 303],
    },
    {
        'key2': [204],
        'key3': [304, 305],
        'key4': [404, 405, 406],
    },
    {
        'key1': [107, 108],
        'key4': [407],
    },
]

def setdefault():
    res = {}
    for d in data:
        for k, v in d.items():
            res.setdefault(k, []).extend(v)

def default():
    res = defaultdict(list)
    for d in data:
        for k, v in d.items():
            res[k].extend(v)

if __name__ == '__main__':
    print('defaultdict', timeit.repeat(stmt=default, repeat=5, number=1000000, globals={'data': data}))
    print('setdefault', timeit.repeat(stmt=setdefault, repeat=5, number=1000000, globals={'data': data}))
mrvol
  • Is there a reason you didn't go for `defaultdict`? – Julian Camilleri Nov 20 '21 at 19:02
  • Sure. You have to import it and it is slower: https://stackoverflow.com/a/58066584/1981384 – mrvol Nov 20 '21 at 19:14
  • @JulianCamilleri I agree with mrvol, investigating a preceding answer already demonstrated that `setdefault()` is marginally faster than using `defaultdict()`. – MatBailie Nov 20 '21 at 19:22
  • @mrvol Code-only answers are discouraged on SO. Please at least add a line (to the answer, not comments) stating that the only difference to the OP is the use of setdefault, as it has lower overheads (or something similar). This is for future readers (if there are any). – MatBailie Nov 20 '21 at 19:23
  • @mrvol That answer you link to does **not** show that it's slower. It does an improper comparison, as pointed out in the comments under it. – Kelly Bundy Nov 20 '21 at 19:27
  • @KellyBundy why not? It shows that. But anyway, here is another one - https://stackoverflow.com/questions/19629682/ordereddict-vs-defaultdict-vs-dict – mrvol Nov 20 '21 at 19:47
  • @mrvol Like I and the comments under it said: It does an improper comparison. It optimized the `setdefault` solution and compared it to a `defaultdict` solution **without** the same optimization. – Kelly Bundy Nov 20 '21 at 19:56
  • @KellyBundy Thank you for highlighting that. What about the new link above? – mrvol Nov 20 '21 at 20:00
  • @mrvol That's a somewhat long page. Where shall I look there? The top answer concludes "The only truth: **It Depends!**" (though I only skimmed it). – Kelly Bundy Nov 20 '21 at 20:09
  • @mrvol Also, they did that 8.5 years ago, with Python 3.3. Speeds might've changed in the meantime. – Kelly Bundy Nov 20 '21 at 20:38
  • I upvoted the answer - I was just curious as to why you opted for `setdefault`; good choice! I like your solution ((: – Julian Camilleri Nov 21 '21 at 06:41
    I appreciate the benchmark with the clear "for that particular case" :-). For my own [larger fake data](https://tio.run/##jVLdasMgFL73Kc5d4gihXSmMQJ4kBHHxZJVGDcaUdqXPnmlN2lDGmDf@nO/Po/3FHYzeffR2mqTqjXXgpELpSGuNAsu18NNcGbjqO4yVxnQdNk4aPSxlgS0fOydk4wg5jFpYFAOUsIc32G42RJhv1I@Dd3LES9hVbeJXV3lLoDUWJEgdfL8wXTRoTU68C9h47sU2lAjueKBfPbuYs6UBl0F0ogSWEYQ9LEjPwGD@AN7IGsh@TUD89WBAN98ypcWdZDHkukaFwBaBHbIVL/YZnO6lXDpUw8JfhtfJV@oeXdU0x7NDLdITjfZLh5sX@1Xn004Ojv6V5n9hqmO9dpctMKa5QsagLCFhTHGpGUuKh1U76iZoVs@U2aphtX/03dOmt1K7NHDyRdij/TdCkcYfmFvskUdMBnFT7jPQo/pEW24ppdP0Aw), `defaultdict` seems slightly faster. – Kelly Bundy Nov 21 '21 at 15:30

I have a solution that seems to achieve good results when the lists in your dictionaries are long. Although that is not the case in your situation, I still mention it.

The idea, as mentioned in my comments, is to use append instead of extend in a first loop and then concatenate all the lists in the resulting dictionary using itertools.chain.from_iterable. This is the function chain defined below. I also added @mrvol's code to my answer for comparison.

Here is the code:

import timeit
import itertools
from collections import defaultdict

REPEAT = 5    #  parameter repeat of timeit
NUMBER = 100  #  parameter number of timeit
DICTS = 100   #  number of dictionaries in our data
KEYS = 12     #  size of the dictionaries in our data
LEN = 1_000   #  size of the lists in our dictionaries

data = [
    {
        f'key{x}': [x * 100 + y for y in range(LEN)]
        for x in range(KEYS)
    }
    for _ in range(DICTS)
]


def setdefault():
    res = {}
    for d in data:
        for k, v in d.items():
            res.setdefault(k, []).extend(v)
    return res


def default():
    res = defaultdict(list)
    for d in data:
        for k, v in d.items():
            res[k].extend(v)
    return res


def chain():
    res = dict()
    for d in data:
        for k, v in d.items():
            res.setdefault(k, []).append(v)
    for key in res:
        res[key] = list(itertools.chain.from_iterable(res[key]))
    return res


#  check that all produce the same result
assert chain() == default()
assert setdefault() == default()


if __name__ == '__main__':
    for name, fun in [
            ('default', default),
            ('setdefault', setdefault),
            ('chain', chain)
    ]:
        print(name, timeit.repeat(
            stmt=fun, repeat=REPEAT, number=NUMBER,
            globals={'data': data}
        ))

All tests below were made with Python 3.10.

Here are the results:

default [3.0608591459999843, 2.9533347530000356, 3.204700414999934, 2.934139603999938, 2.854463246000023]
setdefault [2.7814459759999863, 2.801596405000055, 2.796927817000096, 2.797430740999971, 2.795393482999998]
chain [2.336767712999972, 2.33148793700002, 2.3378432869999415, 2.3322470529999464, 2.3312841169999956]

If we increase LEN to 10_000 the difference is more impressive:

default [33.63351462200012, 33.598145768999984, 33.83524595699987, 33.732721158000004, 33.785992579999856]
setdefault [33.658237180000015, 33.51113319399997, 33.25321677000011, 33.23780467200004, 33.467723277999994]
chain [23.47564513400016, 23.542697918999693, 23.520614959999875, 23.498439506000068, 23.582990831999723]

But with LEN=100, the chain function is slightly slower:

default [0.20926385200004916, 0.23037391399998342, 0.21281876400007604, 0.21195233899993582, 0.21580142600009822]
setdefault [0.22843905199988512, 0.2232434430000012, 0.2187928880000527, 0.22453147500004889, 0.21708852799997658]
chain [0.24585279899997659, 0.23280389700016713, 0.2262972040000477, 0.24113659099998586, 0.2370573980001609]

So, once again, chain should not fit your needs as your lists tend to be small with dozens of items, but I mention this solution for the sake of completeness.

qouify
  • Running your whole code as-is on tio.run, I get different results: `default` and `setdefault` equally fast, and yours about 14% *slower*. What Python version are you using? – Kelly Bundy Nov 22 '21 at 16:10
  • @KellyBundy I'm using 3.9.2. You're right, it's slower on tio.run. What should I do? Shall I remove my answer if the results cannot be reproduced? – qouify Nov 22 '21 at 16:20
  • I'll try on 3.9.2 as well. I wouldn't remove it, no. If it's reproducibly faster for you, then that's something. Might be for others, too. And in any case, it's at least interesting. I'd use test data like the OP described, though, with their "hundreds/dozens". For example hundreds of keys, not the same ten for every dict. I've done that [here](https://stackoverflow.com/questions/70046832/collapse-a-list-of-dictionary-of-list-to-a-single-dictionary-of-list/70068370#comment123839237_70048988). – Kelly Bundy Nov 22 '21 at 16:25
  • On a GCE instance with 3.9.2 I'm getting about 1.59 for `default`, 1.60 for `setdefault`, and 1.53 for `chain`. I btw put a `for _ in range(3):` loop around it to make sure `chain` isn't the fastest because it's the last and the CPU is most warmed up or so. Results remained the same. – Kelly Bundy Nov 22 '21 at 16:40
  • Ooh, ouch. I just tried it with that larger fake data of my own, and results changed dramatically: The fastest from each of the three rounds: `default`: 3.59, 3.54, 3.34, `setdefault`: 3.65, 3.66, 3.31, `chain`: 7.68, 7.50, 6.58. (Don't know why the third round was so much faster for all solutions, I've run it again and the same thing happened again). – Kelly Bundy Nov 22 '21 at 16:50
  • Btw you're not using your `KEYS`. – Kelly Bundy Nov 22 '21 at 16:51
  • @KellyBundy Arggg! Thanks for pointing that out. I'll retry tomorrow and update the output. Thanks also for providing results with the other data set. – qouify Nov 22 '21 at 18:55
  • @KellyBundy I fixed the code and relaunched tests on my personal computer (with python 3.10). – qouify Nov 23 '21 at 06:42

Perhaps you could use list comprehension and pandas.

No idea if this is a valid answer, or how it performs, but it works for the small set of example data anyway.

import pandas as pd

data = [
    {
        "key1": [101, 102, 103],
        "key2": [201, 202, 203],
        "key3": [301, 302, 303],
    },
    {
        "key2": [204],
        "key3": [304, 305],
        "key4": [404, 405, 406],
    },
    {
        "key1": [107, 108],
        "key4": [407],
    },
]

dics = [pd.DataFrame.from_dict(el, orient="index") for el in data]

dics_concat = pd.concat(dics).fillna("Empty")
dics_concat["key"] = dics_concat.index
dics_concat = dics_concat.groupby("key").agg(list)

combined = dics_concat.apply(
    lambda row: sorted(
        [item for sublist in row for item in sublist if item != "Empty"]
    ),
    axis=1,
)

print(combined.to_dict())
{'key1': [101, 102.0, 103.0, 107, 108.0], 
 'key2': [201, 202.0, 203.0, 204], 
 'key3': [301, 302.0, 303.0, 304, 305.0], 
 'key4': [404, 405.0, 406.0, 407]}

Without pd.concat (note that the floats in the output above appear because columns containing missing values are upcast to float64 before fillna replaces the NaNs):

# Missing cells become "" so iterating over them yields nothing.
df = pd.DataFrame(data).fillna("")

# With axis=0, each column (one per key) is passed to the function.
combined = df.apply(
    lambda col: sorted(
        [item for sublist in col for item in sublist]
    ),
    axis=0,
)

print(combined.to_dict())
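
For completeness, here is a sketch of another pandas route (an idea of mine, not benchmarked) that avoids NaN entirely, so the values stay plain Python ints:

import pandas as pd

# One Series per input dict: index = the keys, values = the original lists.
s = pd.concat([pd.Series(d) for d in data])

# Group the duplicated index labels and flatten each group's lists.
combined = {key: [item for lst in group for item in lst]
            for key, group in s.groupby(level=0)}

print(combined)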
norie
  • Significantly slower, for a variety of reasons, but an interesting thought. Something for me to play with... – MatBailie Nov 20 '21 at 15:21
  • Beside the point, but `all` is a bad variable name since it [shadows](https://en.wikipedia.org/wiki/Variable_shadowing) the [builtin `all()`](https://docs.python.org/3/library/functions.html#all). Use a more descriptive name, or at least something like `all_`. – wjandrea Nov 20 '21 at 19:56