I have data arriving as dictionaries of lists. In fact, I read in a list of them...
data = [
    {
        'key1': [101, 102, 103],
        'key2': [201, 202, 203],
        'key3': [301, 302, 303],
    },
    {
        'key2': [204],
        'key3': [304, 305],
        'key4': [404, 405, 406],
    },
    {
        'key1': [107, 108],
        'key4': [407],
    },
]
Each dictionary can have different keys.
Each key maps to a list of variable length.
What I'd like to do is to make a single dictionary, by concatenating the lists that share a key...
desired_result = {
    'key1': [101, 102, 103, 107, 108],
    'key2': [201, 202, 203, 204],
    'key3': [301, 302, 303, 304, 305],
    'key4': [404, 405, 406, 407],
}
Notes:
- Order does not matter
- There are hundreds of dictionaries
- There are dozens of keys per dictionary
- Totalling hundreds of keys in the result set
- Each source list contains dozens of items
I can do this with comprehensions, but it feels very clunky, and it's actually very slow (looping over every possible key for every dictionary yields far more 'misses' than 'hits'; a quick way to measure that ratio is sketched after the snippet)...
# Build the full key set, then scan every dict for every key
{
    key: [
        item
        for d in data
        if key in d
        for item in d[key]
    ]
    for key in set(
        key
        for d in data
        for key in d
    )
}
# timeit gives 3.2 for this small data set
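For context on that 'misses vs hits' point, something like this could measure the hit/miss ratio on the real data (the names here are my own, purely illustrative):

# Rough hit/miss counter for the comprehension approach (illustrative only)
all_keys = set().union(*data)      # every key seen in any dict
hits = sum(key in d for d in data for key in all_keys)
misses = len(data) * len(all_keys) - hits
print(f'hits={hits}, misses={misses}')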
A shorter, easier-to-read/maintain option is just to loop through everything. But performance still suffers (possibly due to the large number of calls to extend(), forcing frequent reallocation of memory as the over-provisioned lists fill up?)...
from collections import defaultdict

result = defaultdict(list)      # missing keys start as empty lists
for d in data:
    for key, val in d.items():
        result[key].extend(val)     # concatenate this dict's list onto the running list
# timeit gives 1.7 for this small data set
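As a sanity check on the small example, both versions should produce the same mapping as the target above:

# Quick check against the small example (dict comparison ignores key order)
assert dict(result) == desired_result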
Is there a better way?
- more 'pythonic'?
- more concise?
- more performant?
Alternatively, is there a more applicable data structure for this type of process?
- I'm effectively building a hash map (a multimap)
- where each key is guaranteed to 'collide' multiple times, so every entry is always a list (a plain-dict spelling of the same idea is sketched below)
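For illustration, this is the plain-dict (setdefault) spelling of that multimap idea; it should behave the same as the defaultdict version above:

# Same multimap idea with a plain dict and setdefault instead of defaultdict
result = {}
for d in data:
    for key, val in d.items():
        result.setdefault(key, []).extend(val)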
Edit: Timing for small data set added
- No timings for real world data, as I don't have access to it from here (err, oops/sorry...)
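Since I can't time the real data from here, a rough sketch like this could generate stand-in data at roughly the scale described in the notes; all the sizes and names below are made-up placeholders, not the real schema:

import random

# Hypothetical stand-in data at roughly the described scale (sizes are guesses)
ALL_KEYS = [f'key{i}' for i in range(300)]      # "hundreds of keys" overall

def make_fake_data(n_dicts=500, keys_per_dict=40, items_per_list=50):
    fake = []
    for _ in range(n_dicts):                    # "hundreds of dictionaries"
        keys = random.sample(ALL_KEYS, keys_per_dict)    # "dozens of keys" per dict
        fake.append({k: [random.randrange(1000) for _ in range(items_per_list)]
                     for k in keys})            # "dozens of items" per list
    return fake

Timing both snippets against make_fake_data() output with timeit should give more representative numbers than the toy example above.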