
Let us say I have a dictionary with 1000 key-value pairs:

x = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', ....}

I would like to convert it into

x = [{1: 'a', 2: 'b', 3: 'c', ...}, {10: 'z', 11: 'z', 12: 'z', ...}]

I am wondering if Python has a built-in function for this. My other concern is scaling: if I have 1 million key-value pairs in a dictionary, I would like it split into a list of dictionaries of 1000 key-value pairs each.

Dean Christian Armada
  • If you are doing this in order to process the dictionaries in parallel then it means you don't really need a `dict` (you can't look anything up with part of a dictionary). So you may do better to start with a list of tuples of (key, value) pairs, and then split that up via simple list indexing or a [grouping algorithm](https://stackoverflow.com/a/8991553/3830997). Or you could just pass the original list to `multiprocessing.Pool.map` with a [`chunksize`](https://stackoverflow.com/q/3822512/3830997) specified (a minimal sketch of this appears after these comments). For a lot of workflows, splitting the dicts will take longer than serial processing. – Matthias Fripp Jul 20 '19 at 07:58
  • The dictionary came from aggregated data, and a dictionary is the best data type for it – Dean Christian Armada Jul 20 '19 at 08:20
  • You mean you're using the fact that dictionaries automatically keep only the last value added for a particular key? Even then, you may find it more efficient to chunk `x.items()` or pass it directly to the parallel `map` function, instead of chunking the items and then reassembling them into dictionaries. – Matthias Fripp Jul 20 '19 at 08:25
  • I'm not keeping the last added value but continually summing up values for a particular key. Think of it like this: I have 1 million records, and each person can have hundreds of records. The key of the dictionary is the member id and the value is the amount that will be added to the database. So 1 million records can be reduced to around 200,000 records using this strategy with a dictionary. What you are suggesting, I believe, is direct parallel processing, which we already have but which is not efficient enough. – Dean Christian Armada Jul 20 '19 at 08:33
  • Sorry, I leapt to an assumption about why you were using dicts in the first place. For the aggregation step, a dict makes a lot of sense. But after you have created the dict, it may be (slightly) more efficient to convert the dict to a list and process chunks of that directly rather than converting them back into dicts to process. But that depends on how your back end works. You may also want to consider bulk copying x.items() to the database as csv text, then using sql to add those rows to the people's records. That can be hundreds of times faster than updating rows from Python. – Matthias Fripp Jul 20 '19 at 08:52
  • I am using Django 2.2, so the bulk update is already done by `bulk_update`. Actually, the answer below, which uses generators, is really good. I tried 10 million keys and it only takes about 2 seconds. – Dean Christian Armada Jul 20 '19 at 09:00
  • That sounds great! I'm surprised that parallelizing the updates helps. I would have thought that would be I/O bound (by network card or disk access). Glad it worked for you. – Matthias Fripp Jul 20 '19 at 09:17
  • Thanks @MatthiasFripp. Basically, I did a combination of map-reduce producing a "final result" and then parallel database update processing on that result. My real problem was how to parallelize the final result output of the map-reduce, which everybody helped to solve. – Dean Christian Armada Jul 20 '19 at 09:54
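
A minimal sketch of the `chunksize` suggestion above, assuming a hypothetical per-item `process` worker and that the (key, value) pairs can be handled independently (the names and sizes here are only placeholders):

from multiprocessing import Pool

def process(item):
    # hypothetical worker: `item` is a single (member_id, amount) pair; replace with real logic
    member_id, amount = item
    return member_id, amount * 2

if __name__ == '__main__':
    x = {i: i for i in range(1_000_000)}  # stand-in for the real aggregated dict
    with Pool() as pool:
        # Pool.map batches the iterable into groups of `chunksize` items internally,
        # so the intermediate sub-dictionaries never need to be built at all
        results = pool.map(process, x.items(), chunksize=1000)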

3 Answers


With that many values I would consider using a generator to produce the chunks. It greatly depends on what you are going to do with them (whether you need all of them at the same time or you process one chunk at a time):

# create some dictionary
x = {i: 'z' + str(i) for i in range(1, 22+1)}

def get_chunks(x, size=10):
    out = {}
    for i, k in enumerate(x, 1):
        out[k] = x[k]
        # a full chunk of `size` items is ready
        if i % size == 0:
            yield out
            out = {}
    # last (possibly smaller) chunk:
    if out:
        yield out

for chunk in get_chunks(x):
    print(chunk)

Prints:

{1: 'z1', 2: 'z2', 3: 'z3', 4: 'z4', 5: 'z5', 6: 'z6', 7: 'z7', 8: 'z8', 9: 'z9', 10: 'z10'}
{11: 'z11', 12: 'z12', 13: 'z13', 14: 'z14', 15: 'z15', 16: 'z16', 17: 'z17', 18: 'z18', 19: 'z19', 20: 'z20'}
{21: 'z21', 22: 'z22'}

To put the results inside a list:

print(list(get_chunks(x)))
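
For the scale mentioned in the question, the same generator works unchanged; the million-key dictionary below is only an illustration:

# hypothetical million-key dictionary, processed 1000 keys at a time
big = {i: 'z' + str(i) for i in range(1, 1_000_001)}
count = 0
for chunk in get_chunks(big, size=1000):
    count += len(chunk)  # stand-in for real per-chunk work
print(count)  # 1000000
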
Andrej Kesely

You can use the grouper recipe from itertools (substitute 10 with any chunk size you desire):

list(map(dict, zip(*[iter(x.items())] * 10)))

If you are only going to iterate through the sequence of sub-dicts, however, you don't need the costly conversion to a list that your question suggests. In that case you can simply iterate over the iterable returned by the `map` function instead, which is both time- and memory-efficient:

for chunk in map(dict, zip(*[iter(x.items())] * 10)):
    print(chunk)
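
One caveat, not noted in the answer above: plain `zip` silently drops the trailing items when the dictionary's size is not a multiple of the chunk size (with the 22-key example dict and a chunk size of 10, keys 21 and 22 would be lost). A minimal variant sketch, using `itertools.zip_longest` with a sentinel so the final partial chunk is kept (the name `dict_chunks` is just for illustration):

from itertools import zip_longest

_PAD = object()  # sentinel marking the padding added to the final, shorter group

def dict_chunks(d, size=10):
    it = iter(d.items())
    for group in zip_longest(*[it] * size, fillvalue=_PAD):
        # strip the padding, then rebuild a dict from the remaining (key, value) pairs
        yield dict(pair for pair in group if pair is not _PAD)

for chunk in dict_chunks(x):
    print(chunk)
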
blhsing
  • But will this still be fast if I have 1 million entries in my dictionary? – Dean Christian Armada Jul 20 '19 at 06:41
  • it will be fast and efficient until you apply the last `list` type conversion – Swadhikar Jul 20 '19 at 07:07
  • @DeanChristianArmada If you are only going to iterate through the sequence of sub-dicts, however, you don't need the costly conversion to a list that your question suggests; you can simply iterate over the iterable returned by the `map` function instead, which is both time- and memory-efficient. I've updated my answer accordingly. – blhsing Jul 22 '19 at 21:21

A straightforward, and super-ugly, answer to your question is something like this:

import itertools

def slice_it_up(d, n):
    # build a dict from each consecutive slice of n (key, value) pairs
    return [dict(itertools.islice(d.items(), i, i + n)) for i in range(0, len(d), n)]

d = {'key1': 1, 'key2': 2, 'key3': 3, 'key4': 4, 'key5': 5}
dd = slice_it_up(d, 3)

print(dd)

This prints

[{'key1': 1, 'key2': 2, 'key3': 3}, {'key4': 4, 'key5': 5}]

This is by no means something that should actually be done, though: every `islice` call walks `d.items()` from the beginning again, so the work grows quadratically with the size of the dictionary. As the first answer already mentioned, you should really use a generator to produce the chunks.

Since you've mentioned some kind of parallel processing (let's hope you aren't going to learn what Python's GIL is at that stage; Google it and see whether you are going to be hit by it), at the very least you really don't have to aggregate the `itertools.islice` results (each of which is a lazy iterator) into one big fat list, but can send the chunks straight into processing instead, as sketched below.
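
A minimal sketch of that idea, assuming a hypothetical `process` worker and a chunk size of 1000; `Pool.imap` pulls the chunks lazily, so the full list of chunks never has to exist in memory at once:

import itertools
from multiprocessing import Pool

def process(pairs):
    # hypothetical worker: `pairs` is a list of (key, value) tuples; replace with real logic
    return sum(value for _, value in pairs)

def iter_chunks(items, size):
    # lazily yield lists of `size` (key, value) pairs instead of one big list of chunks
    it = iter(items)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

if __name__ == '__main__':
    d = {i: i for i in range(1_000_000)}  # stand-in for the real dictionary
    with Pool() as pool:
        grand_total = sum(pool.imap(process, iter_chunks(d.items(), 1000)))
    print(grand_total)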

Boris Lipschitz