
I have a very large dictionary with thousands of elements. I need to call a function with this dictionary as a parameter. Instead of passing the whole dictionary in a single call, I want to call the function in batches, with x key-value pairs of the dictionary at a time.

I am doing the following:

mydict = ...  # some large hash
x = ...       # batch size

def some_func(data):
    pass  # do something on data

temp = {}
for key, value in mydict.iteritems():
    if len(temp) != 0 and len(temp) % x == 0:
        some_func(temp)
        temp = {}
        temp[key] = value
    else:
        temp[key] = value
if temp != {}:
    some_func(temp)

This looks very hackish to me. I want to know if there is an elegant/better way of doing this.

nish
  • You could try [this (sub-dict from dict)](http://stackoverflow.com/a/25207481/1639625) or [this (split generator)](http://stackoverflow.com/a/24527424/1639625) – tobias_k Jan 19 '15 at 10:26

3 Answers


I often use this little utility:

import itertools

def chunked(it, size):
    it = iter(it)  # accept any iterable
    while True:
        p = tuple(itertools.islice(it, size))  # take up to `size` items
        if not p:  # empty tuple: the iterator is exhausted
            break
        yield p

For your use case:

for chunk in chunked(big_dict.iteritems(), batch_size):
    func(chunk)
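
Note that chunked yields each batch as a tuple of (key, value) pairs, not as a dict. If the function expects a dict, as some_func in the question does, a minimal sketch is to rebuild one per batch:

for chunk in chunked(big_dict.iteritems(), batch_size):
    func(dict(chunk))  # dict() turns the tuple of pairs back into a dict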
georg
  • Hi georg. Thanks for your answer. Could you explain the performance of the `chunked` method? Does it work better than the solution I have shared? – nish Jan 19 '15 at 10:37
  • @nish: I guess it should be efficient. `itertools` is written in C and is much faster than pure Python. – georg Jan 19 '15 at 10:41
  • 1
    In Python3 it will be `big_dict.items() instead of big_dict.iteritems()` – Cherona Jul 04 '20 at 14:16
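
Building on that last comment: on Python 3.12 and later the standard library ships itertools.batched, which does the same thing as the helper above. A sketch, reusing the big_dict, batch_size, and func names from above:

from itertools import batched  # Python 3.12+

for chunk in batched(big_dict.items(), batch_size):
    func(dict(chunk))  # each chunk is a tuple of (key, value) pairs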

Here are two solutions adapted from earlier answers of mine.

First, you can simply get the list of items from the dictionary and create new dicts from slices of that list. This is not optimal, though, as it copies large parts of that huge dictionary.

def chunks(dictionary, size):
    items = list(dictionary.items())  # list() so it is sliceable in Python 3, too
    return (dict(items[i:i+size]) for i in range(0, len(items), size))

Alternatively, you can use some of the itertools module's functions to yield (generate) new sub-dictionaries as you loop. This is similar to @georg's answer, just using a for loop.

from itertools import chain, islice

def chunks(dictionary, size):
    iterator = dictionary.iteritems()  # dictionary.items() in Python 3
    for first in iterator:
        # put the first item back in front, then pull up to size-1 more
        yield dict(chain([first], islice(iterator, size - 1)))

Example usage, for both cases:

mydict = {i+1: chr(i+65) for i in range(26)}
for sub_d in chunks(mydict, 10):
    some_func(sub_d)
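
As a quick sanity check (a sketch, using either chunks version above): the example dict has 26 entries, so a batch size of 10 yields sub-dicts of 10, 10, and 6 items.

sizes = [len(sub_d) for sub_d in chunks(mydict, 10)]
print(sizes)  # [10, 10, 6] -- the last batch is smaller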
tobias_k

From more-itertools:

from itertools import izip_longest  # itertools.zip_longest in Python 3

_marker = object()  # sentinel used to detect the padding in the last group

def chunked(iterable, n):
    """Break an iterable into lists of a given length::

        >>> list(chunked([1, 2, 3, 4, 5, 6, 7], 3))
        [[1, 2, 3], [4, 5, 6], [7]]

    If the length of ``iterable`` is not evenly divisible by ``n``, the last
    returned list will be shorter.

    This is useful for splitting up a computation on a large number of keys
    into batches, to be pickled and sent off to worker processes. One example
    is operations on rows in MySQL, which does not implement server-side
    cursors properly and would otherwise load the entire dataset into RAM on
    the client.
    """
    # Doesn't seem to run into any number-of-args limits.
    # n copies of the same iterator, zipped together, yield n-item groups.
    for group in (list(g) for g in izip_longest(*[iter(iterable)] * n,
                                                fillvalue=_marker)):
        if group[-1] is _marker:
            # If this is the last group, shuck off the padding:
            del group[group.index(_marker):]
        yield group
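
Applied to the question's use case (a sketch, reusing mydict, x, and some_func from the question; each group is a list of (key, value) pairs, so it is rebuilt into a dict before the call):

for group in chunked(mydict.iteritems(), x):
    some_func(dict(group))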
cizixs