
I have some code that takes the Cartesian product of a list of lists of tuples, and then maps and casts the resulting iterator back to list for use by a subsequent function:

import itertools

# Take the Cartesian product of a list of lists of tuples
groups = itertools.product(*list_of_lists_of_tuples)

# Mapping and casting to list is necessary to put the result in the
# correct format for a subsequent function
groups_list = list(map(list, groups))
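
For concreteness, here is the shape of the transformation on a tiny, made-up input (the real list_of_lists_of_tuples is of course far larger):

# Hypothetical toy input: two lists of tuples
list_of_lists_of_tuples = [[(1, 2), (3, 4)], [(5, 6)]]

groups_list = list(map(list, itertools.product(*list_of_lists_of_tuples)))
print(groups_list)
# [[(1, 2), (5, 6)], [(3, 4), (5, 6)]]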

This all works just fine in the abstract, but leads to a memory error with massive list sizes. itertools.product is already lazy (it returns an iterator); the memory bottleneck appears to be the mapping and casting to list. I was thinking I might be able to get around this problem by splitting the iterator into chunks. The general question of how to split a Python iterator into chunks has been asked many times on this site, and there appear to be many good answers, including but not limited to:

What is the most "pythonic" way to iterate over a list in chunks?

Python generator that groups another iterable into groups of N

Iterate an iterator by chunks (of n) in Python?

...but I think there must be some embarrassing flaw in how I'm understanding iterables and generators to begin with, because I can't seem to get any of them to work. For example, assuming a grouper function similar to what's seen in some of those other threads:

def grouper(it, n):
    # Lazily yield successive n-sized chunks of any iterable
    iterable = iter(it)
    while True:
        chunks = itertools.islice(iterable, n)
        try:
            first_chunk = next(chunks)
        except StopIteration:
            return
        yield itertools.chain((first_chunk,), chunks)
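
On a toy iterable, the grouper itself does seem to behave (data made up purely for illustration):

for chunk in grouper(range(7), 3):
    print(list(chunk))
# prints [0, 1, 2], then [3, 4, 5], then [6]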

Given that, I was expecting the result to be chunks of my itertools.product object, which I could then operate on independently:

groups = itertools.product(*list_of_lists_of_tuples)

# Create chunks of the iterator that can be operated on separately
# and then combined back into a list
groups_list = []
for x in grouper(groups, 100):
    some_groups_list = list(map(list, x))
    groups_list.extend(some_groups_list)

Instead, I'm getting empty lists. Something's obviously wrong, and, again, I think the main problem here is a lack of understanding on my end. Any suggestions would be greatly appreciated.

  • The memory issue would be materialising the result into a massive list. You didn't specify *why* you need to have a list of lists, or how chunking would avoid this. Why not process each tuple as needed, one by one? – Martijn Pieters Mar 25 '17 at 17:03
  • The `itertools.product` object is an absolute necessity. I need to somehow get from that object to the eventual `groups_list` object described above. How I get from A to B isn't important, and I'd like to find a memory-efficient solution. I can't process individual tuples; what I'm interested in is the Cartesian product of many lists of tuples. – jda Mar 25 '17 at 17:08
  • Your `groups_list` won't be any more memory efficient than what you had before. It'll be *less* memory efficient, because now you have intermediary lists too. You need to focus on actually processing the chunks, not adding them to a big list. – Martijn Pieters Mar 25 '17 at 17:10
  • The problem is that I do need that big list. It's accessed many times later in the workflow. I can't simply do something with each chunk and then toss it aside. The memory-intensive part of the code is the mapping and recasting. I thought if I could do that part in smaller chunks it would be less memory intensive. I'm definitely open to ideas if there's a smarter way, but I need to have something like `groups_list` as an end state because a version of that list is vital for so many downstream processes. Apologies in advance if I'm missing something here. – jda Mar 25 '17 at 17:20
  • There is no more efficient method of converting the tuples that the `product()` object produces to a list of lists (unless this is Python 2; then use `from future_builtins import map` to get the memory-efficient Python 3 version). You are not going to get a better result by grouping. – Martijn Pieters Mar 25 '17 at 17:22
  • Thank you Martijn. Just to clarify - the only part of this code that is problematic from a memory efficiency standpoint is the mapping and recasting of that giant iterator. I had thought if I chunked the iterator and mapped/recasted each chunk in turn it would be less memory-intensive. You're saying this isn't the case, and that there is no more efficient way in Python 3.x to go from Point A (the `itertools.product` object) to Point B (the mapped/recasted list object) than how I'm doing it? – jda Mar 25 '17 at 17:26
  • The memory issue is the number of list objects you are creating, all taking space on the heap. There are no intermediary objects being held anywhere. So yes, there is no more memory-efficient way to go from your A to B. You need to find a way to process your results *later on* without needing the full product to exist in memory. – Martijn Pieters Mar 25 '17 at 17:29
  • Thank you again. Question answered. – jda Mar 25 '17 at 17:31
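
To make the takeaway from that thread concrete: one way to check Martijn's point is to compare peak allocation for the direct conversion against the chunked one. A minimal sketch using the standard-library tracemalloc module, with made-up input sizes and the grouper defined above:

import itertools
import tracemalloc

lists = [[(i, i) for i in range(10)] for _ in range(5)]  # 10**5 combinations

# Direct conversion
tracemalloc.start()
direct = list(map(list, itertools.product(*lists)))
print("direct peak:", tracemalloc.get_traced_memory()[1])
tracemalloc.stop()
del direct

# Chunked conversion
tracemalloc.start()
chunked = []
for chunk in grouper(itertools.product(*lists), 100):
    chunked.extend(map(list, chunk))
print("chunked peak:", tracemalloc.get_traced_memory()[1])
tracemalloc.stop()

Both peaks are dominated by the 100,000 small lists that end up in the final result, which is why chunking the iterator doesn't reduce memory use here.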

0 Answers