Using itertools.tee to duplicate a nested iterator (ie itertools.groupby)

Question

I'm reading a file (while doing some expensive logic) that I will need to iterate over several times in different functions, so I really want to read and parse the file only once.

The parsing function parses the file and returns an itertools.groupby object.

def parse_file():
    ...
    return itertools.groupby(lines, key=keyfunc)

I thought about doing the following:

csv_file_content = parse_file()

file_content_1, file_content_2 = itertools.tee(csv_file_content, 2)

foo(file_content_1)
bar(file_content_2)

However, itertools.tee seems to "duplicate" only the outer iterator, while the inner (nested) iterators still refer to the originals. They are therefore already exhausted once the first iterator returned by itertools.tee has been consumed.

Standalone MCVE:

from itertools import groupby, tee

li = [{'name': 'a', 'id': 1},
      {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3},
      {'name': 'b', 'id': 4},
      {'name': 'c', 'id': 5},
      {'name': 'c', 'id': 6}]

groupby_obj = groupby(li, key=lambda x:x['name'])
tee_obj1, tee_obj2 = tee(groupby_obj, 2)

print(id(tee_obj1))
for group, data in tee_obj1:
    print(group)
    print(id(data))
    for i in data:
        print(i)

print('----')

print(id(tee_obj2))
for group, data in tee_obj2:
    print(group)
    print(id(data))
    for i in data:
        print(i)

Outputs

2380054450440
a
2380053623136
{'name': 'a', 'id': 1}
{'name': 'a', 'id': 2}
b
2380030915976
{'name': 'b', 'id': 3}
{'name': 'b', 'id': 4}
c
2380054184344
{'name': 'c', 'id': 5}
{'name': 'c', 'id': 6}
----
2380064387336
a
2380053623136  # same ID as above
b
2380030915976  # same ID as above 
c
2380054184344  # same ID as above

How can we efficiently duplicate a nested iterator?

But if you tee the inner iterator, wouldn't you be reading the file twice? — Dani Mesejo, Jan 01 '19 at 10:20
you'd probably be better off by hardcoding everything into lists. — Jean-François Fabre, Jan 01 '19 at 10:23
It seems `grouped_object` even in `tee` can not be used twice. This parallel doesn't work: `tee_obj1, tee_obj2 = groupby_obj, groupby_obj`. But I guess this gives the expected result: `tee_obj1, tee_obj2 = copy.deepcopy(groupby_obj), groupby_obj`. I guess.. — iGian, Jan 01 '19 at 10:30
"how to recursively copy an iterator" cannot be properly answered (or there is no solution) as discussed here https://stackoverflow.com/questions/42132731/how-to-create-a-copy-of-python-iterator but in your case it seems that `deepcopy` solves it. — Jean-François Fabre, Jan 01 '19 at 10:46

score 2 · Accepted Answer · edited Jan 01 '19 at 11:15

It seems that grouped_object (class 'itertools.groupby') can be consumed only once, even through itertools.tee. Parallel assignment of the same grouped_object doesn't work either:

tee_obj1, tee_obj2 = groupby_obj, groupby_obj

What's working is a deep copy of the grouped_object:

import copy

tee_obj1, tee_obj2 = copy.deepcopy(groupby_obj), groupby_obj
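As an alternative that avoids copying iterator objects altogether (and as far as I know, copy/deepcopy support for itertools objects was deprecated in Python 3.12 and removed in 3.14), the groups can be materialized into lists once and then reused, as suggested in the comments on the question. A minimal sketch, assuming the parsed data fits in memory; `ids_by_name` is a hypothetical consumer standing in for `foo` and `bar`:

```python
from itertools import groupby

# Sample rows taken from the question's MCVE.
rows = [{'name': 'a', 'id': 1},
        {'name': 'a', 'id': 2},
        {'name': 'b', 'id': 3},
        {'name': 'b', 'id': 4},
        {'name': 'c', 'id': 5},
        {'name': 'c', 'id': 6}]

# Turn each group into a list while groupby is still positioned on it.
# The result is a plain list of (key, list) pairs that can be iterated
# any number of times.
groups = [(name, list(items))
          for name, items in groupby(rows, key=lambda row: row['name'])]


def ids_by_name(grouped):
    """Hypothetical consumer: collect the ids found under each name."""
    return {name: [row['id'] for row in items] for name, items in grouped}


first_pass = ids_by_name(groups)
second_pass = ids_by_name(groups)  # same data, nothing is exhausted
```

The cost is that every row is held in memory at once, but the file is still read and parsed only a single time.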

answered Jan 01 '19 at 10:54 by iGian (edited Jan 01 '19 at 11:15 by DeepSpace)

"It seems like grouped_object (class 'itertools.groupby') be consumed once, even in itertools.tee" I don't think this is true, otherwise `a b c` would not have been outputted the second time. I'll accept this answer though I was hoping to use something more elegant than `deepcopy`. Thanks! – DeepSpace Jan 01 '19 at 11:12
what is true is that the grouper objects returned as "values" of each groupby iteration cannot be tee'd. – Jean-François Fabre Jan 01 '19 at 15:20
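The point made in the last comment can be illustrated with a short sketch. `tee` buffers the (key, grouper) pairs, so both branches receive the very same grouper objects, and those are already spent when the second branch reaches them. One possible workaround, offered here as an assumption rather than as something proposed in the thread, is to snapshot each group into a tuple before `tee` sees it. The variable names are illustrative only:

```python
from itertools import groupby, tee

letters = "aabbcc"

# Plain tee: the keys are replayed, but the grouper objects are shared.
first, second = tee(groupby(letters))
first_result = [(key, list(group)) for key, group in first]
# The groupers below were consumed (or invalidated) during the first pass.
second_result = [(key, list(group)) for key, group in second]

# Workaround sketch: convert each group to a tuple as soon as it is
# produced, so tee buffers immutable snapshots instead of live groupers.
snapshots = ((key, tuple(group)) for key, group in groupby(letters))
left, right = tee(snapshots)
left_result = list(left)
right_result = list(right)
```

Note that when one branch is consumed completely before the other starts, `tee` ends up buffering every item anyway, so building a plain list up front is usually simpler and no more expensive in that situation.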
