I'm reading and parsing a file (which involves some expensive logic), and I need to iterate over the result several times in different functions, so I really want to read and parse the file only once.

The parsing function parses the file and returns an itertools.groupby object.

def parse_file():
    ...
    return itertools.groupby(lines, key=keyfunc)

I thought about doing the following:

csv_file_content = parse_file()

file_content_1, file_content_2 = itertools.tee(csv_file_content, 2)

foo(file_content_1)
bar(file_content_2)

However, itertools.tee seems to "duplicate" only the outer iterator, while the inner (nested) group iterators still refer to the originals, so they are already exhausted once the first iterator returned by itertools.tee has been consumed.

Standalone MCVE:

from itertools import groupby, tee

li = [{'name': 'a', 'id': 1},
      {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3},
      {'name': 'b', 'id': 4},
      {'name': 'c', 'id': 5},
      {'name': 'c', 'id': 6}]

groupby_obj = groupby(li, key=lambda x: x['name'])
tee_obj1, tee_obj2 = tee(groupby_obj, 2)

print(id(tee_obj1))
for group, data in tee_obj1:
    print(group)
    print(id(data))
    for i in data:
        print(i)

print('----')

print(id(tee_obj2))
for group, data in tee_obj2:
    print(group)
    print(id(data))
    for i in data:
        print(i)

Output:

2380054450440
a
2380053623136
{'name': 'a', 'id': 1}
{'name': 'a', 'id': 2}
b
2380030915976
{'name': 'b', 'id': 3}
{'name': 'b', 'id': 4}
c
2380054184344
{'name': 'c', 'id': 5}
{'name': 'c', 'id': 6}
----
2380064387336
a
2380053623136  # same ID as above
b
2380030915976  # same ID as above 
c
2380054184344  # same ID as above

How can we efficiently duplicate a nested iterator?

DeepSpace
  • But if you tee the inner iterator, wouldn't you be reading the file twice? – Dani Mesejo Jan 01 '19 at 10:20
  • You'd probably be better off hardcoding everything into lists. – Jean-François Fabre Jan 01 '19 at 10:23
  • It seems `grouped_object` cannot be used twice, even with `tee`. This parallel assignment doesn't work: `tee_obj1, tee_obj2 = groupby_obj, groupby_obj`. But I guess this gives the expected result: `tee_obj1, tee_obj2 = copy.deepcopy(groupby_obj), groupby_obj`. – iGian Jan 01 '19 at 10:30
  • "How to recursively copy an iterator" cannot be properly answered (or there is no solution), as discussed here: https://stackoverflow.com/questions/42132731/how-to-create-a-copy-of-python-iterator but in your case it seems that `deepcopy` solves it. – Jean-François Fabre Jan 01 '19 at 10:46

1 Answer

It seems that grouped_object (class 'itertools.groupby') can be consumed only once, even through itertools.tee. Parallel assignment of the same grouped_object doesn't work either, since both names refer to the same iterator:

tee_obj1, tee_obj2 = groupby_obj, groupby_obj

What does work is a deep copy of the grouped_object:

import copy  # needed for copy.deepcopy
tee_obj1, tee_obj2 = copy.deepcopy(groupby_obj), groupby_obj
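
An alternative that does not rely on copying iterator state, in line with the comment about using lists, is to materialize each group once and iterate over the result as often as needed. The sketch below is not from the original answer: `materialize_groups` is a hypothetical helper name, and it assumes the grouped data fits in memory. It may also be worth checking whether `copy.deepcopy` on itertools objects is still supported in your Python version, since that support was, as far as I know, deprecated in Python 3.12.

```python
from itertools import groupby


def materialize_groups(items, key):
    """Return a list of (key, list_of_items) pairs.

    Hypothetical helper: each lazy group is converted to a list while
    groupby is still positioned on it, so nothing is shared or exhausted.
    """
    return [(k, list(group)) for k, group in groupby(items, key=key)]


li = [{'name': 'a', 'id': 1},
      {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3},
      {'name': 'b', 'id': 4},
      {'name': 'c', 'id': 5},
      {'name': 'c', 'id': 6}]

groups = materialize_groups(li, key=lambda x: x['name'])

# The same list can be walked any number of times, by any number of functions.
first_pass = [(k, [d['id'] for d in data]) for k, data in groups]
second_pass = [(k, [d['id'] for d in data]) for k, data in groups]
```

Here `groups` can be passed to both `foo` and `bar` directly, so the file is still read and parsed only once.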
iGian
  • "It seems like grouped_object (class 'itertools.groupby') be consumed once, even in itertools.tee" I don't think this is true, otherwise `a b c` would not have been output the second time. I'll accept this answer, though I was hoping to use something more elegant than `deepcopy`. Thanks! – DeepSpace Jan 01 '19 at 11:12
  • What is true is that the grouper objects returned as "values" of each groupby iteration cannot be tee'd. – Jean-François Fabre Jan 01 '19 at 15:20
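
Building on that last comment, one possible workaround is to turn each grouper into a concrete tuple before handing the stream to `tee`, so that the items `tee` caches hold real data rather than iterators tied to the original `groupby`. This is a minimal sketch using the MCVE data, not something proposed in the thread; the names `concrete`, `it1`, and `it2` are illustrative.

```python
from itertools import groupby, tee

li = [{'name': 'a', 'id': 1},
      {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3},
      {'name': 'b', 'id': 4},
      {'name': 'c', 'id': 5},
      {'name': 'c', 'id': 6}]

# Each grouper is consumed into a tuple as soon as it is produced,
# while groupby is still positioned on that group.
concrete = ((k, tuple(g)) for k, g in groupby(li, key=lambda x: x['name']))

# tee now duplicates a stream of (key, tuple) pairs, which are safe to share.
it1, it2 = tee(concrete, 2)

result1 = [(k, [d['id'] for d in data]) for k, data in it1]
result2 = [(k, [d['id'] for d in data]) for k, data in it2]
```

Note that when one copy is fully consumed before the other is started, `tee` has to buffer every item, so in that usage pattern this needs about as much memory as building a plain list.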