I'm reading a file (while doing some expensive logic) that I will need to iterate over several times in different functions, so I want to read and parse the file only once.
The parsing function parses the file and returns an itertools.groupby object:
def parse_file():
    ...
    return itertools.groupby(lines, key=keyfunc)
I thought about doing the following:
csv_file_content = parse_file()
file_content_1, file_content_2 = itertools.tee(csv_file_content, 2)
foo(file_content_1)
bar(file_content_2)
However, itertools.tee seems to only "duplicate" the outer iterator, while the inner (nested) iterators still refer to the originals (so they are already exhausted once I have iterated over the first iterator returned by itertools.tee).
Standalone MCVE:
from itertools import groupby, tee
li = [{'name': 'a', 'id': 1},
      {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3},
      {'name': 'b', 'id': 4},
      {'name': 'c', 'id': 5},
      {'name': 'c', 'id': 6}]
groupby_obj = groupby(li, key=lambda x:x['name'])
tee_obj1, tee_obj2 = tee(groupby_obj, 2)
print(id(tee_obj1))
for group, data in tee_obj1:
    print(group)
    print(id(data))
    for i in data:
        print(i)
print('----')
print(id(tee_obj2))
for group, data in tee_obj2:
    print(group)
    print(id(data))
    for i in data:
        print(i)
Output:
2380054450440
a
2380053623136
{'name': 'a', 'id': 1}
{'name': 'a', 'id': 2}
b
2380030915976
{'name': 'b', 'id': 3}
{'name': 'b', 'id': 4}
c
2380054184344
{'name': 'c', 'id': 5}
{'name': 'c', 'id': 6}
----
2380064387336
a
2380053623136 # same ID as above
b
2380030915976 # same ID as above
c
2380054184344 # same ID as above
How can we efficiently duplicate a nested iterator?
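For reference, here is a minimal sketch of one direction that might work, assuming the parsed data fits in memory. It sidesteps tee entirely by consuming the groupby object once and storing each inner group as a list. The helper name materialize_groups is hypothetical (it is not part of itertools), and the two "passes" stand in for the foo and bar consumers above.

```python
from itertools import groupby


def materialize_groups(iterable, key):
    # Hypothetical helper (not part of itertools): consume the groupby
    # object once and store every inner group as a list, so both the
    # outer sequence and the nested groups can be iterated repeatedly.
    return [(k, list(group)) for k, group in groupby(iterable, key=key)]


li = [{'name': 'a', 'id': 1},
      {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3},
      {'name': 'b', 'id': 4},
      {'name': 'c', 'id': 5},
      {'name': 'c', 'id': 6}]

groups = materialize_groups(li, key=lambda x: x['name'])

# Two independent passes over the same structure; lists are re-iterable,
# so the second pass still sees the nested rows.
first_pass = [(name, [row['id'] for row in rows]) for name, rows in groups]
second_pass = [(name, [row['id'] for row in rows]) for name, rows in groups]
```

The trade-off is that everything is held in memory at once. That may be acceptable here, since tee would also have to buffer items if one consumer runs to completion before the other starts, but it would not suit a file too large to keep in memory.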