I came across some code that gets back an iterable object from the Dynamo database, and I can do:

print([en["student_id"] for en in enrollments])

However, when I do something similar again:

print([en["course_id"] for en in enrollments])

The second iteration then prints only an empty list, because the iterable can be iterated only once and has already reached its end.
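A minimal sketch of this behaviour, assuming the database result behaves like a generator. `fake_enrollments` and its data are made up for illustration and are not the real Dynamo call:

```python
def fake_enrollments():
    # Hypothetical stand-in for the database result; yields one dict at a time.
    yield {"student_id": 1, "course_id": "MATH101"}
    yield {"student_id": 2, "course_id": "PHYS201"}

enrollments = fake_enrollments()

first_pass = [en["student_id"] for en in enrollments]   # consumes the generator
second_pass = [en["course_id"] for en in enrollments]   # nothing left to yield

print(first_pass)   # [1, 2]
print(second_pass)  # []
```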

The question is: how can we iterate over it more than once, (1) when it is known to contain only a few items, and (2) when it may contain a great many items (say, a million) and we don't want to use a lot of additional memory?

Relatedly, I looked up rewind, and it seems to exist in PHP and Ruby, but not in Python?

nonopolarity
  • The only other option bar storing all the data is `a, b = itertools.tee(it)`, but that is only useful if you are not consuming all or most of the data with one iterator first; if you are, you are better off with a list. – Padraic Cunningham Feb 20 '16 at 10:27

2 Answers

enrollments is a generator. Either recreate the generator if you need to iterate again, or convert it to a list first:

enrollments = list(enrollments)

Take into account that APIs often use generators to avoid memory bloat; a list must have references to all objects it contains, so all those objects have to exist at the same time. A generator can produce the elements one by one, as needed; your list comprehension discards those objects again once the 'student_id' key has been extracted.
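As a rough sketch of the two options, assume a hypothetical `fetch_enrollments()` function that stands in for the real database query and returns a fresh generator on every call (the name and the sample data are illustrative only):

```python
def fetch_enrollments():
    # Hypothetical stand-in for the real query; in real code each call
    # would hit the database again.
    for student_id, course_id in [(1, "MATH101"), (2, "PHYS201")]:
        yield {"student_id": student_id, "course_id": course_id}

# Option 1: materialise once. Simple, but every object is held in memory.
enrollments = list(fetch_enrollments())
student_ids = [en["student_id"] for en in enrollments]
course_ids = [en["course_id"] for en in enrollments]

# Option 2: recreate the generator for each pass. Memory stays low,
# at the price of running the query a second time.
student_ids_again = [en["student_id"] for en in fetch_enrollments()]
course_ids_again = [en["course_id"] for en in fetch_enrollments()]
```

The first option suits a handful of items; the second trades repeated queries for a small memory footprint.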

The alternative is to iterate just once, and do all the things with each object you want to do. So instead of running two list comprehensions, run one regular for loop and extract all the data you need in one place, appending to separate lists as you go along:

courses = []
students = []
for enrollment in enrollments:
    courses.append(enrollment['course_id'])
    students.append(enrollment['student_id'])

rewind in PHP is unrelated to this; Python has fileobj.seek(0) to do the same, but file objects are not generators.
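For comparison, a small sketch of rewinding a file object with `seek(0)`. It uses an in-memory `io.StringIO` as a stand-in for a real file, purely for illustration:

```python
import io

# In-memory stand-in for a real file opened with open().
fileobj = io.StringIO("first line\nsecond line\n")

first_read = fileobj.readlines()   # reads to the end of the file
second_read = fileobj.readlines()  # position is at the end, so this is empty

fileobj.seek(0)                    # move the position back to the start
third_read = fileobj.readlines()   # the same lines can be read again
```

A generator has no equivalent position to reset, which is why this approach does not carry over.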

Martijn Pieters
  • so I get it back from the Dynamo database call... how can I recreate it? (preferably without making another call to the db, as that involves the network and db access) – nonopolarity Feb 20 '16 at 10:18
  • @太極者無極而生: by making the call again. – Martijn Pieters Feb 20 '16 at 10:18
  • interesting that, if we get back all data from a Dynamo db (should be similar to MongoDB), don't we occupy all the memory in RAM already? It is not like we are iterating over all permutations of 12 dice so we don't really need to store `6**12` tuples, in which case using a generator will save lots of memory – nonopolarity Feb 20 '16 at 10:25
  • I don't know how DynamoDB streams results. I imagine buffers (disk, network, etc.) are involved; it is not required to read everything into memory at once. – Martijn Pieters Feb 20 '16 at 10:27
  • interesting... so using `list(enrollments)` can potentially be troublesome if there are 30,000 * 5 = 150,000 enrollments (if there are 30,000 students, and each student takes 5 classes on average). But if `enrollments` is fetched by using a `studentID`, then it may typically contain 5 items and using `list(enrollments)` will be ok – nonopolarity Feb 20 '16 at 10:30
  • It depends. If this is a web server, how many other requests are active doing the same thing? How large are these objects? Take into account that creating memory allocations takes time too, so creating 150k objects takes longer if done in one chunk than creating them in a stream and memory can be reused. – Martijn Pieters Feb 20 '16 at 10:31
1
import itertools
it1, it2 = itertools.tee(enrollments, 2)  # pass n positionally; CPython's tee() rejects keyword arguments

This appears to be an answer from here: Why can't I iterate twice over the same data? But it is only suitable if you are not going to iterate too many times.

Community
Paul
  • **Note**: this is *less efficient* in both time and space compared to just `list(enrollments)`. The only situation where this is better is if you want to iterate at the same time. Like in `it1, it2 = tee(iterator, 2); next(it1); for a, b in zip(it1, it2): # do stuff`. Here only two values will be kept in memory at every iteration. However, if you first iterate over `it1`, then all the values generated will be stored in a linked list, making it equivalent to just calling `list(iterator)` (in fact less efficient, as stated before). – Bakuriu Feb 20 '16 at 10:26
  • Do *not* use `itertools.tee()` if you are going to exhaust `a` before even starting `b`. Just use `list(it)` in that case. Only use `tee()` if you are mixing iteration over the teed output, to minimise the buffer it has to create. – Martijn Pieters Feb 20 '16 at 10:29
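
To illustrate the interleaved case the comments describe, here is a minimal sketch in which the two teed iterators advance in lockstep, one item apart. `numbers` is a hypothetical stand-in for any one-shot iterator:

```python
import itertools

def numbers():
    # Hypothetical stand-in for any one-shot iterator.
    for n in [10, 20, 30, 40]:
        yield n

it1, it2 = itertools.tee(numbers(), 2)

# Advance the second iterator by one item so it runs one step ahead.
next(it2, None)

# Both iterators are consumed together, so tee() only needs to buffer
# the small gap between them rather than the whole sequence.
pairs = list(zip(it1, it2))
print(pairs)  # [(10, 20), (20, 30), (30, 40)]
```

If `it1` were exhausted before `it2` was touched, the buffer would grow to hold every item, which is the situation where `list()` is the better choice.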