
I've read a CSV into pandas in chunks:

```python
loansTFR = pd.read_csv('loans_2007.csv', chunksize=3000)
```

I iterate over it like so:

```python
for chunk in loansTFR:
    pass  # run code
```

However, if I try to iterate over the chunks a second time with a second for loop, the code inside the loop isn't executed: the chunks have already been consumed and I cannot read through them again. Do I need to read the CSV a second time to use another for loop?

  • You need to create another instance of the same iterator/chunk, since Python doesn't allow a generator to be copied, unlike other objects (lists, dictionaries, etc.). – ThePyGuy Jun 08 '21 at 17:47
  • "However, if I want to iterate over the chunks a second time with a second for loop, the code inside the loop isn't executed. " yes, because *it returns an iterator*, which is good for a single pass. Just use `pd.read_csv('loans_2007.csv', chunksize=3000)` again. – juanpa.arrivillaga Jun 08 '21 at 17:48
  • @Don'tAccept I'm not sure what you mean by "copied" here, but that doesn't sound correct – juanpa.arrivillaga Jun 08 '21 at 17:50
  • @juanpa.arrivillaga, that may be because my grammar ain't so good. – ThePyGuy Jun 08 '21 at 17:53
  • Surely there is a more time efficient way to iterate twice without running pd.read_csv multiple times? On a very large csv I imagine it would take too long – Griffin Hines Jun 08 '21 at 17:56
  • @GriffinHines the *time efficiency* is exactly the same. But in any case, usually there is a memory/runtime trade-off. If you want this to be faster, then *don't load it in chunks*. Materialize the whole dataframe and you can always do what you want with it without having to read from the file again. You can't have your cake and eat it too – juanpa.arrivillaga Jun 08 '21 at 17:57
  • You can reinitialise it multiple times using the `tee` function from `itertools`. Refer to this answer for more information: https://stackoverflow.com/questions/1271320/resetting-generator-object-in-python – Vedant Vasishtha Jun 08 '21 at 18:02
  • @VedantVasishtha that doesn't help at all. You might as well create a list out of it at that point. – juanpa.arrivillaga Jun 08 '21 at 18:04

1 Answer


As others have told you, the iterator has reached its end and will not reset. You could split it into two independent iterators beforehand, for example with `itertools.tee`:

```python
from itertools import tee

import pandas as pd

loansTFR = pd.read_csv('loans_2007.csv', chunksize=3000)

chunks1, chunks2 = tee(loansTFR)

for chunk in chunks1:
    pass  # run code
for chunk in chunks2:
    pass  # run code
```

or simply re-read the file:

```python
loansTFR = pd.read_csv('loans_2007.csv', chunksize=3000)
for chunk in loansTFR:
    pass  # run code

loansTFR = pd.read_csv('loans_2007.csv', chunksize=3000)
for chunk in loansTFR:
    pass  # run code
```

Both approaches end up holding a copy in memory, so neither is meaningfully more or less resource-intensive (in broad terms). A good practice, however, would be to read the data once in the original loop, build objects out of it (from a class of your own creation), and append them to a list, which you can then iterate over as many times as you need.
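A minimal sketch of that materialize-once idea, using a small in-memory CSV as a stand-in for `loans_2007.csv` (the column names and values here are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical stand-in for loans_2007.csv so the sketch is self-contained.
csv_data = "id,amount\n1,1000\n2,2000\n3,3000\n4,4000\n5,5000\n"

reader = pd.read_csv(io.StringIO(csv_data), chunksize=2)

# Materialize every chunk once; the resulting list can be
# iterated over any number of times.
chunks = list(reader)

first_pass = sum(len(c) for c in chunks)
second_pass = sum(len(c) for c in chunks)
print(first_pass, second_pass)  # 5 5
```

Unlike the exhausted reader, both passes over `chunks` see all five rows.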

Alternatively, you could combine the logic of both loops into one, since the data is the same on either pass.
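That combined single-pass approach might look like the following sketch, where counting rows and summing an assumed `amount` column stand in for whatever the two original loops did:

```python
import io

import pandas as pd

# Hypothetical stand-in for loans_2007.csv.
csv_data = "id,amount\n1,1000\n2,2000\n3,3000\n4,4000\n"

reader = pd.read_csv(io.StringIO(csv_data), chunksize=2)

# Do the work of both loops in a single pass over the chunks.
row_count = 0      # what the "first loop" computed (hypothetical)
amount_total = 0   # what the "second loop" computed (hypothetical)
for chunk in reader:
    row_count += len(chunk)
    amount_total += chunk["amount"].sum()

print(row_count, amount_total)  # 4 10000
```

This reads the file only once and never holds more than one chunk in memory.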

Rudolf Fanchini
    " but a good practice would be to read the data in the original loop and make objects out of all the data (from a class of your own creation) and insert them into an array which then you can iterate as much as you need" Um, not sure why you are saying that. Generally, whether or not it is good practice to put everything into a `list` out of an iterator **entirely depends on what you are doing and your performance tradeoffs** – juanpa.arrivillaga Jun 08 '21 at 18:05
  • Well, for this case, where he wants to iterate over all of them more than once, keeping them around is clearly better than creating and deleting them on each pass. Also, in data science what you want to modify and play with is usually the initial data, so formatting it, giving it functionality, and making it accessible is a good trade-off for some memory space (which is virtually endless) – Rudolf Fanchini Jun 09 '21 at 01:37