Why does Pandas skip first set of chunks when iterating over csv in my code

Question

I have a very large CSV file that I read iteratively using the chunksize parameter of pandas' read_csv. The problem: with chunksize=2, for example, it skips the first 2 rows, and the first chunk I receive contains rows 3-4.

Basically, if I read the CSV with nrows=4, I get the first 4 rows, whereas chunking the same file with chunksize=2 gives me rows 3 and 4 first, then 5 and 6, and so on.

#1. Read with nrows  
#read first 4 rows in csv files and merge date and time column to be used as index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime" : [1,2]}, index_col=[0], nrows=4)

print (reader)

01/01/2016 - 09:30 - A - 100
01/01/2016 - 13:30 - A - 110
01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115

#2. Iterate over csv file with chunks
#iterate over csv file in chunks and merge date and time column to be used as index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime" : [1,2]}, index_col=[0], chunksize=2)

for chunk in reader:

    #create a dataframe from chunks
    df = reader.get_chunk()
    print (df)

01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115

Increasing chunksize to 10 skips the first 10 rows.

Any ideas how I can fix this? I already have a workaround that works, but I'd like to understand where I went wrong.

Any input is appreciated!

Don't call `get_chunk`. You already have your chunk since you're iterating over the reader, i.e. `chunk` is your DataFrame. Calling `print(chunk)` in your loop should print the first two rows. — root, Sep 27 '16 at 19:11
Thanks a lot for the quick help, works like a charm. So 'get_chunk' basically gets me the next chunk already. Sorry for the newbie question, didn't understand this from the documentation. Do you want to post this as an answer so I can say it's correct and close this question? — David, Sep 27 '16 at 19:29
@David, look at [this example](http://stackoverflow.com/a/39053748/5741205) - it might be helpful — MaxU - stand with Ukraine, Sep 27 '16 at 20:25
@MaxU thanks, that made it very clear what to use get_chunk for. — David, Sep 29 '16 at 17:09

root · Accepted Answer · 2016-09-27T20:31:23.690

4

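Don't call `get_chunk`. You already have your chunk since you're iterating over the reader, i.e. `chunk` is your DataFrame. The `for` loop reads one chunk per iteration, and each `reader.get_chunk()` call inside the loop body reads the one after it, so the chunk the loop handed you is never printed. Call `print(chunk)` in your loop, and you should see the expected output.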
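To make the difference concrete, here is a minimal sketch that runs both loops against a small in-memory CSV. The `csv_text` data, the column names, and the helper lists are made up for illustration and stand in for `filename.csv`; the date parsing from the question is left out to keep the example short.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'filename.csv'; columns and values are invented.
csv_text = """id,value
1,100
2,110
3,120
4,115
"""

# Loop from the question: the for loop consumes one chunk, then get_chunk()
# consumes the following one, so only every second chunk is ever seen.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
seen_with_get_chunk = []
for chunk in reader:
    df = reader.get_chunk()
    seen_with_get_chunk.append(df["id"].tolist())

# Corrected loop: `chunk` is already the DataFrame for the current chunk.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
seen_with_loop_variable = []
for chunk in reader:
    seen_with_loop_variable.append(chunk["id"].tolist())

print(seen_with_get_chunk)
print(seen_with_loop_variable)
```

With four rows and `chunksize=2`, the first loop only ever sees the second chunk, while the second loop sees both.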

As @MaxU points out in the comments, you want to use `get_chunk` if you want differently sized chunks: `reader.get_chunk(500)`, `reader.get_chunk(100)`, etc.
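As a rough sketch of that use case, the reader can be opened with `iterator=True` and then asked for chunks of whatever size is needed at each step. The in-memory data and the sizes 3, 5 and 2 below are arbitrary choices for illustration, not taken from the question.

```python
import io

import pandas as pd

# Hypothetical 10-row CSV built in memory; columns and values are invented.
rows = "\n".join(f"{i},{i * 10}" for i in range(1, 11))
csv_text = "id,value\n" + rows + "\n"

# iterator=True returns a reader object without fixing a chunk size up front.
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)

first = reader.get_chunk(3)   # rows 1-3
second = reader.get_chunk(5)  # rows 4-8
rest = reader.get_chunk(2)    # rows 9-10

print(first)
print(second)
print(rest)
```

Each call picks up where the previous one stopped, which is the same mechanism that caused the skipped rows when `get_chunk` was called inside the `for` loop.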

edited Sep 27 '16 at 20:31

answered Sep 27 '16 at 20:02


you want to use `get_chunk()` if you want to read differently-sized-chunks: `reader.get_chunk(100); ... reader.get_chunk(500); ... reader.get_chunk(30); ...` – MaxU - stand with Ukraine Sep 27 '16 at 20:27
@MaxU: Thanks, that makes more sense. Updated the answer. – root Sep 27 '16 at 20:32
