
I have a very large CSV file that I read iteratively using the chunksize parameter of pandas' read_csv. The problem: with e.g. chunksize=2, the first 2 rows are skipped and the first chunk I receive contains rows 3-4.

Basically, if I read the CSV with nrows=4, I get the first 4 rows, while chunking the same file with chunksize=2 gets me rows 3 and 4 first, then 5 and 6, ...

#1. Read with nrows  
#read first 4 rows in csv files and merge date and time column to be used as index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime" : [1,2]}, index_col=[0], nrows=4)

print (reader)

01/01/2016 - 09:30 - A - 100
01/01/2016 - 13:30 - A - 110
01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115

#2. Iterate over csv file with chunks
#iterate over csv file in chunks and merge date and time column to be used as index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime" : [1,2]}, index_col=[0], chunksize=2)

for chunk in reader:

    #create a dataframe from chunks
    df = reader.get_chunk()
    print (df)

01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115

Increasing chunksize to 10 skips the first 10 rows.

Any ideas how I can fix this? I already have a workaround that works, but I'd like to understand where I went wrong.

Any input is appreciated!

David
  • Don't call `get_chunk`. You already have your chunk since you're iterating over the reader, i.e. `chunk` is your DataFrame. Calling `print(chunk)` in your loop should print the first two rows. – root Sep 27 '16 at 19:11
  • Thanks a lot for the quick help, works like a charm. So 'get_chunk' basically gets me the next chunk already. Sorry for the newbie question, didn't understand this from the documentation. Do you want to post this as an answer so I can say it's correct and close this question? – David Sep 27 '16 at 19:29
  • @David, look at [this example](http://stackoverflow.com/a/39053748/5741205) - it might be helpful – MaxU - stand with Ukraine Sep 27 '16 at 20:25
  • @MaxU thanks, that made it very clear what to use get_chunk for. – David Sep 29 '16 at 17:09

1 Answer


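Don't call `get_chunk`. You already have your chunk since you're iterating over the reader, i.e. `chunk` is your DataFrame. Each pass through the `for` loop consumes one chunk, and the extra `reader.get_chunk()` call inside the loop consumes the next one, which is why the rows you print start at row 3. Call `print(chunk)` in your loop, and you should see the expected output.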
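A minimal sketch of the corrected loop, assuming a small in-memory CSV in place of 'filename.csv' (the `id` and `value` columns and the `chunks` list are made up for illustration, and the date parsing from the question is left out to keep it short):

```python
import io

import pandas as pd

# Hypothetical sample data standing in for 'filename.csv'.
csv_text = """id,value
1,100
2,110
3,120
4,115
"""

# chunksize=2 makes read_csv return an iterable reader instead of a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

chunks = []
for chunk in reader:
    # chunk is already a DataFrame holding the next 2 rows;
    # no reader.get_chunk() call is needed here.
    print(chunk)
    chunks.append(chunk)
```

With this loop the first printed chunk holds the first two data rows, and nothing is skipped.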

As @MaxU points out in the comments, `get_chunk` is useful when you want differently sized chunks: `reader.get_chunk(500)`, `reader.get_chunk(100)`, etc.
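A rough sketch of that usage, again on made-up in-memory data (the column names and the sizes 3, 5 and 2 are assumptions chosen for illustration). Here the reader is created with `iterator=True` rather than a fixed `chunksize`, and each `get_chunk(n)` call returns the next `n` rows:

```python
import io

import pandas as pd

# Hypothetical 10-row CSV: ids 1..10 with value = id * 10.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 11)) + "\n"

# iterator=True returns a reader without committing to one chunk size.
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)

first = reader.get_chunk(3)   # rows 1-3
second = reader.get_chunk(5)  # rows 4-8
rest = reader.get_chunk(2)    # rows 9-10

print(first)
print(second)
print(rest)
```

Each call picks up where the previous one stopped, so mixing `get_chunk` with a `for` loop over the same reader advances it twice per iteration.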

root