I have a 1-column df with 37365 rows. I need to split it into chunks like the ones below:

df[0:2499]
df[2500:4999]
df[5000:7499]
...
df[32500:34999]
df[35000:37364]

The idea would be to use this in a loop like the one below (process_operation does not work for dfs larger than 2500 rows):

while chunk < len(df):
    process_operation(df[lower:upper])

EDIT: I will have different dataframes as inputs. Some of them will be smaller than 2500 rows. What would be the best approach to also handle these?

E.g. df[0:1234], because 1234 < 2500
Javi Torre
  • Most of the solutions to [How do you split a list into evenly sized chunks?](https://stackoverflow.com/q/312443/364696) and [What is the most “pythonic” way to iterate over a list in chunks?](https://stackoverflow.com/q/434287/364696) should apply here. Leaving this open as there might be a more clever way with dataframes. – ShadowRanger Mar 31 '21 at 12:29
  • And of course, having closed and unclosed the question, I can't close it now that I've found a proper duplicate [Pandas - Slice Large Dataframe in Chunks](https://stackoverflow.com/q/44729727/364696). – ShadowRanger Mar 31 '21 at 12:31
  • Your edit doesn't change things; Python slicing happily accepts non-existent end points; when `df.shape[0]` is `1234`, `df[0:2500]` gets the exact same result as `df[0:1234]`. – ShadowRanger Mar 31 '21 at 12:52
  • Great. Then I will accept Serge Ballesta's answer. – Javi Torre Mar 31 '21 at 13:22
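The clamping behavior ShadowRanger describes can be checked directly; the 1234-row frame below is just a hypothetical example:

```python
import pandas as pd

# A 1-column frame with 1234 rows, i.e. smaller than the 2500-row chunk size.
df = pd.DataFrame({"value": range(1234)})

# Python slicing clamps out-of-range end points, so both slices are identical.
print(df[0:2500].equals(df[0:1234]))  # → True
```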

3 Answers

The range function is enough here:

for start in range(0, len(df), 2500):
    process_operation(df[start:start+2500])
Serge Ballesta
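As a sketch, this one loop covers both the 37365-row case and frames smaller than 2500 rows; the `process_operation` below is a stand-in that only records chunk lengths:

```python
import pandas as pd

sizes = []

def process_operation(chunk):
    # Stand-in for the real operation: record the chunk length only.
    sizes.append(len(chunk))

for n_rows in (37365, 1234):
    df = pd.DataFrame({"value": range(n_rows)})
    sizes.clear()
    # range() steps through the start indices: 0, 2500, 5000, ...
    for start in range(0, len(df), 2500):
        process_operation(df[start:start + 2500])
    print(n_rows, len(sizes), max(sizes))
# → 37365 15 2500
# → 1234 1 1234
```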

Do you mean something like this?

lower = 0
upper = 2500

while lower < len(df):
    process_operation(df[lower:upper])
    lower += 2500
    upper += 2500

Note that slice ends are exclusive and out-of-range ends are clamped, so `upper` can safely run past `len(df)` and the final partial chunk is still processed.

I would use

import math

import numpy as np

chunk_max_size = 2500
# number of chunks needed so that no chunk exceeds chunk_max_size
chunks = math.ceil(len(df) / chunk_max_size)
for df_chunk in np.array_split(df, chunks):
    # here: len(df_chunk) <= 2500
    process_operation(df_chunk)
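For example, on a hypothetical 37365-row frame this yields 15 chunks, all at most 2500 rows. Note that `np.array_split` balances the chunk sizes rather than filling each chunk to the maximum:

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(37365)})
chunks = math.ceil(len(df) / 2500)        # 15 chunks
parts = np.array_split(df, chunks)

print(len(parts))                  # → 15
print(max(len(p) for p in parts))  # → 2491
```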