I have a 1-column df with 37365 rows. I need to split it into chunks like the ones below:

df[0:2499]
df[2500:4999]
df[5000:7499]
...
df[32500:34999]
df[35000:37364]

The idea would be to use this in a loop like the one below (process_operation does not work for dfs larger than 2500 rows):

while chunk < len(df):
    process_operation(df[lower:upper])

EDIT: I will have different dataframes as inputs. Some of them will be smaller than 2500 rows. What would be the best approach to also handle these?

E.g. df[0:1234], because 1234 < 2500
Javi Torre
  • Most of the solutions to [How do you split a list into evenly sized chunks?](https://stackoverflow.com/q/312443/364696) and [What is the most “pythonic” way to iterate over a list in chunks?](https://stackoverflow.com/q/434287/364696) should apply here. Leaving this open as there might be a more clever way with dataframes. – ShadowRanger Mar 31 '21 at 12:29
  • And of course, having closed and unclosed the question, I can't close it now that I've found a proper duplicate [Pandas - Slice Large Dataframe in Chunks](https://stackoverflow.com/q/44729727/364696). – ShadowRanger Mar 31 '21 at 12:31
  • Your edit doesn't change things; Python slicing happily accepts non-existent end points; when `df.shape[0]` is `1234`, `df[0:2500]` gets the exact same result as `df[0:1234]`. – ShadowRanger Mar 31 '21 at 12:52
  • Great. Then I will accept Serge Ballesta's answer. – Javi Torre Mar 31 '21 at 13:22
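The clamping behavior ShadowRanger describes can be checked directly; the 1234-row frame below is just a hypothetical example:

```python
import pandas as pd

# A 1-column frame with 1234 rows, i.e. smaller than the 2500-row chunk size.
df = pd.DataFrame({"value": range(1234)})

# Python slicing clamps out-of-range end points, so both slices are identical.
print(df[0:2500].equals(df[0:1234]))  # → True
```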

3 Answers

The range function is enough here:

for start in range(0, len(df), 2500):
    process_operation(df[start:start+2500])
Serge Ballesta
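As a sketch, this one loop covers both the 37365-row case and frames smaller than 2500 rows; the `process_operation` below is a stand-in that only records chunk lengths:

```python
import pandas as pd

sizes = []

def process_operation(chunk):
    # Stand-in for the real operation: record the chunk length only.
    sizes.append(len(chunk))

for n_rows in (37365, 1234):
    df = pd.DataFrame({"value": range(n_rows)})
    sizes.clear()
    # range() steps through the start indices: 0, 2500, 5000, ...
    for start in range(0, len(df), 2500):
        process_operation(df[start:start + 2500])
    print(n_rows, len(sizes), max(sizes))
# → 37365 15 2500
# → 1234 1 1234
```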

Do you mean something like this?

lower = 0
upper = 2500

while lower < len(df):
    process_operation(df[lower:upper])
    lower += 2500
    upper += 2500

Note that slice ends are exclusive and out-of-range ends are clamped, so `upper` can safely run past `len(df)` and the final partial chunk is still processed.

I would use

import math

import numpy as np

chunk_max_size = 2500
# number of chunks needed so that no chunk exceeds chunk_max_size
chunks = math.ceil(len(df) / chunk_max_size)
for df_chunk in np.array_split(df, chunks):
    # here: len(df_chunk) <= 2500
    process_operation(df_chunk)
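For example, on a hypothetical 37365-row frame this yields 15 chunks, all at most 2500 rows. Note that `np.array_split` balances the chunk sizes rather than filling each chunk to the maximum:

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(37365)})
chunks = math.ceil(len(df) / 2500)        # 15 chunks
parts = np.array_split(df, chunks)

print(len(parts))                  # → 15
print(max(len(p) for p in parts))  # → 2491
```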