6

I have a pandas DataFrame that needs to be fed in chunks of n rows into downstream functions (print in the example). The chunks may have overlapping rows.

Let's start from a dummy DataFrame:

import pandas as pd

d = {'A': list(range(1000)), 'B': list(range(1000))}
df = pd.DataFrame(d)

For 2-row chunks with a 1-row overlap, I have the following code:

a = df.index.values[:-1]
for i in a:
    print(df.iloc[i:i+2])

The output is something like this:

...
       A    B
996  996  996
997  997  997
       A    B
997  997  997
998  998  998
       A    B
998  998  998
999  999  999

Which is exactly what I want.

Is there a better/faster approach to iterate over chunks of n-rows of a pandas.DataFrame?

alec_djinn
  • @alec Your code has a "sliding window", where each chunk of output starts one row lower. If you had 10-row chunks, each row would appear in ten chunks. Is this what you really want, or do you need them non-overlapping? – alexis Jun 18 '19 at 11:11
  • @alexis Right now, I need them to be overlapping. But if there is a general method to get both, I am curious to know it. – alec_djinn Jun 18 '19 at 11:14

2 Answers

9

Use DataFrame.groupby with a helper 1-D array that has the same length as df and is integer-divided by the chunk size. The resulting chunks do not overlap:

import numpy as np
import pandas as pd

d = {'A': list(range(5)), 'B': list(range(5))}
df = pd.DataFrame(d)

print (np.arange(len(df)) // 2)
[0 0 1 1 2]

for i, g in df.groupby(np.arange(len(df)) // 2):
    print (g)

   A  B
0  0  0
1  1  1
   A  B
2  2  2
3  3  3
   A  B
4  4  4
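
As a minimal sketch of why this works for any index type: the grouping keys are built from row positions, not from index values, so the same call applies to a DatetimeIndex. The dates and the names `df_dt`, `n`, and `keys` below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 5 daily rows with a DatetimeIndex.
idx = pd.date_range("2019-06-18", periods=5, freq="D")
df_dt = pd.DataFrame({"A": range(5), "B": range(5)}, index=idx)

n = 2  # chunk size
# One integer label per row position: [0, 0, 1, 1, 2].
keys = np.arange(len(df_dt)) // n

# Rows that share a label land in the same group, in their original order.
chunks = [g for _, g in df_dt.groupby(keys)]

for g in chunks:
    print(g)
```

The last group holds the leftover row when the number of rows is not a multiple of `n`.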

EDIT:

For overlapping chunks, a chunker adapted from another answer can be used:

def chunker1(seq, size):
    # Slide a window of `size` rows one row at a time, stopping at the last full window
    return (seq.iloc[pos:pos + size] for pos in range(len(seq) - size + 1))

for i in chunker1(df,2):
    print (i)

   A  B
0  0  0
1  1  1
   A  B
1  1  1
2  2  2
   A  B
2  2  2
3  3  3
   A  B
3  3  3
4  4  4
jezrael
  • Any solution for alternative index like `DatetimeIndex`? – alec_djinn Jun 18 '19 at 11:03
  • @alec_djinn - yes, it is `General index solution:` – jezrael Jun 18 '19 at 11:04
  • Nice! Thank you. Can you explain shortly the second method? I don't quite understand what the groupby is doing there. – alec_djinn Jun 18 '19 at 11:45
  • @alec_djinn - yes, `groupby` lets you loop over groups like [this](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#iterating-through-groups); here the groups are defined by `np.arange` with integer division instead of by column `A`. – jezrael Jun 18 '19 at 11:48
3

Overlapping chunks generator function for iterating pandas DataFrames and Series

The chunk function with an overlap parameter to control the amount of overlap

A generator version of the chunk function is presented below. Its overlap parameter sets how many rows consecutive chunks share. Because it slices by position with iloc, this version also works with a custom index on the pd.DataFrame or pd.Series (e.g. a float index). To make the overlap easy to check, an integer index is used here.

   import numpy as np
   import pandas as pd

   sz = 14
   # ind = np.linspace(0., 10., num=sz)
   ind = range(sz)

   df = pd.DataFrame(np.random.rand(sz,4),
                     index=ind,
                     columns=['a', 'b', 'c', 'd'])

   def chunker(seq, size, overlap):
       for pos in range(0, len(seq), size-overlap):
           yield seq.iloc[pos:pos + size] 

   chunk_size = 6
   chunk_overlap = 2
   for i in chunker(df, chunk_size, chunk_overlap):
       print(i)

   chnk = chunker(df, chunk_size, chunk_overlap)
   print('\n', chnk, end='\n\n')
   print('First "next()":', next(chnk), sep='\n', end='\n\n')
   print('Second "next()":', next(chnk), sep='\n', end='\n\n')
   print('Third "next()":', next(chnk), sep='\n', end='\n\n')

The output for chunk_size = 6 and chunk_overlap = 2:

          a         b         c         d
0  0.577076  0.025997  0.692832  0.884328
1  0.504888  0.575851  0.514702  0.056509
2  0.880886  0.563262  0.292375  0.881445
3  0.360011  0.978203  0.799485  0.409740
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
          a         b         c         d
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
6  0.985677  0.110076  0.724568  0.990237
7  0.109516  0.777629  0.485162  0.275508
8  0.765256  0.226010  0.262838  0.758222
9  0.805593  0.760361  0.833966  0.024916
           a         b         c         d
8   0.765256  0.226010  0.262838  0.758222
9   0.805593  0.760361  0.833966  0.024916
10  0.418790  0.305439  0.258288  0.988622
11  0.978391  0.013574  0.427689  0.410877
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048
           a         b         c         d
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048

 

First "next()":
          a         b         c         d
0  0.577076  0.025997  0.692832  0.884328
1  0.504888  0.575851  0.514702  0.056509
2  0.880886  0.563262  0.292375  0.881445
3  0.360011  0.978203  0.799485  0.409740
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502

Second "next()":
          a         b         c         d
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
6  0.985677  0.110076  0.724568  0.990237
7  0.109516  0.777629  0.485162  0.275508
8  0.765256  0.226010  0.262838  0.758222
9  0.805593  0.760361  0.833966  0.024916

Third "next()":
           a         b         c         d
8   0.765256  0.226010  0.262838  0.758222
9   0.805593  0.760361  0.833966  0.024916
10  0.418790  0.305439  0.258288  0.988622
11  0.978391  0.013574  0.427689  0.410877
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048
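
To check the claim about custom indexes, here is a small sketch that runs the same generator on a Series with the float index from the commented-out line. The guard on `overlap` is an addition for this sketch, not part of the code above: without it, `overlap == size` makes `range` raise, and `overlap > size` silently yields nothing.

```python
import numpy as np
import pandas as pd

def chunker(seq, size, overlap):
    # Guard added for this sketch: the step (size - overlap) must be positive.
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    for pos in range(0, len(seq), size - overlap):
        yield seq.iloc[pos:pos + size]

sz = 14
ind = np.linspace(0., 10., num=sz)  # float index instead of range(sz)
s = pd.Series(np.arange(sz), index=ind)

chunks = list(chunker(s, 6, 2))
print([len(c) for c in chunks])

# The last 2 rows of one chunk are the first 2 rows of the next.
print(chunks[0].iloc[-2:].tolist(), chunks[1].iloc[:2].tolist())
```

As in the output above, the final chunk is shorter than `size` when the rows run out.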
Andrei Krivoshei