2

I have a Pandas data frame that has 51034 rows and 10 columns. I want to slice this data frame into 158 smaller data frames based on a list that contains the rows to slice.

How is it possible to slice a pandas data frame into smaller data frames?

For example, if I have a data frame with 10 rows and 4 columns:

      A    B    C    D
0     1    2    3    4
1     5    6    7    8
2     9    10   11   12
3     13   14   15   16
4     17   18   19   20
5     21   22   23   24
6     25   26   27   28
7     29   30   31   32
8     33   34   35   36
9     37   38   39   40

This example data frame will be sliced every 2 rows to create 5 new smaller data frames:

DataFrame1:

      A    B    C    D
0     1    2    3    4
1     5    6    7    8

DataFrame2:

      A    B    C    D
0     9    10   11   12
1     13   14   15   16

DataFrame3:

      A    B    C    D
0     17   18   19   20
1     21   22   23   24

DataFrame4:

      A    B    C    D
0     25   26   27   28
1     29   30   31   32

DataFrame5:

      A    B    C    D
0     33   34   35   36
1     37   38   39   40

I am not sure how to use the slice the larger data frame to create the smaller data frames.

Any suggestions on how to accomplish this goal?

Thank you.

Rodrigo

1 Answers1

2

You can use groupby with a simple index-to-group mapping function, assuming that the index is consecutive and starts from 0:

for _, df_k in df.groupby(lambda x: x/2):
    print df_k.reset_index(drop=True)

Output:

   A  B  C  D
0  1  2  3  4
1  5  6  7  8
    A   B   C   D
0   9  10  11  12
1  13  14  15  16
    A   B   C   D
0  17  18  19  20
1  21  22  23  24
    A   B   C   D
0  25  26  27  28
1  29  30  31  32
    A   B   C   D
0  33  34  35  36
1  37  38  39  40

If you have a list of numbers indicating the slicing positions, you can pass in a dictionary as the group mapping:

import numpy as np
slice_at = [3, 5]
group_sizes = np.diff([0] + slice_at + [len(df)])
mapping = dict(zip(df.index, np.repeat(range(len(group_sizes)), group_sizes)))
for _, df_k in df.groupby(mapping):
    print df_k.reset_index(drop=True)

Output:

   A   B   C   D
0  1   2   3   4
1  5   6   7   8
2  9  10  11  12
    A   B   C   D
0  13  14  15  16
1  17  18  19  20
    A   B   C   D
0  21  22  23  24
1  25  26  27  28
2  29  30  31  32
3  33  34  35  36
4  37  38  39  40
YS-L
  • 14,358
  • 3
  • 47
  • 58
  • Thanks YS-L. The one thing I haven't figured out is instead of printing the subset data frames, how can I create new data frame variables for each subset? – Rodrigo Matus-Nicodemos May 13 '15 at 10:46
  • 1
    You can assign the subset DataFrames to some variables, or put them inside a dict or list. They are copies of part of the original DataFrame. – YS-L May 13 '15 at 10:52