
I would like to split a pandas DataFrame into groups in order to process each group separately. My 'value.csv' file contains the following numbers:

num tID y x height width
2   0   0   0   1   16
2   1   1   0   1   16 
5   0   1   0   1   16 
5   1   0   0   1   8 
5   2   0   8   1   8 
6   0   0   0   1   16 
6   1   1   0   1   8 
6   2   1   8   1   8
2   0   0   0   1   16
2   1   1   0   1   16 
5   0   1   0   1   16 
5   1   0   0   1   8 
5   2   0   8   1   8 
6   0   0   0   1   16 
6   1   1   0   1   8 
6   2   1   8   1   8

I would like to split the data every time the tID column restarts at 0, like this for the first four splits:

First:

2   0   0   0   1   16
2   1   1   0   1   16 

Second:

5   0   1   0   1   16 
5   1   0   0   1   8 
5   2   0   8   1   8 

Third:

6   0   0   0   1   16 
6   1   1   0   1   8 
6   2   1   8   1   8

Fourth:

2   0   0   0   1   16
2   1   1   0   1   16 

For this, I tried to split it with an if inside a loop, but with no success. Any efficient ideas?

    import pandas as pd

    statQuality = 'value.csv'
    df = pd.read_csv(statQuality, names=['num', 'tID', 'y', 'x', 'height', 'width'])

    df2 = df.copy()
    df2.drop(['num'], axis=1, inplace=True)

    x = []

    for index, row in df2.iterrows():
        if row['tID'] == 0:
            x = []  # a new group starts here, but the finished group is never stored
            x.append(row)
            print(x)
        else:
            x.append(row)
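Since every group begins where tID drops back to 0, one loop-free idea (a sketch, not from the thread; the inline data reproduces the first half of value.csv) is to turn that marker into group ids with a cumulative sum and feed those ids to groupby:

```python
import pandas as pd
from io import StringIO

csv = """2 0 0 0 1 16
2 1 1 0 1 16
5 0 1 0 1 16
5 1 0 0 1 8
5 2 0 8 1 8
6 0 0 0 1 16
6 1 1 0 1 8
6 2 1 8 1 8"""

df = pd.read_csv(StringIO(csv), sep=r'\s+',
                 names=['num', 'tID', 'y', 'x', 'height', 'width'])

# every row where tID == 0 starts a new group; cumsum turns the
# True/False markers into increasing group ids 1, 2, 3, ...
group_id = df['tID'].eq(0).cumsum()

groups = [v.drop('num', axis=1) for _, v in df.groupby(group_id)]
print(len(groups))   # 3 groups in this shortened sample
```

Each element of `groups` is an independent DataFrame that can then be processed on its own.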
  • I suggest using `d = dict(tuple(df.groupby('num')))` – jezrael Jul 09 '18 at 07:59
  • @jezrael Thanks, good idea, but another .csv file might contain repeated values for num, so it's not a good selector. – Cagri Jul 09 '18 at 08:07
  • So you need this solution, only it is necessary to remove the `num` column in each DataFrame? – jezrael Jul 09 '18 at 08:08
  • Do you need `d = {k: v.drop('num', axis=1) for k, v in df.groupby('num')}`? – jezrael Jul 09 '18 at 08:09
  • @jezrael Thanks, yes it is necessary to remove num because it plays no part in splitting the document. Other documents might have multiple values such as 0, 2, etc. for num. – Cagri Jul 09 '18 at 08:13
  • So `d = {k: v.drop('num', axis=1) for k, v in df.groupby('num')}` works nicely? – jezrael Jul 09 '18 at 08:18
  • @jezrael For the example above, yes, it works nicely. But when there are duplicated values for num, it does not. – Cagri Jul 09 '18 at 08:20
  • Do you think the last `6` in the sample data should be `2`, so you need two different `2` groups? – jezrael Jul 09 '18 at 08:22
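The limitation raised in these comments can be seen directly: grouping by `num` alone merges separate runs of the same value into one DataFrame (a small sketch using a few repeated rows from the sample):

```python
import pandas as pd
from io import StringIO

csv = """2 0 0 0 1 16
2 1 1 0 1 16
5 0 1 0 1 16
2 0 0 0 1 16
2 1 1 0 1 16"""

df = pd.read_csv(StringIO(csv), sep=r'\s+',
                 names=['num', 'tID', 'y', 'x', 'height', 'width'])

d = dict(tuple(df.groupby('num')))
# both runs of num == 2 land in a single DataFrame, not two
print(len(d[2]))   # 4 rows
```

This is why the accepted answer below builds keys from consecutive runs rather than from `num` alone.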

1 Answer


Use:

    # create groups from consecutive runs of the same num
    s = df['num'].ne(df['num'].shift()).cumsum()
    # helper count Series so repeated nums get distinct keys like `2_0`, `2_1`, ...
    g = s.groupby(df['num']).transform(lambda x: x.factorize()[0])
    # dictionary of DataFrames keyed by `num_occurrence`
    d = {'{}_{}'.format(i, j): v.drop('num', axis=1) for (i, j), v in df.groupby(['num', g])}
    print(d)

    {'2_0':    tID  y  x  height  width
    0    0  0  0       1     16
    1    1  1  0       1     16, '2_1':    tID  y  x  height  width
    8    0  0  0       1     16
    9    1  1  0       1     16, '5_0':    tID  y  x  height  width
    2    0  1  0       1     16
    3    1  0  0       1      8
    4    2  0  8       1      8, '5_1':     tID  y  x  height  width
    10    0  1  0       1     16
    11    1  0  0       1      8
    12    2  0  8       1      8, '6_0':    tID  y  x  height  width
    5    0  0  0       1     16
    6    1  1  0       1      8
    7    2  1  8       1      8, '6_1':     tID  y  x  height  width
    13    0  0  0       1     16
    14    1  1  0       1      8
    15    2  1  8       1      8}
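To see the approach end to end, here is a self-contained rerun on the full sample data (inlined via StringIO; the helper Series is renamed to `occ` for clarity, which does not change the grouping):

```python
import pandas as pd
from io import StringIO

csv = """2 0 0 0 1 16
2 1 1 0 1 16
5 0 1 0 1 16
5 1 0 0 1 8
5 2 0 8 1 8
6 0 0 0 1 16
6 1 1 0 1 8
6 2 1 8 1 8
2 0 0 0 1 16
2 1 1 0 1 16
5 0 1 0 1 16
5 1 0 0 1 8
5 2 0 8 1 8
6 0 0 0 1 16
6 1 1 0 1 8
6 2 1 8 1 8"""

df = pd.read_csv(StringIO(csv), sep=r'\s+',
                 names=['num', 'tID', 'y', 'x', 'height', 'width'])

# run ids: increment whenever num changes from the previous row
s = df['num'].ne(df['num'].shift()).cumsum()
# per-num occurrence counter, so the second run of 2 becomes `2_1`
g = s.groupby(df['num']).transform(lambda x: x.factorize()[0]).rename('occ')
d = {'{}_{}'.format(i, j): v.drop('num', axis=1)
     for (i, j), v in df.groupby(['num', g])}

print(sorted(d))     # ['2_0', '2_1', '5_0', '5_1', '6_0', '6_1']
print(d['2_1'])      # the second run of num == 2 (original rows 8-9)
```

Each value in `d` is a plain DataFrame, so every group can be processed separately as the question asks.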