1

I am trying to group a dataframe based on the occurrence a variable. For example take this dataframe

   | col_1 | col_2
---------------------
 0 | 1     | 1
 1 | 0     | 1
 2 | 0     | 1
 3 | 0     | -1
 4 | 0     | -1
 5 | 0     | -1
 6 | 0     | NaN
 7 | -1    | NaN
 8 | 0     | NaN
 9 | 0     | -1
 10| 0     | -1
 11| 0     | -1

I want to group variable based on the current occurrence of a variable in column_2 to a dataframe and get the next sequence into another dataframe and likewise till the end of dataframe while also ignoring NaN.

So the final output would be like: ones_1 =

   | col_1 | col_2
---------------------
 0 | 1     | 1
 1 | 0     | 1
 2 | 0     | 1

mones_1 =

 3 | 0     | -1
 4 | 0     | -1
 5 | 0     | -1

mones_2 =

 9 | 0     | -1
 10| 0     | -1
 11| 0     | -1
Srikanth Varma
  • 175
  • 1
  • 10

2 Answers2

1

I suggest create dictionary of DataFrames:

#only non missing rows
mask = df['col_2'].notna()
#create unique groups
g = df['col_2'].ne(df['col_2'].shift()).cumsum()
#create counter of filtered g
g = g[mask].groupby(df['col_2']).transform(lambda x:pd.factorize(x)[0]) + 1
#map positive and negative values to strings and add counter values
g = df.loc[mask, 'col_2'].map({-1:'mones_',1:'ones_'}) + g.astype(str)
#generally groups
#g = 'val' + df.loc[mask, 'col_2'].astype(str) + ' no' + g.astype(str)
print (g)
0      ones_1
1      ones_1
2      ones_1
3     mones_1
4     mones_1
5     mones_1
9     mones_2
10    mones_2
11    mones_2
Name: col_2, dtype: object

#create dictionary of DataFrames
dfs = dict(tuple(df.groupby(g)))
print (dfs)
{'mones_1':    col_1  col_2
3      0   -1.0
4      0   -1.0
5      0   -1.0, 'mones_2':     col_1  col_2
9       0   -1.0
10      0   -1.0
11      0   -1.0, 'ones_1':    col_1  col_2
0      1    1.0
1      0    1.0
2      0    1.0}

#select by keys
print (dfs['ones_1'])
   col_1  col_2
0      1    1.0
1      0    1.0
2      0    1.0

It is not recommended, but possible create DataFrames by groups with variable names:

for i, g in df.groupby(g):
    globals()[i] =  g

print (ones_1)
   col_1  col_2
0      1    1.0
1      0    1.0
2      0    1.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

here is another logic (keeping them in dictionary is the idea again):

m=df[df.col_2.notna()] #filter out the NaN rows
#check if the index are in sequence along with that check if values changes per row
s=m.col_2.ne(m.col_2.shift())|m.index.to_series().diff().fillna(1).gt(1)
dfs={f'df_{int(i)}':g for i , g in df.groupby(s.cumsum())} #groupby and store in dict

Access the dataframes by accessing the keys:

print(dfs['df_1'])
print('---------------------------------')
print(dfs['df_2'])
print('---------------------------------')
print(dfs['df_3'])

   col_1  col_2
0      1    1.0
1      0    1.0
2      0    1.0
---------------------------------
   col_1  col_2
3      0   -1.0
4      0   -1.0
5      0   -1.0
---------------------------------
    col_1  col_2
9       0   -1.0
10      0   -1.0
11      0   -1.0
anky
  • 74,114
  • 11
  • 41
  • 70