
I have a pandas.DataFrame with several columns, some with continuous data and others with categorical. I've been trying to first group by category and then, within each category, split into arrays based on a condition (namely, whether the value falls between two numbers).

Here is a brute-force hack job I wrote that does the job, but I was wondering if there is a more elegant way.

import pandas as pd

df = pd.DataFrame({'Category1': [ 0.3,  3.0, 12.4,  7.4,
                                 20.3, 15.0, 10.9, 17.4],
                   'Category2': [   0,    0,    1,    0,
                                    1,    1,    0,    0],
                   'Category3': [   1,    2,    3,    4,
                                    5,    6,    7,    8],
                   'Category4': ['foo', 'bar', 'fizz', 'buzz',
                                 'spam', 'nii', 'blah', 'lol'],
                   # etc.
                   })

group_0_5 = df['Category1']<=5.0
group_5_10 = (df['Category1']>5.0) & (df['Category1']<=10.0)
group_10_15 = (df['Category1']>10.0) & (df['Category1']<=15.0)
group_15_20 = (df['Category1']>15.0) & (df['Category1']<=20.0)
group_20_25 = (df['Category1']>20.0) & (df['Category1']<=25.0)

state1 = (df['Category2']==1)
state2 = (df['Category2']==0)

count1_state1 = df.loc[group_0_5 & state1]['Category3'].count()
count2_state1 = df.loc[group_5_10 & state1]['Category3'].count()
count3_state1 = df.loc[group_10_15 & state1]['Category3'].count()
count4_state1 = df.loc[group_15_20 & state1]['Category3'].count()
count5_state1 = df.loc[group_20_25 & state1]['Category3'].count()

count1_state2 = df.loc[group_0_5 & state2]['Category3'].count()
count2_state2 = df.loc[group_5_10 & state2]['Category3'].count()
count3_state2 = df.loc[group_10_15 & state2]['Category3'].count()
count4_state2 = df.loc[group_15_20 & state2]['Category3'].count()
count5_state2 = df.loc[group_20_25 & state2]['Category3'].count()

count_array1=[count1_state1, count2_state1, count3_state1, count4_state1, count5_state1]

count_array2=[count1_state2, count2_state2, count3_state2, count4_state2, count5_state2]

print (count_array1)
print (count_array2)

Out[2]:
[nan, nan, 2, 1, 1]
[2, 1, 1, 1, nan]
3 Answers


I think you need cut for binning, then groupby by the bins and by column Category2 with the count aggregation, and finally add the missing combinations with reindex:

import numpy as np

bins = [-np.inf, 5, 10, 15, 20, 25, np.inf]
bins = pd.cut(df['Category1'], bins=bins)

mux = pd.MultiIndex.from_product([bins.unique(), df['Category2'].unique()])
a = df.groupby([bins, df['Category2']])['Category3'].count().reindex(mux).unstack(0)
print (a)
   (-inf, 5]  (5, 10]  (10, 15]  (15, 20]  (20, 25]
0        2.0      1.0       1.0       1.0       NaN
1        NaN      NaN       2.0       NaN       1.0

#select by categories of column Category2
print (a.loc[0].values)
[  2.   1.   1.   1.  nan]

print (a.loc[1].values)
[ nan  nan   2.  nan   1.]

If you need to replace NaN with 0, add the parameter fill_value=0 to reindex:

mux = pd.MultiIndex.from_product([bins.unique(), df['Category2'].unique()])
a = (df.groupby([bins, df['Category2']])['Category3'].count()
       .reindex(mux, fill_value=0)
       .unstack(0))
print (a)
   (-inf, 5]  (5, 10]  (10, 15]  (15, 20]  (20, 25]
0          2        1         1         1         0
1          0        0         2         0         1

print (a.loc[0].values)
[2 1 1 1 0]

print (a.loc[1].values)
[0 0 2 0 1]

Also check What is the difference between size and count in pandas?
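The distinction matters here because count ignores NaN while size counts every row in the group; a minimal sketch on toy data (not from the question) illustrating it:

```python
import pandas as pd
import numpy as np

# size() counts all rows per group; count() skips NaN values
df2 = pd.DataFrame({'g': ['a', 'a', 'b'],
                    'v': [1.0, np.nan, 2.0]})
print(df2.groupby('g')['v'].size().tolist())   # [2, 1]
print(df2.groupby('g')['v'].count().tolist())  # [1, 1]
```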

  • Thanks! The cut method is exactly what I was looking for to slim down this code. – John Ingles Apr 18 '17 at 06:21
  • any help on this - https://stackoverflow.com/questions/64685914/get-rows-based-on-a-condition-and-separate-them-into-subsets/64686033?noredirect=1#comment114373766_64686033 – pylearner Nov 04 '20 at 19:16

Using pandas.cut() and pandas.DataFrame.groupby you can collect the elements as needed:

Code:

groups = df.groupby(pd.cut(df['Category1'], [0, 5, 10, 15, 20, 25]))

group_size = groups['Category2'].count().values
group_ones = groups['Category2'].sum().values

print(list(group_ones))
print(list(group_size - group_ones))

Results:

[0, 0, 2, 0, 1]
[2, 1, 1, 1, 0]
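This works because Category2 holds only 0s and 1s, so sum() tallies the ones and subtracting it from the group size leaves the zeros; a minimal sketch of that identity on a toy series:

```python
import pandas as pd

# for a 0/1 series, sum() counts the ones and count() - sum() the zeros
s = pd.Series([0, 1, 1, 0, 1])
ones = s.sum()            # 3
zeros = s.count() - ones  # 2
print(ones, zeros)        # 3 2
```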

Again, pd.cut with groupby and set_index (using the same bin edges as the question):

bins = [0, 5, 10, 15, 20, 25]
df = df.groupby([pd.cut(df['Category1'], bins=bins, right=True), 'Category2']).Category3.count().reset_index()
df = df.set_index(['Category1', 'Category2']).unstack().reset_index(-1, drop=True)

count_array1 = df.loc[:, ('Category3', 1)].tolist()
print(count_array1)

[nan, nan, 2.0, nan, 1.0]


count_array2 = df.loc[:, ('Category3', 0)].tolist()
print(count_array2)

[2.0, 1.0, 1.0, 1.0, nan]
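If zeros are preferred over NaN in this approach as well, fillna(0) could be applied before tolist (a small sketch on a stand-in series, not part of the answer):

```python
import pandas as pd
import numpy as np

# replace missing counts with 0 before converting to a plain list
s = pd.Series([2.0, np.nan, 1.0])
print(s.fillna(0).astype(int).tolist())  # [2, 0, 1]
```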