0

I have a numpy array named "distances" which looks like this:

[[ 5.  1.  1.  1.  2.  1.  3.  1.  1.  1.]
 [ 5.  4.  4.  5.  7. 10.  3.  2.  1.  1.]
 [ 3.  1.  1.  1.  2.  2.  3.  1.  1.  0.]
 [ 6.  8.  8.  1.  3.  4.  3.  7.  1.  1.]
 [ 4.  1.  1.  3.  2.  1.  3.  1.  1.  1.]
 [ 8. 10. 10.  8.  7. 10.  9.  7.  1.  1.]
 [ 1.  1.  1.  1.  2. 10.  3.  1.  1.  0.]
 [ 2.  1.  2.  1.  2.  1.  3.  1.  1.  0.]
 [ 2.  1.  1.  1.  2.  1.  1.  1.  5.  2.]
 [ 4.  2.  1.  1.  2.  1.  2.  1.  1.  1.]]

I want to build a new 3*9 numpy array by taking means as follows:

  1. For the rows whose last column is 0, define an array c0 (1*9) in which each entry is the mean of the corresponding column over those rows.
  2. For the rows whose last column is 1, define an array c1 (1*9) in which each entry is the mean of the corresponding column over those rows.
  3. For the rows whose last column is 2, define an array c2 (1*9) in which each entry is the mean of the corresponding column over those rows.

After that, I use hstack to get the final 3*9 array. I am sure this is the long approach, and it is wrong nonetheless.

code:

c0=distances.mean(axis=1)

final = np.hstack((c0,c1,c2))

Doing this, I get a 1*10 array where each column is the average of the corresponding column of the distances array. However, I cannot find a way to take the average only over the rows whose last column is 0.

R_Moose

2 Answers

1

With pandas

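This would be straightforward with pandas -

import pandas as pd

df = pd.DataFrame(distances)
df_out = df.groupby(df.shape[1]-1).mean()  # group by the last column, average the other nine
df_out['ID'] = df_out.index  # keep the group key (0, 1, 2) as an extra column
out = df_out.values  # shape (3, 10): nine means plus the ID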

With NumPy

Using Custom-function

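For a NumPy-specific solution, we can use groupbycol (a custom helper that performs group-based summations; it is not part of NumPy and its definition is not included here) and hence solve our case, like so -

sums  = groupbycol(distances, assume_sorted_col=False, colID=-1)
# np.bincount needs integer input, so cast the float labels first
out = sums/np.bincount(distances[:,-1].astype(int)).astype(float)[:,None]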
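Since groupbycol is not defined in this thread, here is a minimal sketch of the same idea in plain NumPy. It is an assumption about what such a group-sum helper does, not its actual implementation: np.add.at accumulates the per-group row sums and np.bincount supplies the group sizes. The names labels, features, and group_means are only illustrative.

```python
import numpy as np

distances = np.array([
    [5, 1, 1, 1, 2, 1, 3, 1, 1, 1],
    [5, 4, 4, 5, 7, 10, 3, 2, 1, 1],
    [3, 1, 1, 1, 2, 2, 3, 1, 1, 0],
    [6, 8, 8, 1, 3, 4, 3, 7, 1, 1],
    [4, 1, 1, 3, 2, 1, 3, 1, 1, 1],
    [8, 10, 10, 8, 7, 10, 9, 7, 1, 1],
    [1, 1, 1, 1, 2, 10, 3, 1, 1, 0],
    [2, 1, 2, 1, 2, 1, 3, 1, 1, 0],
    [2, 1, 1, 1, 2, 1, 1, 1, 5, 2],
    [4, 2, 1, 1, 2, 1, 2, 1, 1, 1],
], dtype=float)

labels = distances[:, -1].astype(int)   # integer group IDs from the last column
features = distances[:, :-1]            # the nine columns to average

# Accumulate each row into the slot of its group (unbuffered, so repeated
# labels are all added rather than overwritten).
sums = np.zeros((labels.max() + 1, features.shape[1]))
np.add.at(sums, labels, features)

counts = np.bincount(labels)            # number of rows in each group
group_means = sums / counts[:, None]    # shape (3, 9)
```

Note that a label value with no rows would give a zero count and therefore NaN in that row of group_means; that does not happen with the array in the question.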

With matrix-multiplication

mask = distances[:,-1,None] == np.arange(distances[:,-1].max()+1)
out = mask.T.dot(distances)/mask.sum(0)[:,None].astype(float)
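To see why the matrix-multiplication approach works, here is a hedged, self-contained sketch on the array from the question. Each column of the boolean mask marks the rows of one group, so multiplying its transpose by the data sums the rows per group. Unlike the snippet above, this version drops the last column first so the result is 3*9; the names labels, one_hot, and means are illustrative.

```python
import numpy as np

distances = np.array([
    [5, 1, 1, 1, 2, 1, 3, 1, 1, 1],
    [5, 4, 4, 5, 7, 10, 3, 2, 1, 1],
    [3, 1, 1, 1, 2, 2, 3, 1, 1, 0],
    [6, 8, 8, 1, 3, 4, 3, 7, 1, 1],
    [4, 1, 1, 3, 2, 1, 3, 1, 1, 1],
    [8, 10, 10, 8, 7, 10, 9, 7, 1, 1],
    [1, 1, 1, 1, 2, 10, 3, 1, 1, 0],
    [2, 1, 2, 1, 2, 1, 3, 1, 1, 0],
    [2, 1, 1, 1, 2, 1, 1, 1, 5, 2],
    [4, 2, 1, 1, 2, 1, 2, 1, 1, 1],
], dtype=float)

labels = distances[:, -1].astype(int)

# one_hot[i, k] is True when row i belongs to group k; shape (10, 3)
one_hot = labels[:, None] == np.arange(labels.max() + 1)

# (3, 10) @ (10, 9) -> per-group column sums, shape (3, 9)
sums = one_hot.T.astype(float) @ distances[:, :-1]

# Divide each group's sums by the number of rows in that group
means = sums / one_hot.sum(axis=0)[:, None]
```

The first row of means starts with 2, 1, 1.33, 1, 2, which matches the c0 values quoted in the comments below.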
Divakar
  • This approach does not give the desired output. For example, c0 = [2, 1, 1.33, 1, 2, ...], i.e. for all rows where the last column is 0, average each column over those rows and build a new array from the averaged values. – R_Moose Apr 10 '19 at 21:52
  • @R_Moose So your code must be: `np.mean(distances[distances[:,-1]==0][::,0],axis=0)` and so on? – Divakar Apr 10 '19 at 21:56
  • Yes. Basically, np.mean(distances[distances[:,-1]==0][::,0]) gives me the mean of column 0 over all rows where the last column is 0, so I get one value. In my case I need to run this 9 times to get 9 such values and then build an array by stacking them. I was looking for a simpler approach. – R_Moose Apr 10 '19 at 21:57
  • @R_Moose Please edit the question and put the correct code there. – Divakar Apr 10 '19 at 21:58
  • With my question I was trying to convey what I have tried, and to ask whether there is a simpler way to do the same for all 9 columns at once. – R_Moose Apr 10 '19 at 21:59
0

I was able to do it like this:

c0 = (distances[distances[:,-1] == 0][:,0:9]).mean(axis=0)
c1 = (distances[distances[:,-1] == 1][:,0:9]).mean(axis=0)
c2 = (distances[distances[:,-1] == 2][:,0:9]).mean(axis=0)
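As a possible follow-up, the three masked means can be computed in a loop and combined. This is a sketch under the assumption that the group labels are exactly 0, 1, and 2 as in the question; the names rows, final, and flat are illustrative. It also shows that np.vstack, rather than np.hstack, is what produces a 3*9 result from three length-9 arrays.

```python
import numpy as np

distances = np.array([
    [5, 1, 1, 1, 2, 1, 3, 1, 1, 1],
    [5, 4, 4, 5, 7, 10, 3, 2, 1, 1],
    [3, 1, 1, 1, 2, 2, 3, 1, 1, 0],
    [6, 8, 8, 1, 3, 4, 3, 7, 1, 1],
    [4, 1, 1, 3, 2, 1, 3, 1, 1, 1],
    [8, 10, 10, 8, 7, 10, 9, 7, 1, 1],
    [1, 1, 1, 1, 2, 10, 3, 1, 1, 0],
    [2, 1, 2, 1, 2, 1, 3, 1, 1, 0],
    [2, 1, 1, 1, 2, 1, 1, 1, 5, 2],
    [4, 2, 1, 1, 2, 1, 2, 1, 1, 1],
], dtype=float)

last = distances[:, -1]

# One length-9 mean vector per label: select the matching rows,
# keep the first nine columns, and average down the rows.
rows = [distances[last == k, :9].mean(axis=0) for k in (0, 1, 2)]

final = np.vstack(rows)   # shape (3, 9): one row per label
flat = np.hstack(rows)    # shape (27,): the same values laid end to end
```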
R_Moose