
I have a list or numpy array like this:

[[3,   2,   1,   2,   3,   3  ],
 [3.1, 2.2, 1.1, 2.1, 3.3, 3.2]]

Based on the same first-row value, the second-row values should be grouped into the following lists:

[1.1], [2.1,2.2], [3.1,3.2,3.3]

For each list above I want to compute:

sum(abs(list - avg_list))
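
For example, for the group [3.1, 3.2, 3.3] the average is 3.2, so the result is |3.1-3.2| + |3.2-3.2| + |3.3-3.2| = 0.2; the three groups above give 0.0, 0.1 and 0.2 respectively.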

Besides finding all 2nd-row values that share the same 1st-row value one by one and then processing them, is there a parallel (vectorized) solution?

What I've tried is as follows:

import numpy as np

a = np.array([[3,   2,   1,   2,   3,   3  ],
              [3.1, 2.2, 1.1, 2.1, 3.3, 3.2]])

a = np.sort(a)            # sorts each row independently; harmless for this data
a_0 = np.unique(a[0, :])  # the distinct 1st-row labels: [1., 2., 3.]

result = []
for b in a_0:
    # 2nd-row values whose 1st-row label equals b
    a_1 = np.extract(a[0, :] == b, a[1, :])
    result.append(np.sum(np.abs(a_1 - np.mean(a_1))))
XJY95
  • What do you mean by a parallel-process solution? – Dani Mesejo Oct 11 '21 at 07:49
  • Possibly related: https://stackoverflow.com/a/43094244/6189984 – bartolja Oct 11 '21 at 07:50
  • Include what you have tried and how fast it is; my advice is to measure first – Dani Mesejo Oct 11 '21 at 07:55
  • @DaniMesejo I have added my current solution. By parallel I mean an approach that avoids the for loop – XJY95 Oct 11 '21 at 08:01
  • @bartolja Thanks! It seems the subsequent operations can't be done without a for loop? – XJY95 Oct 11 '21 at 08:03
  • Show _**actual code**_ of what you have tried, not the "concept of code" you _think_ you tried. _"What I've tried is using np.where with the condition = value of 1st row and using for loop to find these lists"_ does not meet that requirement. And there's no `3.2` in your original list but it somehow shows up in place of `3.1`. – aneroid Oct 11 '21 at 08:20
  • @aneroid Sorry, I have attached my code and corrected the original example – XJY95 Oct 11 '21 at 16:34
  • I'm guessing your real concern is speed, not "parallel" per se. "Vectorize" in a numpy context normally means performing the task with compiled numpy methods, so you don't need to iterate in Python. But here you are collecting, at least as an intermediate step, lists (or arrays) that can vary in length. That strongly indicates that a "pure" numpy approach isn't possible. For a no-loops approach you have to think outside the box. – hpaulj Oct 11 '21 at 17:59

1 Answer

Here's a no-loop approach. I map the data onto a nan-filled array (created with np.full) using idx, then use some of the np.nan* functions to perform the math in a way that excludes the nans.

In [101]: res = np.full((6, 3), np.nan)    # one row per value, one column per group
In [102]: idx = np.array([3,   2,   1,   2,   3,   3  ])
In [103]: data = np.array([3.1, 2.2, 1.1, 2.1, 3.3, 3.2])
In [104]: res[np.arange(6), idx-1] = data  # scatter each value into its group's column
In [105]: res
Out[105]: 
array([[nan, nan, 3.1],
       [nan, 2.2, nan],
       [1.1, nan, nan],
       [nan, 2.1, nan],
       [nan, nan, 3.3],
       [nan, nan, 3.2]])
In [106]: np.nanmean(res, axis=0)
Out[106]: array([1.1 , 2.15, 3.2 ])
In [107]: res-np.nanmean(res, axis=0)
Out[107]: 
array([[           nan,            nan, -1.0000000e-01],
       [           nan,  5.0000000e-02,            nan],
       [ 0.0000000e+00,            nan,            nan],
       [           nan, -5.0000000e-02,            nan],
       [           nan,            nan,  1.0000000e-01],
       [           nan,            nan, -4.4408921e-16]])
In [108]: np.abs(res-np.nanmean(res, axis=0))
Out[108]: 
array([[          nan,           nan, 1.0000000e-01],
       [          nan, 5.0000000e-02,           nan],
       [0.0000000e+00,           nan,           nan],
       [          nan, 5.0000000e-02,           nan],
       [          nan,           nan, 1.0000000e-01],
       [          nan,           nan, 4.4408921e-16]])
In [109]: np.nansum(np.abs(res-np.nanmean(res, axis=0)), axis=0)
Out[109]: array([0. , 0.1, 0.2])
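
Putting the session's steps into one reusable function (the name group_abs_dev and the use of idx.max() for the number of groups are my choices; this assumes the labels are the integers 1..k):

import numpy as np

def group_abs_dev(idx, data):
    # idx holds integer group labels 1..k, data the values to group
    n, k = len(data), idx.max()
    res = np.full((n, k), np.nan)                # nan-filled scratch array
    res[np.arange(n), idx - 1] = data            # scatter values into group columns
    dev = np.abs(res - np.nanmean(res, axis=0))  # deviation from each group's mean
    return np.nansum(dev, axis=0)

group_abs_dev(np.array([3, 2, 1, 2, 3, 3]),
              np.array([3.1, 2.2, 1.1, 2.1, 3.3, 3.2]))
# -> array([0. , 0.1, 0.2])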

Mapping onto a 0-filled array might also work, since sum etc. isn't bothered by excess 0s (though the mean would then need the group counts; a sketch follows).
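
A minimal sketch of that 0-filled variant, assuming a boolean mask is kept alongside (a plain mean would count the padding zeros, so the group means come from sums divided by counts, and the deviations are masked so empty cells don't contribute |0 - mean|):

import numpy as np

idx = np.array([3, 2, 1, 2, 3, 3])
data = np.array([3.1, 2.2, 1.1, 2.1, 3.3, 3.2])

res = np.zeros((6, 3))
mask = np.zeros((6, 3), dtype=bool)
res[np.arange(6), idx - 1] = data
mask[np.arange(6), idx - 1] = True   # remember which cells hold real values

counts = mask.sum(axis=0)            # group sizes: [1, 2, 3]
means = res.sum(axis=0) / counts     # the padding zeros add nothing to the sums
dev = np.abs(res - means) * mask     # zero out the padding cells
dev.sum(axis=0)                      # -> array([0. , 0.1, 0.2])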

I can't vouch for the speed.

Your code, with the missing result:

In [110]: a = np.sort(np.array((idx,data)))
     ...: a_0 = np.unique(a[0,:])
     ...: 
     ...: result = []
     ...: for b in a_0:
     ...:   a_1 = np.extract(a[0,:]==b,a[1,:])
     ...:   result.append(np.sum(np.abs(a_1-np.mean(a_1))))
In [111]: result
Out[111]: [0.0, 0.10000000000000009, 0.20000000000000018]
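
And a quick check, continuing the session, that the loop result agrees with the no-loop one:

In [112]: np.allclose(result, np.nansum(np.abs(res - np.nanmean(res, axis=0)), axis=0))
Out[112]: True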
hpaulj