Aggegate min max values in one column based on another column of array - Python

Question

I am trying to concatenate the values if they have the same indices. I am working with rectangular shape so I know:

There will always at least 2 of the same indices.
If there are more than 2 indices, I just need to store the maxs and mins.

Basically,

From:

a = array([
       [ 1,  5],
       [ 1,  7],
       [ 2,  8],
       [ 2, 10],
       [ 2, 22],
       [ 3, 55],
       [ 3, 77]])

To:

b = np.array([
       [ 1, 5, 7],
       [ 2, 8, 22], # [2,8,10,22] but the min is 8 and max is 22
       [ 3, 55, 77]])

I have tried to convert it to a list and going through each value using a for loop but it takes a considerable amount of time.

I've also tried sorting the array, np.sort(a, axis=0) and taking every other row, but since there can be more than two of the indices, it fails.

I am new to numpy, so don't know what else to try.

Any and all suggestion would be helpful, Thank You.

Edit: Its behavior is like a dictionary where the keys are a[0] and values are a[1:]

If there are more than 2 values, I only keep the min and max.

Just to be clear what you intend, could you show your list code? — hpaulj, Nov 18 '19 at 04:45
@hpaulj I am really sorry, I mistyped the second array, it's fixed now. I am basically trying to replicate a dictionary in numpy where the keys are the a[0] and the values are a[1:] — Abrian Abir, Nov 18 '19 at 05:11

score 2 · Answer 1 · answered Nov 18 '19 at 05:20

2

A way to do this using pandas

import pandas as pd
# create a dataframe with 2 columns corresponding to the columns of a
df = pd.DataFrame({ 'indices':a[:,0],'values':a[:,1]}) 
# compute min and max by indices
df2 = df.groupby('indices').agg({'values': ['min', 'max']}).reset_index()
# convert to numpy array
np.asarray(df2)
#array([[ 1,  5,  7],
#       [ 2,  8, 22],
#       [ 3, 55, 77]], dtype=int64)

answered Nov 18 '19 at 05:20

fmarm

4,209
1
17
29

Thank you for your quick response, I am trying to contain my project within numpy. Is it possible to replicate the same thing in numpy? if not, I'll go ahead and use pandas. – Abrian Abir Nov 18 '19 at 05:29
@MohammadIslam may i know the reason, why the numpy answers wasn't the accepted answer, where as a pandas based answer is ? Just to be sure incase if I can correct something – venkata krishnan Nov 18 '19 at 05:46
1

@venkatakrishnan I am sorry for rushing to the acceptance of the answer, I'll go through each answer to see how well does all of them work. I just don't have the time right now. – Abrian Abir Nov 18 '19 at 05:59
okay. if that's the case, you can go ahead and accept any answer. just wanted to be sure of any mistakes. Happy learning. – venkata krishnan Nov 18 '19 at 06:00

score 2 · Answer 2 · answered Nov 18 '19 at 05:30

A way of doing it with numpy, You can use numpy.split for splitting them into separete arrays, based on the value in the first axis. Then you can find the min and max.

For more information on splitting, and how it works, you can look at the answer here. I am not repeating the same here.

ar = np.split(a, np.flatnonzero(a[1:,0] != a[:-1,0])+1,axis=0)

The above line splits and produces a list of arrays for each unique value in axis 0.

The above line will produce an output like,

[
array([[1, 5],
       [1, 7]]),
array([[ 2,  8],
       [ 2, 10],
       [ 2, 22]]), 
array([[ 3, 55],
       [ 3, 77]])
]

Then you can iterate them to find the nature of the list you are expecting in your output.

final_list = []
for i in ar:
  final_list.append([i[1][0],np.min(i[:,1]),np.max(i[:,1])])
print(final_list)

The above code will produce the output like

[[1, 5, 7], [2, 8, 22], [3, 55, 77]]

Divakar · Accepted Answer · 2019-11-18T16:39:31.853

Approach #1

A vectorized NumPy way would be -

def agg_minmax(a):
    sidx = np.lexsort(a[:,::-1].T)
    b = a[sidx]
    m = np.r_[True,b[:-1,0]!=b[1:,0],True]
    return np.c_[b[m[:-1],:2], b[m[1:],1]]

Sample run -

# Generic case with input not-necessarily sorted by first col
In [35]: a
Out[35]: 
array([[ 3, 77],
       [ 2,  8],
       [ 1,  7],
       [ 2, 10],
       [ 1,  5],
       [ 3, 55],
       [ 2, 22]])

In [36]: agg_minmax(a)
Out[36]: 
array([[ 1,  5,  7],
       [ 2,  8, 22],
       [ 3, 55, 77]])

Approach #2

We can improve on memory to sort only the first row by sidx, like so -

def agg_minmax_v2(a):
    sidx = np.lexsort(a[:,::-1].T)
    b = a[sidx,0]
    m = np.r_[True,b[:-1]!=b[1:],True]
    return np.c_[a[sidx[m[:-1]]],a[sidx[m[1:]],1]]

This could be better with a lot of entries per group.

Alternative #1 : Get sidx using linear-index-mapping

For positive int numbers, we can assume them to be on 2D grid and hence get linear index equivalents for each row. Thus, we will skip lexsort and get sidx like so -

sidx = (a[:,0]*(a[:,1].max()+1) + a[:,1]).argsort()

Rest of the code after getting sidx stays the same in both of the earlier posted approaches.

Alternative #2 : Get sidx using views

We could use views to get sidx and hence again skip lexsort, like so -

# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

A = view1D(a)
sidx = A.argsort()

Thank you for your response, I like the compact and contained within numpy nature of the code. I'll test it tomorrow. — Abrian Abir, Nov 18 '19 at 06:04
Thank you. Only the the alternative method for sidx using `(a[:, 0]*(a[:, 1].max()+1) + a[:, 1]).argsort()` worked for me. The other methods fails, I am not sure exactly why. — Abrian Abir, Nov 18 '19 at 16:20
It fails on the test array `np.array([[ 6, 26], [ 6 ,27], [ 6 ,27], [ 6 ,28], [ 5 ,29], [ 4 ,29], [ 3 ,29], [ 2 ,29], [ 1 ,29], [ 0 ,29], [ 0 ,30], [ 0 ,35], [ 0 ,36], [ 0 ,37], [ 5 ,37], [ 6 ,38], [ 6 ,39], [ 6 ,40], [ 5 ,41], [ 4 ,41], [ 3 ,41], [ 2 ,41], [ 1 ,42], [ 0 ,44], [ 3 ,44], [ 4 ,43], [ 4 ,42], [ 9 ,42], [10 ,54], [10 ,53], [10 ,52], [10 ,51], [11 ,50], [12 ,50], [13 ,50], [14 ,50], [14 ,45], [14 ,44], [14 ,43], [13 ,42], [14 ,39], [14 ,38], [15 ,32], [15 ,31], [14 ,30], [13 ,29], [12 ,29], [11 ,29], [ 8 ,26], [ 7 ,26]])` — Abrian Abir, Nov 18 '19 at 16:23
@MohammadIslam Ah yeah, there was a bug indeed. Needed to flip cols. Please check out the edited code. — Divakar, Nov 18 '19 at 16:39

score 0 · Answer 4 · answered Nov 18 '19 at 05:33

One way of doing this (not so nicely) is by using normal lists.

# convert to list and sort if not already sorted
alist = a.tolist()
alist.sort()

# initial values for looping
currval = alist[0][0]
min     = alist[0][1]
max     = alist[0][1]

# new list to store results in
result = []

# loop through all rows of alist
for row in alist:
    if currval == row[0]: # still same index
        max = row[1]   # update max
    else:
        result.append([currval, min, max]) # save row
        currval = row[0] # update to next index
        min     = row[1]
        max     = row[1]

# save last row
result.append([currval, min, max]) # save row
currval = row[0] # update to next index
min     = row[1]
max     = row[1]

# convert output to nparray
b = np.array(result)

It makes use of Python's sort behaviour on lists which orders the lists nicely by grouping those with the same index and having the values in increasing order.

Thank You for your response, I had something similar to your code but I am working with OpenCv contours which can have over 1000 cords for each frame. The time really adds up. — Abrian Abir, Nov 18 '19 at 05:38
That's fair. My code really isn't a good solution at all, just thought I'd throw it out in case you needed something to work while exploring other solutions. Hope you find a good one! :) — CH., Nov 18 '19 at 05:39

Aggegate min max values in one column based on another column of array - Python

4 Answers4

Linked