
I have a 2D NumPy array that consists of (X, Y, Z, A) values, where (X, Y, Z) are Cartesian coordinates in 3D space, and A is some value at that location. As an example:

__X__|__Y__|__Z__|__A_
  13 |  7  |  21 | 1.5
  9  |  2  |  7  | 0.5
  15 |  3  |  9  | 1.1
  13 |  7  |  21 | 0.9
  13 |  7  |  21 | 1.7
  15 |  3  |  9  | 1.1

Is there an efficient way to find all the unique combinations of (X,Y), and add their values? For example, the total for (13,7) would be (1.5+0.9+1.7), or 4.1.
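For concreteness, that table as a 2D NumPy array would look something like this (the construction itself is just an illustration):

import numpy as np

data = np.array([[13, 7, 21, 1.5],
                 [ 9, 2,  7, 0.5],
                 [15, 3,  9, 1.1],
                 [13, 7, 21, 0.9],
                 [13, 7, 21, 1.7],
                 [15, 3,  9, 1.1]])   # columns: X, Y, Z, A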

AaronJPung

2 Answers


A scipy.sparse matrix takes this kind of information, but only for 2D:

sparse.coo_matrix((data, (row, col)))

where row and col are indices like your X,Y and Z. It sums duplicates.

The first step to doing that is a lexical sort of the indices. That puts points with matching coordinates next to each other.

The actual grouping and summing is done, I believe, in compiled code. Part of the difficulty in doing that fast in numpy terms is that there will be a variable number of elements in each group. Some will be unique, others might have 3 or more.

Python's itertools has a groupby tool. Pandas also has grouping functions. I can also imagine using a defaultdict to group and sum values.
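For instance, a rough sketch of the pandas and defaultdict routes on the question's sample values (the DataFrame column names are just placeholders):

import numpy as np
import pandas as pd
from collections import defaultdict

X = np.array([13, 9, 15, 13, 13, 15])
Y = np.array([7, 2, 3, 7, 7, 3])
A = np.array([1.5, 0.5, 1.1, 0.9, 1.7, 1.1])

# pandas: group on the (X, Y) pair and sum A
df = pd.DataFrame({'X': X, 'Y': Y, 'A': A})
print(df.groupby(['X', 'Y'])['A'].sum())

# defaultdict: accumulate A keyed on the (X, Y) tuple
totals = defaultdict(float)
for x, y, a in zip(X, Y, A):
    totals[(x, y)] += a
# totals[(13, 7)] -> 4.1, totals[(15, 3)] -> 2.2, totals[(9, 2)] -> 0.5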

The ufunc reduceat might also work, though it's easier to use in 1d than 2 or 3.

If you are ignoring the Z, the sparse coo_matrix approach may be easiest.

In [2]: X=np.array([13,9,15,13,13,15])
In [3]: Y=np.array([7,2,3,7,7,3])
In [4]: A=np.array([1.5,0.5,1.1,0.9,1.7,1.1])
In [5]: M=sparse.coo_matrix((A,(X,Y)))
In [15]: M.sum_duplicates()
In [16]: M.data
Out[16]: array([ 0.5,  2.2,  4.1])
In [17]: M.row
Out[17]: array([ 9, 15, 13])
In [18]: M.col
Out[18]: array([2, 3, 7])
In [19]: M
Out[19]: 
<16x8 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in COOrdinate format>
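If you want the result as one array of (X, Y, total) rows, the deduplicated attributes can be stacked back together, e.g.:

np.column_stack([M.row, M.col, M.data])
# array([[  9. ,   2. ,   0.5],
#        [ 15. ,   3. ,   2.2],
#        [ 13. ,   7. ,   4.1]])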

Here's what I had in mind with lexsort

In [32]: Z=np.array([21,7,9,21,21,9])
In [33]: xyz=np.stack([X,Y,Z],1)
In [34]: idx=np.lexsort([X,Y,Z])
In [35]: idx
Out[35]: array([1, 2, 5, 0, 3, 4], dtype=int32)
In [36]: xyz[idx,:]
Out[36]: 
array([[ 9,  2,  7],
       [15,  3,  9],
       [15,  3,  9],
       [13,  7, 21],
       [13,  7, 21],
       [13,  7, 21]])
In [37]: A[idx]
Out[37]: array([ 0.5,  1.1,  1.1,  1.5,  0.9,  1.7])

When sorted like this it becomes more evident that the Z coordinate is 'redundant', at least for this purpose.

Using reduceat to sum groups:

In [40]: np.add.reduceat(A[idx],[0,1,3])  
Out[40]: array([ 0.5,  2.2,  4.1])

(for now I just eyeballed the [0,1,3] list)
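One way to compute those group-start indices instead of eyeballing them is to look for positions where consecutive sorted (X, Y) rows change:

xy_sorted = xyz[idx, :2]
# a new group starts wherever a sorted row differs from the previous one
changes = (np.diff(xy_sorted, axis=0) != 0).any(axis=1)
starts = np.concatenate(([0], np.nonzero(changes)[0] + 1))
np.add.reduceat(A[idx], starts)     # array([ 0.5,  2.2,  4.1])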

hpaulj
  • These are great, thanks! I think `lexsort` was what I was planning to do the "hard way" -- I wasn't aware there was a function for it already, hence my question. I like the way the data is retained in `lexsort`, so I'll go with that. – AaronJPung May 04 '17 at 19:37

Approach #1

Get a view of each row, thus reducing each row to a single scalar, then use `np.unique` to tag each row with an integer ID in the range 0..n-1 (with n as the number of unique rows), and finally use `np.bincount` to sum the last column for each of those IDs.

Here's the implementation -

import numpy as np

def get_row_view(a):
    # view each row as a single void scalar so whole rows can be compared as units
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    a = np.ascontiguousarray(a)
    return a.reshape(a.shape[0], -1).view(void_dt).ravel()

def groupby_cols_view(x):
    a = x[:,:2].astype(int)       # (X, Y) columns as integers
    a1D = get_row_view(a)         # one scalar per (X, Y) row
    _, indx, IDs = np.unique(a1D, return_index=1, return_inverse=1)
    return np.c_[x[indx,:2], np.bincount(IDs, x[:,-1])]

Approach #2

Same as approach #1, but instead of working with the view, we generate an equivalent linear index for each row, again reducing each row to a scalar. The rest of the workflow is the same as in the first approach.

The implementation -

def groupby_cols_linearindex(x):
    a = x[:,:2].astype(int)       # (X, Y) columns as integers
    # collapse each (X, Y) pair to a unique linear index
    a1D = a[:,0] + a[:,1]*(a[:,0].max() - a[:,0].min() + 1)
    _, indx, IDs = np.unique(a1D, return_index=1, return_inverse=1)
    return np.c_[x[indx,:2], np.bincount(IDs, x[:,-1])]

Sample runs

In [80]: data
Out[80]: 
array([[ 2.        ,  5.        ,  1.        ,  0.40756048],
       [ 3.        ,  4.        ,  6.        ,  0.78945661],
       [ 1.        ,  3.        ,  0.        ,  0.03943097],
       [ 2.        ,  5.        ,  7.        ,  0.43663582],
       [ 4.        ,  5.        ,  0.        ,  0.14919507],
       [ 1.        ,  3.        ,  3.        ,  0.03680583],
       [ 1.        ,  4.        ,  8.        ,  0.36504428],
       [ 3.        ,  4.        ,  2.        ,  0.8598825 ]])

In [81]: groupby_cols_view(data)
Out[81]: 
array([[ 1.        ,  3.        ,  0.0762368 ],
       [ 1.        ,  4.        ,  0.36504428],
       [ 2.        ,  5.        ,  0.8441963 ],
       [ 3.        ,  4.        ,  1.64933911],
       [ 4.        ,  5.        ,  0.14919507]])

In [82]: groupby_cols_linearindex(data)
Out[82]: 
array([[ 1.        ,  3.        ,  0.0762368 ],
       [ 1.        ,  4.        ,  0.36504428],
       [ 3.        ,  4.        ,  1.64933911],
       [ 2.        ,  5.        ,  0.8441963 ],
       [ 4.        ,  5.        ,  0.14919507]])
Divakar
  • I actually wound up using your answer -- the addition of the "Sample runs" is really helpful, to double check the function. – AaronJPung May 04 '17 at 21:59