
I have an array with coordinates of N points. Another array contains the masses of these N points.

 >>> import numpy as np
 >>> N=10
 >>> xyz=np.random.randint(0,2,(N,3))
 >>> mass=np.random.rand(len(xyz))
 >>> xyz
 array([[1, 0, 1],
        [1, 1, 0],
        [0, 1, 1],
        [0, 0, 0],
        [0, 1, 0],
        [1, 1, 0],
        [1, 0, 1],
        [0, 0, 1],
        [1, 0, 1],
        [0, 0, 1]])
 >>> mass
 array([ 0.38668401,  0.44385111,  0.47756182,  0.74896529,  0.20424403,
         0.21828435,  0.98937523,  0.08736635,  0.24790248,  0.67759276])

Now I want to obtain an array with unique values of xyz and a corresponding array of summed up masses. That means the following arrays:

 >>> xyz_unique
 array([[0, 1, 1],
        [1, 1, 0],
        [0, 0, 1],
        [1, 0, 1],
        [0, 0, 0],
        [0, 1, 0]])
 >>> mass_unique
 array([ 0.47756182,  0.66213546,  0.76495911,  1.62396172,  0.74896529,
         0.20424403])

My attempt was the following code with a double for-loop:

 >>> xyz_unique=np.array(list(set(tuple(p) for p in xyz)))
 >>> mass_unique=np.zeros(len(xyz_unique))
 >>> for j in np.arange(len(xyz_unique)):
 ...     indices=np.array([],dtype=np.int64)
 ...     for i in np.arange(len(xyz)):
 ...         if np.all(xyz[i]==xyz_unique[j]):
 ...             indices=np.append(indices,i)
 ...     mass_unique[j]=np.sum(mass[indices])

The problem is that this takes far too long; my actual data has N=100000. Is there a faster way, or how could I improve my code?

EDIT: My coordinates are actually floating-point numbers. To keep the example simple, I used random integers here so that there are duplicates even at low N.

Andy
    Related: [Find unique rows in numpy.array](http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array) – Ashwini Chaudhary Apr 10 '15 at 17:10
  • This looks interesting... the accepted answer there creates the indices of the unique values very fast, but I lose the information about the duplicates, so it gets hard to reconstruct which mass values to sum up afterwards. Or which answer should I consider more carefully? – Andy Apr 10 '15 at 17:30

1 Answer


Case 1: Binary numbers in xyz

If the elements of the input array xyz are only 0s and 1s, you can convert each row into a decimal number and then label the rows by the uniqueness of those decimal numbers. Based on those labels, you can use np.bincount to accumulate the summations, just like one would use accumarray in MATLAB. Here's an implementation that achieves all of that -

import numpy as np

# Input arrays xyz and mass
xyz = np.array([[1, 0, 1],
                [1, 1, 0],
                [0, 1, 1],
                [0, 0, 0],
                [0, 1, 0],
                [1, 1, 0],
                [1, 0, 1],
                [0, 0, 1],
                [1, 0, 1],
                [0, 0, 1]])

mass = np.array([0.38668401, 0.44385111, 0.47756182, 0.74896529, 0.20424403,
                 0.21828435, 0.98937523, 0.08736635, 0.24790248, 0.67759276])

# Convert each row of xyz into an equivalent decimal number
dec_num = np.dot(xyz, 2**np.arange(xyz.shape[1])[:, None])

# Get indices of the first occurrences of the unique values and also label each row
_, unq_idx, row_labels = np.unique(dec_num, return_index=True, return_inverse=True)

# Find unique rows from xyz array
xyz_unique = xyz[unq_idx, :]

# Accumulate the summations from mass based on the row labels
mass_unique = np.bincount(row_labels, weights=mass)

Output -

In [148]: xyz_unique
Out[148]: 
array([[0, 0, 0],
       [0, 1, 0],
       [1, 1, 0],
       [0, 0, 1],
       [1, 0, 1],
       [0, 1, 1]])

In [149]: mass_unique
Out[149]: 
array([ 0.74896529,  0.20424403,  0.66213546,  0.76495911,  1.62396172,
        0.47756182])
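
Note that np.unique sorts the decimal codes, so xyz_unique comes out in sorted-code order rather than in the order shown in the question; the pairing between xyz_unique and mass_unique is still consistent. A quick sanity check of the encoding (column weights 1, 2, 4):

# Row [1, 0, 1] encodes to 1*1 + 0*2 + 1*4 = 5, row [1, 1, 0] to 3, etc.,
# giving dec_num.ravel() == [5, 3, 6, 0, 2, 3, 5, 4, 5, 4].
# np.unique sorts these codes to [0, 2, 3, 4, 5, 6], whose first
# occurrences are rows 3, 4, 1, 7, 0, 2 of xyz -- the order seen above.
assert list(dec_num.ravel()) == [5, 3, 6, 0, 2, 3, 5, 4, 5, 4]
assert list(unq_idx) == [3, 4, 1, 7, 0, 2]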

Case 2: Generic

For the general case, you can use this -

import numpy as np

# Perform a lexicographic sort and get the sorting indices
sorted_idx = np.lexsort(xyz.T)
sorted_xyz = xyz[sorted_idx, :]

# Row-wise differences of the sorted array; a sorted row starts a new
# group whenever it differs from its predecessor, and the first sorted
# row always starts a new group, hence the prepended True
df1 = np.diff(sorted_xyz, axis=0)
df2 = np.append([True], np.any(df1 != 0, axis=1), axis=0)

# Label the groups of duplicate rows (cumsum starts at 1, hence the -1)
sorted_labels = df2.cumsum(0) - 1

# Map the labels back onto the original row order
labels = np.zeros_like(sorted_idx)
labels[sorted_idx] = sorted_labels

# Get indices of the unique rows
unq_idx = sorted_idx[df2]

# Get the unique xyz's and the summed masses using accumulation with bincount
xyz_unique = xyz[unq_idx, :]
mass_unique = np.bincount(labels, weights=mass)

Sample run -

In [238]: xyz
Out[238]: 
array([[1, 2, 1],
       [1, 2, 1],
       [0, 1, 0],
       [1, 0, 1],
       [2, 1, 2],
       [2, 1, 1],
       [0, 1, 0],
       [1, 0, 0],
       [2, 1, 0],
       [2, 0, 1]])

In [239]: mass
Out[239]: 
array([ 0.5126308 ,  0.69075674,  0.02749734,  0.384824  ,  0.65151772,
        0.77718427,  0.18839268,  0.78364902,  0.15962722,  0.09906355])

In [240]: xyz_unique
Out[240]: 
array([[1, 0, 0],
       [0, 1, 0],
       [2, 1, 0],
       [1, 0, 1],
       [2, 0, 1],
       [2, 1, 1],
       [1, 2, 1],
       [2, 1, 2]])

In [241]: mass_unique
Out[241]: 
array([ 0.78364902,  0.21589002,  0.15962722,  0.384824  ,  0.09906355,
        0.77718427,  1.20338754,  0.65151772])
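
For completeness: on newer NumPy versions (1.13 and later), np.unique accepts an axis argument, which handles the generic case in two lines. Here is a minimal sketch assuming such a version; the np.round step is one possible way to deal with the float coordinates from the question's edit (rounding to 5 decimals is an assumption taken from the comments):

import numpy as np

# Quantize the float coordinates so that near-identical points compare
# equal (rounding to 5 decimals is an assumption from the comments)
xyz_q = np.round(xyz, 5)

# Unique rows plus, for every input row, the index of its unique row
xyz_unique, labels = np.unique(xyz_q, axis=0, return_inverse=True)

# Sum the masses that share a label; ravel() guards against NumPy
# versions where return_inverse yields a multidimensional index array
mass_unique = np.bincount(labels.ravel(), weights=mass)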
Divakar
  • Thanks, this is a good trick! But my coordinates are actually float numbers with up to 5 decimals. – Andy Apr 10 '15 at 18:49
  • Wow, incredible! How did you come up with this? I don't understand one thing: the [True] that you append in df2 is subtracted again in sorted_labels below... why is this necessary? – Andy Apr 10 '15 at 20:37
  • @Andy Well, I sorted the array and then did `diff` to look for unique elements along the rows. In that sorted array, the first element is always unique, and `diff` reduces the length by 1. So, to compensate for the one-element-shorter *diff*-ed array and for the fact that the first sorted element is always unique, that True was appended at the start. That creates the logical array which can be used to map onto the rows of the input data. Hope that makes sense! Just curious, how much of a speedup are you seeing with this? – Divakar Apr 10 '15 at 20:44
  • @Andy That subtraction below is there because appending True and then taking cumsum makes the labels start from 1, and since we need to use this array for indexing, we have to subtract 1. – Divakar Apr 10 '15 at 20:45
  • Right, this makes sense. Now it takes less than 1 sec! I was considering multiprocessing to parallelize my for-loop, and even that still took ~3 h. Thanks!!! – Andy Apr 10 '15 at 20:52
  • @Andy That's crazy!! Well, awesome, thanks for the runtime updates! – Divakar Apr 10 '15 at 20:53
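
To make the labeling step discussed in the comments concrete, here is a minimal sketch on a tiny, already-sorted toy array (hypothetical values, chosen only to show the intermediate arrays):

import numpy as np

# A toy array that is already lexicographically sorted;
# rows 0 and 1 are duplicates, row 2 starts a new group
sorted_xyz = np.array([[0, 0, 1],
                       [0, 0, 1],
                       [1, 0, 1]])

# Differences between consecutive rows: an all-zero row marks a duplicate
df1 = np.diff(sorted_xyz, axis=0)                   # [[0, 0, 0], [1, 0, 0]]

# Prepend True because the first sorted row always starts a new group
df2 = np.append([True], np.any(df1 != 0, axis=1))   # [True, False, True]

# cumsum makes the labels start at 1, so subtract 1 for 0-based labels
sorted_labels = df2.cumsum() - 1                    # [0, 0, 1]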