I have two arrays (data and final). I would like to compare them and return, as out, the rows of data that are not in final.

data:

x        y      z
10.2    15.2    25.2
15.2    17.2    40.2
12.2    13.2    5.2
14.2    14.2    34.2
12.2    12.2    56.2
13.2    17.2    32.2
11.2    13.2    21.2

final:

x        y      z
15.2    17.2    40.2
14.2    14.2    34.2
12.2    12.2    56.2

out:

x        y      z
10.2    15.2    25.2
12.2    13.2    5.2
13.2    17.2    32.2
11.2    13.2    21.2

This is what I have done,

out = [np.column_stack(data[k]) for k in range(len(data)) if data[k] not in final]
out = np.vstack(out)

Problem

The problem is that I have to compute out about 10,000 times (the example above is just one of those cases), so speed is my main concern.

Is there an efficient way to perform this?

user2554925
1 Answer

Here's one approach -

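import numpy as np

def remrows(a, b):
    # Remove from a every row that also appears in b (exact match).
    ab = np.vstack((a, b))               # a's rows first, then b's
    sidx = np.lexsort(ab.T)              # stable row sort; duplicates become adjacent
    ab_sorted = ab[sidx]
    # Positions whose next sorted row is identical; stability puts a's copy first
    idx = np.flatnonzero((ab_sorted[1:] == ab_sorted[:-1]).all(1))
    return np.delete(a, sidx[idx], axis=0)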

If you want to allow some tolerance when comparing those floating-point values, you could use np.isclose() instead of == in the idx step.
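As a rough sketch of the tolerance case, here is a variant that swaps the sort-based lookup for a plain broadcasting comparison with np.isclose(). This is a different technique from the function above, chosen because rows that are only approximately equal are not guaranteed to end up in a predictable order after sorting. The name remrows_close and the rtol/atol defaults (copied from np.isclose) are assumptions for illustration, and the intermediate array holds len(a) * len(b) * ncols booleans, so it only suits moderately sized inputs.

```python
import numpy as np

def remrows_close(a, b, rtol=1e-05, atol=1e-08):
    """Return the rows of a that have no approximately equal row in b.

    Hypothetical helper: the name and the default tolerances are assumptions.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # close[i, j, k] is True where a[i, k] and b[j, k] agree within tolerance
    close = np.isclose(a[:, None, :], b[None, :, :], rtol=rtol, atol=atol)
    # A row of a is matched if all of its columns agree with some row of b
    matched = close.all(axis=2).any(axis=1)
    return a[~matched]

a = np.array([[10.2, 15.2, 25.2],
              [15.2, 17.2, 40.2],
              [12.2, 13.2, 5.2]])
# Second row of a, perturbed slightly in the first column
b = np.array([[15.2 + 1e-9, 17.2, 40.2]])
kept = remrows_close(a, b)
```

Here kept holds the first and third rows of a, because the perturbed row in b still counts as a match for the second row.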

Sample run -

In [222]: a = np.random.randint(111,999,(10,3)).astype(float)/10.0

In [223]: a
Out[223]: 
array([[ 51.3,  66.3,  58.8],
       [ 24.3,  40.6,  37.8],
       [ 94.7,  28.8,  69.3],
       [ 21.8,  48.3,  57.5],
       [ 87.1,  81.9,  27.9],
       [ 14.2,  36.4,  22.2],
       [ 56.7,  58.7,  16.2],
       [ 66.2,  99.1,  62.5],
       [ 75.1,  27.8,  34.4],
       [ 59.7,  73.8,  22.3]])

In [224]: b = a[[1,3,5]]

In [225]: remrows(a, b)
Out[225]: 
array([[ 51.3,  66.3,  58.8],
       [ 94.7,  28.8,  69.3],
       [ 87.1,  81.9,  27.9],
       [ 56.7,  58.7,  16.2],
       [ 66.2,  99.1,  62.5],
       [ 75.1,  27.8,  34.4],
       [ 59.7,  73.8,  22.3]])
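
For reference, the same sort-based idea can be applied to the data and final arrays from the question. This is a minimal sketch that restates the function so the block runs on its own. It uses np.vstack for the stacking step and assumes the rows within data are unique and that matching rows in final are bit-for-bit equal to those in data.

```python
import numpy as np

def remrows(a, b):
    # Remove from a every row that also appears in b (exact match).
    ab = np.vstack((a, b))
    sidx = np.lexsort(ab.T)          # stable sort, so a's copy precedes b's copy
    ab_sorted = ab[sidx]
    idx = np.flatnonzero((ab_sorted[1:] == ab_sorted[:-1]).all(1))
    return np.delete(a, sidx[idx], axis=0)

data = np.array([[10.2, 15.2, 25.2],
                 [15.2, 17.2, 40.2],
                 [12.2, 13.2, 5.2],
                 [14.2, 14.2, 34.2],
                 [12.2, 12.2, 56.2],
                 [13.2, 17.2, 32.2],
                 [11.2, 13.2, 21.2]])

final = np.array([[15.2, 17.2, 40.2],
                  [14.2, 14.2, 34.2],
                  [12.2, 12.2, 56.2]])

out = remrows(data, final)
print(out)
```

The result keeps the four rows of data listed under out in the question, in their original order.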
Divakar