
I have a 2d numpy array, call it C:

import numpy as np

A = np.array([1,10,2])
B = np.array([4,-2,5])
C = np.vstack([A,B])

and another 2d numpy array, call it G:

E = np.array([4,2,6])
F = np.array([0,5,30])
G = np.vstack([E,F])

I would like to return a 1d boolean array that is True wherever a column of G matches some column of C, so in this case

output = [False,True,False]

The second element here is True because (2,5) is the second column of G and also matches the third column of C.

In reality, C and G are arrays with ~3 million elements, but figuring this out for this small example should be good enough!

5 Answers


I believe this fits your needs for the given example. I'm not good enough with numpy to know if it will scale well to millions of records, though.

import numpy as np

A = np.array([1,10,2])
B = np.array([4,-2,5])
C = np.vstack([A,B]).T   # one row per column of the original C

E = np.array([4,2,6])
F = np.array([0,5,30])
G = np.vstack([E,F]).T   # one row per column of the original G

# a row of G counts as a match only when every entry of some row of C agrees with it
matches = [(C == g).all(axis=1).any() for g in G]
print(matches)   # [False, True, False]
– SteveJ

You may define a contiguous view and use np.in1d

# pack each column into a single structured element so whole columns compare at once
make_view = lambda a : np.ascontiguousarray(a.T).view([('', a.dtype)] * a.shape[0]).T.ravel()
Cv, Gv = make_view(C), make_view(G)

>>> np.in1d(Gv, Cv)
array([False,  True, False])
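On newer NumPy, np.isin is the documented successor to np.in1d and should behave the same on these 1-D views (a minimal sketch, same Cv and Gv as above):

>>> np.isin(Gv, Cv)
array([False,  True, False])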
– rafaelc

You didn't mention the number of columns you have, so I assumed it's small.

# C_r[k,i,j] = C[k,i] and G_r[k,i,j] = G[k,j], lining every column of C up against every column of G
C_r = np.repeat(C[:,:,np.newaxis],G.shape[1],axis=2)
G_r = np.repeat(G[:,:,np.newaxis],C.shape[1],axis=2)
G_r = np.transpose(G_r,(0,2,1))

# compare element-wise rather than summing differences, which could cancel out
a = np.all(G_r == C_r, axis=0)
np.any(a,axis=0)
Out[95]: array([False,  True, False])
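For what it's worth, plain broadcasting should give the same answer without materialising the repeated copies (an untested sketch on the same C and G):

a = np.all(C[:,:,None] == G[:,None,:], axis=0)  # a[i,j]: does column i of C equal column j of G?
np.any(a, axis=0)                               # one flag per column of G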
– HadarM

>>> g = G.transpose()
>>> c = set(map(tuple, C.transpose()))
>>> np.array([tuple(item) in c for item in g])
array([False,  True, False])
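If building a Python tuple per column is too slow at this scale, hashing the raw bytes of each column might be cheaper (a rough sketch that assumes C and G share the same dtype):

>>> c_keys = {col.tobytes() for col in np.ascontiguousarray(C.T)}
>>> np.array([col.tobytes() in c_keys for col in np.ascontiguousarray(G.T)])
array([False,  True, False])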
– Ravi Sharma

Just to throw my pandas idea in here, too:

import pandas as pd

dfc = pd.DataFrame(C).apply(tuple)   # one tuple per column of C
dfg = pd.DataFrame(G).apply(tuple)   # one tuple per column of G

print(dfg.isin(dfc))

# 0    False
# 1     True
# 2    False
# dtype: bool

However, converting millions of elements to tuples might be no fun... :)
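If the tuple conversion ever becomes the bottleneck, a left merge with indicator=True on the transposed frames might avoid it (an untested sketch on the same C and G, using new frames dfc2 and dfg2):

dfc2 = pd.DataFrame(C.T).drop_duplicates()   # one row per (unique) column of C
dfg2 = pd.DataFrame(G.T)                     # one row per column of G
merged = dfg2.merge(dfc2, how='left', indicator=True)
print(merged['_merge'].eq('both').to_numpy())

# [False  True False]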

– SpghttCd

Thanks for changing the accepted answer to mine. However, I'd be interested in the reason. I haven't measured performance yet, but I can hardly imagine this is the most efficient answer when @rafaelc has also posted an approach, and numpy used well often beats other algorithms. – SpghttCd Sep 16 '19 at 18:59