
I have a 2d numpy array, call it C:

import numpy as np

A = np.array([1,10,2])
B = np.array([4,-2,5])
C = np.vstack([A,B])

and another 2d numpy array, call it G:

E = np.array([4,2,6])
F = np.array([0,5,30])
G = np.vstack([E,F])

I would like to return a 1d boolean array that is True wherever a column of G matches some column of C, so in this case

output = [False,True,False]

The second element here is True because (2,5) is the second column of G and also matches the third column of C.

In reality, C and G are arrays with ~3 million elements, but figuring this out for this small example should be good enough!

5 Answers


I believe this fits your needs for the given example. I'm not good enough with numpy to know if it will scale well to millions of records, though.

import numpy as np

A = np.array([1,10,2])
B = np.array([4,-2,5])
C = np.vstack([A,B]).T   # one row per column of the original C

E = np.array([4,2,6])
F = np.array([0,5,30])
G = np.vstack([E,F]).T   # one row per column of the original G

# a row of G counts as a match only when every entry of some row of C agrees with it
matches = [(C == g).all(axis=1).any() for g in G]
print(matches)   # [False, True, False]
– SteveJ

You may define a contiguous view and use np.in1d

# pack each column into a single structured element so whole columns compare at once
make_view = lambda a : np.ascontiguousarray(a.T).view([('', a.dtype)] * a.shape[0]).T.ravel()
Cv, Gv = make_view(C), make_view(G)

>>> np.in1d(Gv, Cv)
array([False,  True, False])
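On newer NumPy, np.isin is the documented successor to np.in1d and should behave the same on these 1-D views (a minimal sketch, same Cv and Gv as above):

>>> np.isin(Gv, Cv)
array([False,  True, False])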
– rafaelc

You didn't mention the number of columns you have, so I assumed it's small.

# C_r[k,i,j] = C[k,i] and G_r[k,i,j] = G[k,j], lining every column of C up against every column of G
C_r = np.repeat(C[:,:,np.newaxis],G.shape[1],axis=2)
G_r = np.repeat(G[:,:,np.newaxis],C.shape[1],axis=2)
G_r = np.transpose(G_r,(0,2,1))

# compare element-wise rather than summing differences, which could cancel out
a = np.all(G_r == C_r, axis=0)
np.any(a,axis=0)
Out[95]: array([False,  True, False])
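For what it's worth, plain broadcasting should give the same answer without materialising the repeated copies (an untested sketch on the same C and G):

a = np.all(C[:,:,None] == G[:,None,:], axis=0)  # a[i,j]: does column i of C equal column j of G?
np.any(a, axis=0)                               # one flag per column of G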
– HadarM

>>> g = G.transpose()
>>> c = set(map(tuple, C.transpose()))
>>> np.array([tuple(item) in c for item in g])
array([False,  True, False])
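If building a Python tuple per column is too slow at this scale, hashing the raw bytes of each column might be cheaper (a rough sketch that assumes C and G share the same dtype):

>>> c_keys = {col.tobytes() for col in np.ascontiguousarray(C.T)}
>>> np.array([col.tobytes() in c_keys for col in np.ascontiguousarray(G.T)])
array([False,  True, False])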
– Ravi Sharma

Just to throw my pandas idea in here, too:

import pandas as pd

dfc = pd.DataFrame(C).apply(tuple)   # one tuple per column of C
dfg = pd.DataFrame(G).apply(tuple)   # one tuple per column of G

print(dfg.isin(dfc))

# 0    False
# 1     True
# 2    False
# dtype: bool

However, converting millions of elements to tuples might be no fun... :)
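If the tuple conversion ever becomes the bottleneck, a left merge with indicator=True on the transposed frames might avoid it (an untested sketch on the same C and G, using new frames dfc2 and dfg2):

dfc2 = pd.DataFrame(C.T).drop_duplicates()   # one row per (unique) column of C
dfg2 = pd.DataFrame(G.T)                     # one row per column of G
merged = dfg2.merge(dfc2, how='left', indicator=True)
print(merged['_merge'].eq('both').to_numpy())

# [False  True False]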

– SpghttCd

Thanks for changing the accepted answer to mine. However, I'd be interested in the reason. I haven't measured performance yet, but I can hardly imagine this is the most efficient answer when @rafaelc has also posted an approach, and numpy used well often beats other algorithms. – SpghttCd Sep 16 '19 at 18:59