4

I have a numpy array of integer tuples of length 3 (x) and a numpy array of integer tuples of length 2 (y).

x = numpy.array([[3, 4, 5], [5, 12, 13], [6, 8, 10], [7, 24, 25]]) #first 4 elem
y = numpy.array([[3, 4], [4, 5], [3, 5], [5, 12]]) # first 4 elem

I am trying to compare elements within array y: [a, b] and [b, c] and [a, c] that are subsets of a single element [a, b, c] within array x. I call this function union. My loop heavy method to find the union is shown below. This is not too good for my arrays that contain 200K minimum elements.

def union(x, y):
for intx in range (len(x)):
    cond1 = cond2 = cond3 = 0
    for inty in range (len(y)):
        if (y[inty][0] == x[intx][0] and y[inty][1] == x[intx][1]): #[a, b] & [a, b, c]
            print ("condition 1 passed")
            cond1 = 1
        if (y[inty][0] == x[intx][1] and y[inty][1] == x[intx][2]): #[b, c] & [a, b, c]
            print ("condition 2 passed")
            cond2 = 1
        if (y[inty][0] == x[intx][0] and y[inty][1] == x[intx][2]): #[a, c] & [a, b, c]
            print ("condition 3 passed")
            cond3 = 1
        if (cond1 & cond2 & cond3):
            print("union found with ", x[intx])
            cond1 = cond2 = cond3 = 0
return

>>> union(x,y)
condition 1 passed
condition 2 passed
condition 3 passed
union found with  [3 4 5]
condition 1 passed

UPDATE #1: Example 1: This set of x and y have no union:

x = numpy.array([[21, 220, 221]])
y = numpy.array([[21, 220], [20, 21], [220,3021], [1220,3621], [60,221]])

UPDATE #2: Example 2: This set of x and y have no union:

x = numpy.array([[43, 924, 925]])
y = numpy.array([[43, 924], [924, 1643], [924,4307], [72, 925]])

Example 3: Here is a set of x and y that have a union of [4, 8, 16].

x = numpy.array([[4, 8, 16], [8, 4, 16]])
y = numpy.array([[4, 8], [8, 16], [4, 16]])

Example 4: Here is a set of x and y that have a union of [12, 14, 15].

x = numpy.array([[12, 13, 15], [12, 14, 15]])
y = numpy.array([[12, 14], [12, 13], [12, 15], [14, 15]])

Summary: In general terms, array x and y will have a union of [a, b, c] if

x = numpy.array([[a, b, c], ...])
y = numpy.array([[a, b], [b, c], [a, c],...])

or random ordering in y

y = numpy.array([[...[b, c], [a, c], ... [a, b]])

So my question: Is there a numpy way to do an array-wise operation? For example, numpy.logical_and suggests that x1 and x2 must be the same shape. It's not straightforward to me to replace my if statements with isdisjoint, which is a faster method. https://stackoverflow.com/a/24478521/8275288

Stoddard
  • 55
  • 1
  • 7
  • You could include a `break` after the `cond1 = cond2 = cond3 = 0` line. But I'm not sure I really grasp what you're looking for. Just a question for my understanding: you need three pairs of `y` to find a union (if all elements in `x` are not equal)? Because one must match the first and second, one must match the second and third and one must match the first and third? – MSeifert Jul 09 '17 at 21:44
  • yes, that's correct. – Stoddard Jul 09 '17 at 21:49
  • @Stoddard can elements in y serve as union pieces for more than one element in x ? – Rayhane Mama Jul 09 '17 at 21:52
  • @Rayhane yes, in theory. But the solution is where a single element in x has three simultaneous intersections in y fragments. – Stoddard Jul 09 '17 at 21:59
  • `x` is not 3d. It may represent coordinates of points in 3d space. – hpaulj Jul 09 '17 at 22:10
  • 1
    why call `union(y, x)` with `def union(x, y)` ??? and please show at least 2 matches and a fail in your example data – f5r5e5d Jul 10 '17 at 04:20
  • So are you merely interested in a binary test? Or do you want to find the indices of the matching 2 and 3 tuples? – Eelco Hoogendoorn Jul 10 '17 at 12:44
  • @Eelco - >99.999% of cases won't match. If one does, I would like the indices. – Stoddard Jul 10 '17 at 19:42
  • Would you like the 3 matching indices of the 2-tuples in y as well? – Eelco Hoogendoorn Jul 10 '17 at 20:15
  • @Eelco - I misunderstood your question. I am only interested in the values of [a,b,c] and not the indices or its position in either array. – Stoddard Jul 10 '17 at 23:46

3 Answers3

3

If you're just interested in the x "rows" that match your conditions you could use:

import numpy as np

def union(x, y):
    # Create a boolean mask for the columns of "x"
    res = np.ones(x.shape[0], dtype=bool)
    # Mask containing the "x" rows that have one "partial match"
    res_tmp = np.zeros(x.shape[0], dtype=bool)
    # Walk through the axis-combinations
    # you could also use Divakars "(x[:,:2], x[:,::2], x[:,1:])" here.
    for cols in (x[:, [0, 1]], x[:, [1, 2]], x[:, [0, 2]]):
        # Check each row of y if it has a partial match
        for y_row in y:
            res_tmp |= (y_row == cols).all(axis=1)
        # Update the overall mask and then reset the partial match mask
        res &= res_tmp
        res_tmp[:] = 0
    return res

x = np.array([[3, 4, 5], [5, 12, 13], [6, 8, 10], [7, 24, 25]])
y = np.array([[3, 4], [4, 5], [3, 5], [5, 12]])
mask = union(x, y)
print(mask)     # array([ True, False, False, False], dtype=bool)
print(x[mask])  # array([[3, 4, 5]])

Or for a different y:

y = np.array([[3, 4], [4, 5], [3, 5], [5, 12], [12, 13], [5, 13]])
mask = union(x, y)
print(x[mask])
# array([[ 3,  4,  5],
#        [ 5, 12, 13]])

It still has to loop twice but the inner operation y_row == x[:, ax] is vectorized. That should bring at least some (probably huge) speed improvement.

One could also vectorize the for y_row in y loop (using broadcasting), but if your x array and y are really big this wouldn't improve the performance much but it would use len(x) * len(y) memory (in some cases this could require more memory than you actually have - leading to an Exception or really poor performance because you fallback to swap memory).

MSeifert
  • 145,886
  • 38
  • 333
  • 352
  • The `res` and `x[res]` only produce an output when you're in an interactive prompt. In modules you have to use `print(res)` or `print(x[res])`. You could also wrap it as a function and return `res` or `x[res]` :) – MSeifert Jul 09 '17 at 22:19
  • That's the thing here, it would force copy there with `x[:, ax]`. Still a good thought! – Divakar Jul 09 '17 at 23:52
  • @Stoddard Could you clarify which example gives an incorrect union? I tried all 4 and they produce the correct unions. – MSeifert Jul 10 '17 at 09:03
  • res_tmp[:] = 0 did the trick to reset the logic. x of 4950 and y of 2000 takes 0.11 sec. – Stoddard Jul 10 '17 at 09:26
1

The numpy_indexed package (disclaimer: I am its author) can be used to create a fairly straightforward vectorized version of your original code, which should be much more efficient:

from functools import reduce
import numpy_indexed as npi

def contains_union(x, y):
    """Returns an ndarray with a bool for each element in x, 
    indicating if it can be constructed as a union of elements in y"""
    idx = [[0, 1], [1, 2], [0, 2]]
    y = npi.as_index(y)   # not required, but a performance optimization
    return reduce(np.logical_and, (npi.in_(x[:, i], y) for i in idx))
Eelco Hoogendoorn
  • 10,459
  • 1
  • 44
  • 42
  • should idx = [[0 1], [1, 2], [0, 2]]? Since I have all x and y sorted I can also optimize by stopping the search once y[0] > x[0]. – Stoddard Jul 10 '17 at 19:54
  • Instead of `functools.reduce` you could simply use `np.logical_and.reduce(...)` or am I wrong? – MSeifert Jul 10 '17 at 20:00
  • Didnt think of that, but yeah that should give the same result; personally i slightly prefer the functools solution though, since it avoids unncessary intermediates. – Eelco Hoogendoorn Jul 10 '17 at 20:02
  • 1
    I just tested it but it gives me `array([False, False, False, False], dtype=bool)` for `x = np.array([[3, 4, 5], [5, 12, 13], [6, 8, 10], [7, 24, 25]]); y = np.array([[3, 4], [4, 5], [3, 5], [5, 12]])` which seems wrong. – MSeifert Jul 10 '17 at 20:06
  • 1
    Good catch; npi.contains should have been npi.in_ – Eelco Hoogendoorn Jul 10 '17 at 20:09
  • x (96373 elements) and y (6754855 elements) returned results in 5.001435556994693 seconds wall-clock time (MacBook Pro)! – Stoddard Jul 11 '17 at 00:13
  • Glad to help; Out of curiousity, how does that timing compare to the non-vectorized version? – Eelco Hoogendoorn Jul 11 '17 at 11:54
  • Is there a way to make this work on x and y arrays of dtype=object, one where int values are > int64, eg 10000000000000000000012345 (25 digits)? – Stoddard Jul 15 '17 at 01:44
  • 1
    If such arrays of objects can be np.argsort-ed (not sure, give it a try), then first casting them to npi.LexIndex before feeding them to any other npi functions should do the trick, yes. – Eelco Hoogendoorn Jul 15 '17 at 07:02
0

if your x values have a max less than sqrt of the max integer representation (use int 64?), then a numerical trick may work

I used int(1e6) just as a readable example

import numpy

#rolled up all of the examples
x = numpy.array([[3, 4, 5], [5, 12, 13], [6, 8, 10], [7, 24, 25],
                [21, 220, 221],
                [43, 924, 925],
                [4, 8, 16], [8, 4, 16],
                [12, 13, 15], [12, 14, 15]]) #all examples

#and a numpy array of integer tuples of length 2:

y = numpy.array([[3, 4], [4, 5], [3, 5], [5, 12],
                [21, 220], [20, 21], [220,3021], [1220,3621], [60,221],
                [43, 924], [924, 1643], [924,4307], [72, 925],
                [4, 8], [8, 16], [4, 16],
                [12, 14], [12, 13], [12, 15], [14, 15]]) #all examples

#then make a couple of transform arrays

zx=numpy.array([[int(1e6), 1, 0],
                [0, int(1e6), 1],
                [int(1e6), 0, 1],
                ])
zy = numpy.array([[int(1e6)], [1]])


# and the magic is: np.intersect1d(zx @ x[ix], y @ zy)

# just to see part of what is being fed to intersect1d
print(zx @ x[0])
 [3000004 4000005 3000005]  

print(y[:4] @ zy)
[[3000004]
 [4000005]
 [3000005]
 [5000012]]


# if len of the intersection == 3 then you have your match

y_zy = y @ zy # only calc once
for ix in range(len(x)):
    matches = len(np.intersect1d(zx @ x[ix], y_zy))
    print(ix, matches, x[ix] if matches == 3 else '')
0 3 [3 4 5]
1 2 
2 0 
3 0 
4 1 
5 1 
6 3 [ 4  8 16]
7 2 
8 2 
9 3 [12 14 15]

I don't know intersect1d speed, but according to the doc it could be improved if the unique=True flag could be set, depends on your data

f5r5e5d
  • 3,656
  • 3
  • 14
  • 18
  • Nice approach, but why the grandiose 1e6? when a mask with zx = numpy.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]]) and zy = numpy.array([[1], [0]]) gives the same magic numbers for (zx @ x[0]) and (y[:4] @ zy – Stoddard Jul 10 '17 at 19:28
  • I am encoding the each of the [a,b], [b,c], [a,c] pairs as single ints, perhaps array mult is slower than masking but I expect comparison of 2 ints is faster than parsing 2 2 element arrays for equality – f5r5e5d Jul 10 '17 at 20:45