How to find missing combinations/sequences in a 2D array with finite element values

Question

In the case of the set np.array([1, 2, 3]), there are only 9 possible combinations/sequences of its constituent elements: [1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [3, 1], [3, 2], [3, 3].

If we have the following array:

np.array([1, 1],
         [1, 2],
         [1, 3],
         [2, 2],
         [2, 3],
         [3, 1],
         [3, 2])

What is the best way, with NumPy/SciPy, to determine that [2, 1] and [3, 3] are missing? Put another way, how do we find the inverse list of sequences (when we know all of the possible element values)? Manually doing this with a couple of for loops is easy to figure out, but that would negate whatever speed gains we get from using NumPy over native Python (especially with larger datasets).

I assume, the word set indicates that all elements are unique. Is this true? Is the numpy array sorted or is this just an unrepresentative example? — Mr. T, Feb 18 '18 at 15:30
@MrT, you're correct. By "set", I mean a list of unique elements (where the order doesn't matter). As for the target array, no, one can't assume it comes sorted. Sorry, I forgot to mention this. — CircleSquared, Feb 18 '18 at 15:36
Broadcast the *unknown* array across the *known* array in a third dimension in a comparison statement and use numpy.all() and numpy.any() to see if all the combinations exist in the *unknown* array. You could also use the numpy.any() portion as a mask to the *known* array to see which ones are present/missing. — wwii, Feb 18 '18 at 15:43
@CircleSquared, did one of the below solutions help? If so, feel free to accept an answer (tick on left). — jpp, Feb 20 '18 at 00:13

score 2 · Answer 1 · answered Feb 18 '18 at 16:40

Your can generate a list of all possible pairs using itertools.product and collect all of them which are not in your array:

from itertools import product

pairs = [ [1, 1], [1, 2], [1, 3], [2, 2], [2, 3], [3, 1], [3, 2] ]
allPairs = list(map(list, product([1, 2, 3], repeat=2)))
missingPairs = [ pair for pair in allPairs if pair not in pairs ]
print(missingPairs)

Result:

[[2, 1], [3, 3]]

Note that map(list, ...) is needed to convert your list of list to a list of tuples that can be compared to the list of tuples returned by product. This can be simplified if your input array already was a list of tuples.

jpp · Answer 2 · 2018-02-18T17:30:07.763

2

This is one way using itertools.product and set.

The trick here is to note that sets may only contain immutable types such as tuples.

import numpy as np
from itertools import product

x = np.array([1, 2, 3])

y = np.array([[1, 1], [1, 2], [1, 3], [2, 2],
              [2, 3], [3, 1], [3, 2]])

set(product(x, repeat=2)) - set(map(tuple, y))

{(2, 1), (3, 3)}

edited Feb 18 '18 at 17:30

answered Feb 18 '18 at 16:49

jpp

159,742
34
281
339

1

I like the idea to use `set`. You can greatly simplify the line of code writing `set(product(x, repeat=2)) - set(map(tuple, y))` – Falko Feb 18 '18 at 17:23

score 1 · Answer 3 · answered Feb 18 '18 at 19:11

a = np.array([1, 2, 3])
b = np.array([[1, 1],
             [1, 2],
             [1, 3],
             [2, 2],
             [2, 3],
             [3, 1],
             [3, 2]])

c = np.array(list(itertools.product(a, repeat=2)))

If you want to use numpy methods, try this...

Compare the array being tested against the product using broadcasting

d = b == c[:,None,:]
#d.shape is (9,7,2)

Check if both elements of a pair matched

e = np.all(d, -1)
#e.shape is (9,7)

Check if any of the test items match an item of the product.

f = np.any(e, 1)
#f.shape is (9,)

Use f as a boolean index into the product to see what is missing.

>>> print(c[np.logical_not(f)])
[[2 1]
 [3 3]]
>>>

score 1 · Accepted Answer · answered Feb 19 '18 at 08:37

1

If you want to stay in numpy instead of going back to raw python sets, you can do it using void views (based on @Jaime's answer here) and numpy's built in set methods like in1d

def vview(a):
    return np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))

x = np.array([1, 2, 3])

y = np.array([[1, 1], [1, 2], [1, 3], [2, 2],
              [2, 3], [3, 1], [3, 2]])


xx = np.array([i.ravel() for i in np.meshgrid(x, x)]).T

xx[~np.in1d(vview(xx), vview(y))]

array([[2, 1],
       [3, 3]])

answered Feb 19 '18 at 08:37

Daniel F

13,620
2
29
55

Thank you very much for taking the time. Not only do both parts of your answer (generating xx and comparing it against y) stay inside NumPy, timeit() shows they're *significantly* faster than the other options here. In fact, both aspects of your solution appeared to be over an order of magnitude faster than many of the other possibilities when I scaled my y up to hundreds of thousands of sequences and x to over 1 million possibilities. Well written as the other options here may be, Python has its bottlenecks. – CircleSquared Mar 05 '18 at 03:03
Glad you had enough memory for that! Although I guess since the rest were turning `itertools` products into lists (instead of iterating them) they'd have the same memory problems – Daniel F Mar 05 '18 at 05:37
Actually, I wasn't using much RAM. (4 elements)^(10 places) = 1048576 possibilities. That times 10 (for each digit) times 8 bytes (for 64 bit integers) gives a little under 84 MiB. Similarly, 10 digits sequences * 100 thousand * 8 bytes = 8 MiB. NumPy arrays have an overhead of a few tens of bytes, but that's insignificant on these scales. What's killer is execution time. – CircleSquared Mar 05 '18 at 12:00
By the way, for sets with a small enough number of possible elements, it looks like mapping the digits to characters can cut RAM usage by half. Of course, there's no improvement in execution time, as the number of possibilities stay the same. – CircleSquared Mar 05 '18 at 12:59

score -1 · Answer 5 · answered Feb 19 '18 at 09:03

Every combination corresponds to the number in range 0..L^2-1 where L=len(array). For example, [2, 2]=>3*(2-1)+(2-1)=4. Off by -1 arises because elements start from 1, not from zero. Such mapping might be considered as natural perfect hashing for this data type.

If operations on integer sets in NumPy are faster than operations on pairs - for example, integer set of known size might be represented by bit sequence (integer sequence) - then it is worth to traverse pair list, mark corresponding bits in integer set, then look for unset ones and retrieve corresponding pairs.

Downvoter - some arguments could help me to realize the fallacies and comprehend a bit of wisdom... — MBo, Feb 25 '18 at 03:28

How to find missing combinations/sequences in a 2D array with finite element values

5 Answers5