
I need to sample a bunch of pairs of points from an array. I want each pair to consist of two DISTINCT points, but points may be repeated across the various pairs.

e.g., if my array is X=np.array([1,1,2,3]), then:

>>> sample_pairs(X, n=4)
[[1,1], [2,3], [1,2], [1,3]]  # this is fine
>>> sample_pairs(X, n=4)
[[1,1], [2,2], [3,3], [1,3]]  # this is not okay

Is there a good way to accomplish this as a vectorized operation?

Gulzar
Him
  • The actual use case is that I'm trying to bootstrap the distribution of pairwise distances, without computing all of the pairwise distances, which would be `O(n^2)` – Him Jul 12 '18 at 15:56
  • Show us how you intend to use these pairs. Maybe even a sample calculation without the pruning. Often in `numpy` it's faster to do all calculations in one vectorized operation, rather than taking extra time to skip the redundant calculations. – hpaulj Jul 12 '18 at 16:05
  • I intend to sample a bunch of random pairs, compute the distances between those pairs, and then return the mean and stddev of the resulting list of distances. Note that I *could* compute the pairwise distances, but am hoping that I get a reasonable approximation of the mean and stddev of the pairwise distances in something like linear time. – Him Jul 12 '18 at 16:08
  • The goal with this question was to randomly choose pairs of points from `X` – Him Jul 12 '18 at 16:09
  • Seems simpler to compute the whole sample with replacement and then throw out same-point pairs if you don't want those. – user2357112 Jul 12 '18 at 16:27
  • @user2357112, this doesn't work, since if `X` has repeated points 1 and 1, then the pair `[1,1]` is fine. This is illustrated in the example. – Him Jul 12 '18 at 16:32
  • @user2357112, of course, this could be easily remedied by sampling the indices, instead of the actual points. – Him Jul 12 '18 at 16:34
  • In your `sample_pairs()`, you have `n`--should we assume that cannot be larger than the number of combinations? (And, does order matter?) – Brad Solomon Jul 12 '18 at 16:40
  • @BradSolomon order does not matter, and since the pairs are with replacement, then n can be arbitrarily large. However, the goal here was to do something faster than computing all of the pairwise distances. – Him Jul 12 '18 at 16:45
  • There's a huge difference between the relevance of different solutions for "n arbitrarily large" and "n = 4 (cf. the comments below)". – fuglede Jul 12 '18 at 17:30
  • Similar recent SO: [Sampling unique column indexes for each row of a numpy array](https://stackoverflow.com/questions/51279464/sampling-unique-column-indexes-for-each-row-of-a-numpy-array) – hpaulj Jul 12 '18 at 17:33
  • @fuglede, I was thrown off by the word "cannot". In my application, n will be sub O(N^2), where N == len(X). – Him Jul 12 '18 at 18:55
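The bootstrap use case described in the comments could be sketched roughly as follows. This is only a sketch: the function name and rejection strategy are illustrative, and a 1-D absolute difference stands in for whatever distance metric the real data uses.

```python
import numpy as np

def bootstrap_pair_distances(X, n_samples=10_000, rng=None):
    # Estimate the mean and stddev of pairwise distances from a random
    # sample of index pairs, without the O(N^2) cost of computing every
    # pairwise distance. (Illustrative name and rejection strategy; a
    # 1-D absolute difference stands in for a general distance metric.)
    rng = np.random.default_rng(rng)
    N = len(X)
    i = rng.integers(N, size=n_samples)
    j = rng.integers(N, size=n_samples)
    keep = i != j                        # drop pairs that reuse one point
    d = np.abs(X[i[keep]] - X[j[keep]])
    return d.mean(), d.std()

X = np.random.default_rng(0).normal(size=1000)
mean_d, std_d = bootstrap_pair_distances(X, rng=1)
```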

2 Answers


To sample a single pair without replacement, you can use np.random.choice:

np.random.choice(X, size=2, replace=False)

Alternatively, to sample multiple pairs at a time, note that all possible pairs can be indexed by the elements of range(len(X)*(len(X)-1)//2), and you can sample from that range using np.random.randint:

import itertools
import numpy as np

combs = np.array(list(itertools.combinations(X, 2)))
sample = np.random.randint(len(combs), size=10)
combs[sample]
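As an aside, the O(len(X)^2) allocation of combs can be avoided by decoding the flat combination index arithmetically. A sketch, where decode_pair is a hypothetical helper based on the standard triangular-number inversion (the float sqrt is fine for moderate len(X)):

```python
import numpy as np

def decode_pair(k, n):
    # Invert k = i*n - i*(i+1)//2 + (j - i - 1), mapping a flat index
    # k in [0, n*(n-1)//2) to a pair (i, j) with 0 <= i < j < n.
    i = np.floor(((2 * n - 1) - np.sqrt((2 * n - 1) ** 2 - 8 * k)) / 2).astype(np.intp)
    j = k - (i * n - i * (i + 1) // 2) + i + 1
    return i, j

X = np.array([1, 1, 2, 3])
n = len(X)
k = np.random.randint(n * (n - 1) // 2, size=10)
i, j = decode_pair(k, n)
pairs = np.stack((X[i], X[j]), axis=1)  # shape (10, 2)
```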

Following up on @user2357112's comment, and given (from the OP's own answer) that they do not appear to care whether the sample size itself is deterministic, note that sampling with the Mersenne Twister is slower than basic arithmetic operations. If X is so large that generating all the combinations is not feasible, a different solution would be:

sample = np.random.randint(len(X)**2, size=N)
i1 = sample // len(X)
i2 = sample % len(X)
X[np.vstack((i1, i2)).T[i1 != i2]]

This produces a sample whose average size is N * (1 - 1/len(X)).
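If a sample of exactly N pairs is required, one could top up the rejection step in a loop. A minimal sketch (sample_pairs is a hypothetical name, and len(X) >= 2 is assumed so the loop terminates):

```python
import numpy as np

def sample_pairs(X, n, rng=None):
    # Keep drawing flat indices into the len(X) x len(X) grid, dropping
    # same-index pairs, until n pairs of distinct indices are collected.
    # Assumes len(X) >= 2; otherwise the loop never terminates.
    rng = np.random.default_rng(rng)
    pairs = np.empty((0, 2), dtype=np.intp)
    while len(pairs) < n:
        s = rng.integers(len(X) ** 2, size=n)
        cand = np.stack((s // len(X), s % len(X)), axis=1)
        pairs = np.concatenate((pairs, cand[cand[:, 0] != cand[:, 1]]))
    return X[pairs[:n]]

print(sample_pairs(np.array([1, 1, 2, 3]), 6, rng=0))
```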

fuglede
  • This only samples one pair. I *could* sample `n` pairs and then `.vstack` them, but that seems inefficient. I was hoping that a single vector operation would get me the whole shebang. – Him Jul 12 '18 at 16:13
  • Fair enough; added a means of doing that as well. – fuglede Jul 12 '18 at 16:31
  • I upvoted, but wonder if there is a better way of getting multiple pairs than `combinations -> list`, because that can get out of hand pretty quickly for big arrays – Brad Solomon Jul 12 '18 at 16:37
  • `list(itertools.combinations(X,2))` is `O(n^2)`, which is even less efficient than the `.vstack` option.... – Him Jul 12 '18 at 16:38
  • @Scott Computational complexity isn't everything; the constant plays a huge role in vectorized operations, so you need to provide an example closer to your actual use case if that isn't good enough. – fuglede Jul 12 '18 at 16:40
  • @BradSolomon: You can provide the inverse mapping from `range(len(X)*(len(X)-1)/2)` to `range(len(X)) × range(len(X))` explicitly and avoid all allocation at the cost of computational time. (This also ends up being O(n), which the OP likes.) – fuglede Jul 12 '18 at 16:41
  • I would, but including the entire NHTS 2017 dataset in my question was unfortunately disallowed. As such, I created a fake dataset to illustrate the general idea. :) – Him Jul 12 '18 at 16:43
  • @Scott The exact data is unlikely to matter; simply knowing the sizes at play should suffice. – fuglede Jul 12 '18 at 16:45
  • Well, if the number of requested pairs `n` is, say, 4, I should be able to perform this in O(4) operations. The `itertools` solution requires O(N^2) operations, where `N` is the size of my entire dataset. I get that the O time isn't everything, but constant time vs quadratic time is quite a leap. – Him Jul 12 '18 at 17:01
  • Well, `[np.random.choice(X, size=2, replace=False) for _ in range(N)]` is `O(N)` but likely to be substantially slower than the second solution provided above (by a factor of 620 if I try it with `N = 1000`), so there's no leap there; if the ratio of array size to sample size is inverted, it's obviously a different story, hence the importance of the sizes involved. (Also, O(4) is not a meaningful expression.) – fuglede Jul 12 '18 at 17:09
  • O(4) is absolutely a meaningful expression. It so happens that O(4) == O(1), but I included the 4 for clarity. Actually, the first operation is O(n) (where n was 4 in my example). The second operation is O(n*N^2), where N is len(X). So the first operation is, indeed, constant in N, and the second is quadratic. If N is small, then the constant comes into play. However, my objective was to not need to compute the pairwise distances (which is O(N^2)). – Him Jul 12 '18 at 18:12
  • If you worry about the computational complexity of `len(X)` (where I had assumed that `n` is what you were worrying about), you also can not ignore that of the PRNG, whose complexity depends on the size of the range in which to generate numbers. – fuglede Jul 12 '18 at 19:01

Here is @user2357112's solution:

import numpy as np

def sample_indices(X, n=4):
    # Draw n flat indices into the (len(X) x len(X)) grid of index pairs.
    pair_indices = np.random.randint(X.shape[0]**2, size=n)
    pair_indices = np.hstack((
        (pair_indices // X.shape[0]).reshape((-1, 1)),
        (pair_indices % X.shape[0]).reshape((-1, 1)),
    ))
    # Keep only pairs of distinct indices (repeated values in X are fine).
    good_indices = pair_indices[:, 0] != pair_indices[:, 1]
    return X[pair_indices[good_indices]]
Him
    `np.random.randint(X.shape[0], size=(n, 2))` would likely be significantly faster as well as avoid an allocation in the dimension you appear to worry about. – fuglede Jul 12 '18 at 17:27
    Note also that this does not actually provide a sample of size `n`, as you throw away `sqrt(len(X))` of them on average. If that's not an issue, a faster solution would be to generate a sample `s = np.random.randint(len(X)**2, size=n)` and use `s // len(X)` and `s % len(X)` to provide the indices (since these simple operations are much faster than running the Mersenne Twister for the additional rounds, the speed-up being roughly a doubling). – fuglede Jul 12 '18 at 17:55
  • I will update this solution to reflect your excellent suggestions. – Him Jul 12 '18 at 17:58
    Where? (But yep, `sqrt(n)` should have been `1/len(X)` at least.) – fuglede Jul 12 '18 at 17:59
  • "where?"... NM. I was confused. :) – Him Jul 12 '18 at 18:01