7

How to sample without replacement in TensorFlow? Like numpy.random.choice(n, size=k, replace=False) for some very large integer n (e.g. 100k-100M), and smaller k (e.g. 100-10k). Also, I want it to be efficient and on the GPU, so other solutions like this with tf.py_func are not really an option for me. Anything which would use tf.range(n) or so is also not an option because n could be very large.

Albert
  • I've undeleted my answer with another method. – jdehesa Mar 29 '18 at 14:55
  • So `n` is very large. How is the size of `k`, the number of elements you need to sample, compared to `n`? – mikkola Mar 29 '18 at 17:54
  • @mikkola E.g. `n` is 100k-100M or so, and `k` maybe 100-10k. – Albert Mar 29 '18 at 18:15
  • I just stumbled upon this blog post: [Tim Vieira, Algorithms for sampling without replacement, 2019](https://timvieira.github.io/blog/post/2019/09/16/algorithms-for-sampling-without-replacement/) – Albert May 01 '23 at 21:52

3 Answers

6

This is one way:

n = ...
sample_size = ...
idx = tf.random_shuffle(tf.range(n))[:sample_size]

EDIT:

I had posted the answer above but then read the last line of your post. I don't think there is a good way to do it if you absolutely cannot produce a tensor of size O(n) (numpy.random.choice with replace=False is also implemented as a slice of a permutation). You could resort to a tf.while_loop until you have unique indices:

n = ...
sample_size = ...
# Start from one batch of candidate indices
idx = tf.random_uniform([sample_size], maxval=n, dtype=tf.int64)
# Redraw the whole batch while it still contains duplicates
idx = tf.while_loop(
    lambda i: tf.size(i) > tf.size(tf.unique(i)[0]),
    lambda i: tf.random_uniform([sample_size], maxval=n, dtype=tf.int64),
    [idx])

EDIT 2:

About the average number of iterations in the previous method: if we call n the number of possible values and k the length of the desired vector (with k ≤ n), the probability that an iteration succeeds is:

p = product((n - (i - 1)) / n for i in 1 .. k)

Since each iteration can be considered a Bernoulli trial, the average number of trials until the first success is 1 / p (proof here). Here is a function that calculates the average number of trials in Python for some k and n values:

def avg_iter(k, n):
    if k > n or n <= 0 or k < 0:
        raise ValueError()
    avg_it = 1.0
    for p in (float(n) / (n - i) for i in range(k)):
        avg_it *= p
    return avg_it

And here are some results:

+-------+------+----------+
|   n   |  k   | Avg iter |
+-------+------+----------+
|    10 |    5 | 3.3      |
|   100 |   10 | 1.6      |
|  1000 |   10 | 1.1      |
|  1000 |  100 | 167.8    |
| 10000 |   10 | 1.0      |
| 10000 |  100 | 1.6      |
| 10000 | 1000 | 2.9e+22  |
+-------+------+----------+
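
A couple of those rows can be spot-checked directly with the avg_iter function above (the printed values are approximate):

print(avg_iter(5, 10))        # ~3.3
print(avg_iter(100, 1000))    # ~167.8
print(avg_iter(1000, 10000))  # ~2.9e+22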

You can see it varies wildly depending on the parameters.

EDIT 3:

It is possible, though, to construct the sample in a fixed number of steps, although the only algorithm I can think of is O(k²). In pure Python it goes like this:

import random

def sample_wo_replacement(n, k):
    sample = [0] * k
    # Draw the i-th value uniformly from a range that shrinks by one each step
    for i in range(k):
        sample[i] = random.randint(0, n - 1 - i)
    # Increment each draw past every earlier raw draw that is <= it,
    # so the final values are all distinct
    for i, v in reversed(list(enumerate(sample))):
        for p in reversed(sample[:i]):
            if v >= p:
                v += 1
        sample[i] = v
    return sample

random.seed(100)
print(sample_wo_replacement(10, 5))
# [2, 8, 9, 7, 1]
print(sample_wo_replacement(10, 10))
# [6, 5, 8, 4, 0, 9, 1, 2, 7, 3]

This is a possible way to do it in TensorFlow (not sure if the best one):

import tensorflow as tf

def sample_wo_replacement_tf(n, k):
    # First loop
    sample = tf.constant([], dtype=tf.int64)
    i = 0
    sample, _ = tf.while_loop(
        lambda sample, i: i < k,
        # This is ugly but I did not want to define more functions
        lambda sample, i: (tf.concat([sample,
                                      tf.random_uniform([1], maxval=tf.cast(n - tf.shape(sample)[0], tf.int64), dtype=tf.int64)],
                                     axis=0),
                           i + 1),
        [sample, i], shape_invariants=[tf.TensorShape((None,)), tf.TensorShape(())])
    # Second loop
    def inner_loop(sample, i):
        sample_size = tf.shape(sample)[0]
        v = sample[i]
        j = i - 1
        v, _ = tf.while_loop(
            lambda v, j: j >= 0,
            lambda v, j: (tf.cond(v >= sample[j], lambda: v + 1, lambda: v), j - 1),
            [v, j])
        return (tf.where(tf.equal(tf.range(sample_size), i), tf.tile([v], (sample_size,)), sample), i - 1)
    i = tf.shape(sample)[0] - 1
    sample, _ = tf.while_loop(lambda sample, i: i >= 0, inner_loop, [sample, i])
    return sample

And an example:

with tf.Graph().as_default(), tf.Session() as sess:
    tf.set_random_seed(100)
    sample = sample_wo_replacement_tf(10, 5)
    for i in range(10):
        print(sess.run(sample))
# [3 0 6 8 4]
# [5 4 8 9 3]
# [1 4 0 6 8]
# [8 9 5 6 7]
# [7 5 0 2 4]
# [8 4 5 3 7]
# [0 5 7 4 3]
# [2 0 3 8 6]
# [3 4 8 5 1]
# [5 7 0 2 9]

This is quite intensive on tf.while_loops, though, which are well known not to be particularly fast in TensorFlow, so I don't know how fast this method can really be without some kind of benchmarking.


EDIT 4:

One last possible method. You can divide the range of possible values (0 to n) into "chunks" of size c and pick a random number of values from each chunk, then shuffle everything. The amount of memory you use is limited by c, and you don't need nested loops. If n is divisible by c, you should get an essentially perfect random distribution; otherwise, values in the last "short" chunk receive some extra probability (this may be negligible depending on the case). Here is a NumPy implementation. It is somewhat long in order to account for different corner cases and pitfalls, but if c ≥ k and n mod c = 0, several parts get simplified.

import numpy as np

def sample_chunked(n, k, chunk=None):
    chunk = chunk or n
    last_chunk = chunk
    parts = n // chunk
    # Distribute k among chunks
    max_p = min(float(chunk) / k, 1.0)
    max_p_last = max_p
    if n % chunk != 0:
        parts += 1
        last_chunk = n % chunk
        max_p_last = min(float(last_chunk) / k, 1.0)
    p = np.full(parts, 2)
    # Iterate until a valid distribution is found
    while not np.isclose(np.sum(p), 1) or np.any(p > max_p) or p[-1] > max_p_last:
        p = np.random.uniform(size=parts)
        p /= np.sum(p)
    dist = (k * p).astype(np.int64)
    sample_size = np.sum(dist)
    # Account for rounding errors
    while sample_size < k:
        i = np.random.randint(len(dist))
        while (dist[i] >= chunk) or (i == parts - 1 and dist[i] >= last_chunk):
            i = np.random.randint(len(dist))
        dist[i] += 1
        sample_size += 1
    while sample_size > k:
        i = np.random.randint(len(dist))
        while dist[i] == 0:
            i = np.random.randint(len(dist))
        dist[i] -= 1
        sample_size -= 1
    assert sample_size == k
    # Generate sample parts
    sample_parts = []
    for i, v in enumerate(np.nditer(dist)):
        if v <= 0:
            continue
        c = chunk if i < parts - 1 else last_chunk
        base = chunk * i
        sample_parts.append(base + np.random.choice(c, v, replace=False))
    sample = np.concatenate(sample_parts, axis=0)
    np.random.shuffle(sample)
    return sample

np.random.seed(100)
print(sample_chunked(15, 5, 4))
# [ 8  9 12 13  3]
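
As a quick sanity check (hypothetical parameters, just verifying that the output is a valid sample without replacement):

s = sample_chunked(1000, 50, chunk=64)
assert len(s) == 50 and len(set(s)) == 50   # 50 distinct values
assert all(0 <= v < 1000 for v in s)        # all within [0, 1000)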

A quick benchmark of sample_chunked(100000000, 100000, 100000) takes about 3.1 seconds on my computer, whereas I haven't been able to run the previous algorithm (the sample_wo_replacement function above) to completion with the same parameters. It should be possible to implement it in TensorFlow, maybe using tf.TensorArray, although it would require significant effort to get it exactly right.

jdehesa
  • Yes, that's one possible solution, although I'm not really sure whether it's the best option. And I wonder, depending on `sample_size`/`k` and `n`, how many loop iterations it needs on average. Do you know? It should be possible to calculate the expected value of that. Btw, in my question, I do not have `my_tensor` at all (which would also not be possible to have if `n` is too large). I'm just interested in what you called `idx[:sample_size]`. – Albert Mar 29 '18 at 18:18
  • @Albert I've extended the answer with further details about the repeated random method and another possible method with a fixed number of steps (although I'm not sure how fast it can really be). – jdehesa Apr 02 '18 at 17:39
  • Yes, it's easy to come up with O(k^2) solutions. But if possible, I want to avoid O(k^2). Only if you can prove that there is no solution faster than O(k^2) will I accept it. – Albert Apr 03 '18 at 09:11
  • @Albert Well it would have been useful if you had explicitly stated in your question what you had already considered and why it wouldn't work for you (I've posted three methods and you hinted only at the first one in your post, but then said you had already considered the other two). I don't know about a proof, but I very much doubt O(_k_) is possible (at least keeping true uniform distribution), since libraries like NumPy implement this with a permutation of _n_. – jdehesa Apr 03 '18 at 09:25
  • @Albert New edit with yet another algorithm. I think this should suit your needs, although I have only worked out a NumPy implementation... – jdehesa Apr 03 '18 at 11:01
3

Use the Gumbel-max trick described here: https://github.com/tensorflow/tensorflow/issues/9260

z = -tf.log(-tf.log(tf.random_uniform(tf.shape(logits), 0, 1)))
_, indices = tf.nn.top_k(logits + z, K)

`indices` is what you want. This trick is really simple!
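
For reference, a minimal self-contained sketch of the trick (the sizes `n`, `K` and the uniform `logits` below are illustrative assumptions, not part of this answer; TF 1.x graph API):

import tensorflow as tf

n, K = 20, 5
logits = tf.zeros([n])  # uniform (unnormalized) log-probabilities, just for illustration
# Gumbel noise: -log(-log(U)) with U ~ Uniform(0, 1)
z = -tf.log(-tf.log(tf.random_uniform(tf.shape(logits), 0, 1)))
# The top-K entries of (logits + Gumbel noise) give K distinct indices,
# sampled without replacement with probabilities proportional to softmax(logits)
_, indices = tf.nn.top_k(logits + z, K)

with tf.Session() as sess:
    print(sess.run(indices))  # K distinct indices in [0, n)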

Xiang Qi
  • Thanks for the link, this is interesting. Also [this](https://timvieira.github.io/blog/post/2014/08/01/gumbel-max-trick-and-weighted-reservoir-sampling/) has some more information. However, the logits / probabilities in my case are uniform, and as I said in my question, I cannot create such an array, because `n` could be very large. So this is not an option. – Albert Mar 10 '19 at 00:07
  • When N is huge, maybe we should create the logits and z matrix as SparseTensor, and also implement top_k() and random_uniform() working with SparseTensor. – Xiang Qi Mar 25 '20 at 03:52
  • @Albert Check out the answer by nokyotsu here: https://github.com/tensorflow/tensorflow/issues/9260. That samples without replacement, assuming uniform probabilities, without creating a tensor of shape n. – Brenden Petersen Oct 21 '20 at 18:05
0

The following works fairly fast on the GPU, and I did not encounter memory issues when using n ~ 100M and k ~ 10k (on an NVIDIA GeForce GTX 1080 Ti):

def random_choice_without_replacement(n, k):
  """equivalent to 'numpy.random.choice(n, size=k, replace=False)'"""
  return tf.math.top_k(tf.random.uniform(shape=[n]), k, sorted=False).indices
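
For example (hypothetical sizes; TF 2.x eager execution assumed):

idx = random_choice_without_replacement(1000000, 100)
print(idx.shape)                            # (100,)
print(tf.size(tf.unique(idx)[0]).numpy())   # 100 -- all indices are distinct

Note that this still materializes a length-n tensor of uniform values (roughly 400 MB of float32 for n ≈ 100M), so memory is O(n); compared to shuffling tf.range(n), it avoids the explicit permutation but not the O(n) allocation.
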
Amit Portnoy