
I feel like this one should be easy, but after numerous searches and attempts I can't figure out an answer. Basically, I have a very large number of items that I want to sample in a random order without replacement. In this case they are cells in a 2D array. The solution I would use for a smaller array doesn't translate, because it requires shuffling an in-memory array. If the number I had to sample were small, I could also just randomly sample items and keep a list of the values I'd tried. Unfortunately, I will often have to sample a very large proportion of all the cells, possibly all of them.

What I'd like to create is an iterator, using some combination of itertools, numpy, and/or random, that yields the next random cell (x and y indices). Another possible solution would be an iterator that yields the next random number (without replacement) between 0 and (x_count * y_count), which I could map back to a cell location. Neither seems easily accomplished.

Thanks for any suggestions!

Here's my current solution.

import numpy as np
import itertools as itr
import random as rdm

#works great
x_count = 10
y_count = 5

#good luck!
#x_count = 10000
#y_count = 20000

x_indices = np.arange(x_count)
y_indices = np.arange(y_count)

cell_indices = itr.product(x_indices, y_indices)
list_cell_indices = list(cell_indices)
rdm.shuffle(list_cell_indices)

for i in range(25):
    print list_cell_indices[i]

So, based on the current responses and my attempt to translate Perl (which I know nothing about), my understanding is that the best I can do is the following:

import numpy as np
import itertools as itr
import random as rdm

x_count = 10000
y_count = 5000

sample_count = 10000
keep_probability = 0.01


tried_cells = set()
kept_cells = set()

while len(kept_cells) < sample_count:
    # randrange excludes the upper bound, so the indices stay inside the grid
    x = rdm.randrange(x_count)
    y = rdm.randrange(y_count)

    if (x, y) not in tried_cells:
        tried_cells.add((x, y))
        if rdm.random() < keep_probability:
            kept_cells.add((x, y))


print "worked"

In most cases the processing time and memory use aren't that bad. Maybe I could check the average cell keep_probability against sample_count and raise an error for difficult cases.
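For example, a rough check along those lines might look like this (a sketch only, reusing the variables from the snippet above; the one-third cutoff is an arbitrary rule of thumb):

# Hypothetical sanity check: estimate the expected number of draws needed and
# refuse to use rejection sampling when that approaches the size of the grid.
expected_draws = sample_count / keep_probability
if expected_draws > (x_count * y_count) / 3.0:
    raise ValueError("expected draws are too close to the grid size; "
                     "shuffle or pre-filter instead of rejection sampling")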

Colin Talbert
  • If you have to sample all of them, why are you asking about this at all? (Referring to "as many as all".) – zortacon May 23 '12 at 19:29
  • If you have 10K x 20K cells, and you have to sample them all, randomly, and without replacement, I don't think there's a way around requiring `O(x_count * y_count)` memory. – NPE May 23 '12 at 19:33
  • How good does the randomness have to be? Statistically random? Cryptographically secure? You could create a random-ish permutation from `range(x_count * y_count)` to `range(x_count * y_count)` and then just evaluate that for as many integers as you need. See http://stackoverflow.com/questions/3131193/symmetric-bijective-algorithm-for-integers (a minimal sketch of this idea appears after these comments). – Mark Dickinson May 23 '12 at 19:53
  • In most cases I don't have to sample all, or even all that many: say 10,000 out of 100,000,000. What I'm doing is checking whether each cell meets a condition. If it doesn't, I throw it out and get the next one. In some cases the condition is improbable and I end up throwing out many, many cells. The worst case is that the condition is so improbable that I end up throwing out all my cells and raising an error. – Colin Talbert May 23 '12 at 21:46
  • The randomness has to be statistically sound. I looked at your cross-reference, Mark, but honestly it was a bit over my head. – Colin Talbert May 23 '12 at 22:15
  • The solution I ended up using was a combination of the three non-Perl answers. Basically, I can determine how many points I'll need to sample to get my desired sample size by calculating the average probability of success across all cells. If this number is less than a third of the total, I'll use my answer above. If it is greater than a third, I'll apply the function to all the cells and take a random sample of just those that pass, which is akin to what Blckknght suggests. Thanks for all your help; too bad I couldn't accept the answer from all three of you. – Colin Talbert May 31 '12 at 22:10
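A minimal sketch of the bijective-permutation idea from Mark Dickinson's comment above (the multiplier and offset below are arbitrary illustrative choices; the order produced is deterministic and highly structured, not statistically random, so treat it as an illustration only):

def affine_permutation(n, a, b):
    # i -> (a*i + b) % n is a permutation of range(n) whenever gcd(a, n) == 1,
    # and it needs no memory beyond the loop counter.
    for i in xrange(n):
        yield (a * i + b) % n

# e.g. visit flat cell indices in a scrambled, deterministic order
# (982451653 is odd and not a multiple of 5, so it is coprime to x_count * y_count):
# for flat in affine_permutation(x_count * y_count, 982451653, 12345):
#     x, y = divmod(flat, y_count)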

5 Answers


I believe that, for sample sizes close to R * C, there is simply no way to sample a sequence without replacement without using a significant amount of auxiliary storage. And while there are clever ways to reduce the amount of storage for small sample sizes, if you expect to be sampling more than about a third of your dataset, you're better off just creating a separate list. random.sample is a natural choice for this purpose, and frankly I would just pass a flattened version of your 2-D numpy array straight to it. (Unless you need the indices too, in which case randomly sampling ints and translating them into coordinates, a la hexparrot's solution, is a reasonable way to go.)

>>> a = numpy.arange(25).reshape((5, 5))
>>> random.sample(a.ravel(), 5)
[0, 13, 8, 18, 4]

If you look at the implementation of random.sample, you'll see that for smaller sample sizes it already does roughly what the Perl code in zortacon's answer does -- it tracks previously selected items in a set and discards selections that are already in the set. For larger sample sizes, it creates a copy of the input -- which is more memory efficient than a set for larger values, because sets take up more space than lists per stored item -- and does a slightly modified Fisher-Yates shuffle, stopping when it has sample_size items (i.e. it doesn't shuffle the whole list, so it's more efficient than shuffling the whole thing yourself).
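For illustration only, here is a minimal sketch of that stop-early Fisher-Yates idea (this is not random.sample's actual code, just the shape of the algorithm):

import random

def partial_fisher_yates(population, k):
    # Copy the population (O(N) memory, as noted above), then swap each of the
    # first k slots with a uniformly chosen element from the remaining tail.
    # After k swaps the first k slots hold a uniform random sample.
    pool = list(population)
    n = len(pool)
    for i in xrange(k):
        j = random.randrange(i, n)          # uniform over positions i..n-1
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]

Applied to flat indices, partial_fisher_yates(xrange(x_count * y_count), 25) would yield 25 distinct flat cells, at the cost of materializing the full list.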

Basically, my bet is that, rolling your own, you won't do better than random.sample unless you code something up in C.

However -- I did find this, which you might find interesting: numpy.random.choice. This appears to do random sampling with or without replacement at C speed. The catch? It's new in Numpy 1.7!
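For example (assuming NumPy 1.7+; note that, at least in early versions, replace=False still permutes the whole index range internally, so it doesn't escape the O(N) memory cost):

import numpy as np

x_count, y_count = 10000, 20000
sample_size = 25

# Draw distinct flat indices without replacement, then map each one back to a
# 2-D cell with divmod (here flat = x * y_count + y).
flat_indices = np.random.choice(x_count * y_count, size=sample_size, replace=False)
for flat in flat_indices:
    x, y = divmod(int(flat), y_count)
    print x, y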

senderle

How about this approach: I first create the x*y array and reshape it to 2-D. Then, knowing that each cell can be uniquely identified by a single integer, I take a sample from 0 to (x*y).

import numpy

x_count = 10000
y_count = 20000

x_indices = numpy.arange(x_count)
y_indices = numpy.arange(y_count)

large_table = numpy.arange(y_count * x_count).reshape(y_count, x_count)
print large_table

def get_random_item(sample_size):
    from random import sample
    # sample() draws distinct flat indices; divmod maps each one back to 2-D
    for i in sample(xrange(y_count * x_count), sample_size):
        y, x = divmod(i, y_count)
        yield (x, y)

for x,y in get_random_item(10):
    print '%12i   x: %5i y: %5i' % (large_table[x][y],  x,y)

Which returns:

(first, the full table is printed to simulate the existing 2-D array you created via product)

[[        0         1         2 ...,      9997      9998      9999]
 [    10000     10001     10002 ...,     19997     19998     19999]
 [    20000     20001     20002 ...,     29997     29998     29999]
 ..., 
 [199970000 199970001 199970002 ..., 199979997 199979998 199979999]
 [199980000 199980001 199980002 ..., 199989997 199989998 199989999]
 [199990000 199990001 199990002 ..., 199999997 199999998 199999999]]

Then it returns the 2-D coordinates, which can be translated into your cell contents simply via array[x][y]:

   154080675   x: 15408 y:   675
   186978188   x: 18697 y:  8188
   157506087   x: 15750 y:  6087
   168859259   x: 16885 y:  9259
    29775768   x:  2977 y:  5768
    94167866   x:  9416 y:  7866
    15978144   x:  1597 y:  8144
    91964007   x:  9196 y:  4007
   163462830   x: 16346 y:  2830
    62613129   x:  6261 y:  3129

The documentation for sample() states that it is 'used for random sampling without replacement', and this approach follows the advice on the Python random page: 'This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).'

I note that while I use get_random_item() as a generator, the underlying sample() may still build a full list of the population (see the comment below), so memory use can still be on the order of y*x + sample_size; it runs quite swiftly all the same.

hexparrot
  • FYI, once the sample size increases beyond a certain size (dictated by the relative size of an n-item list vs. a k-item set), [`random.sample`](http://hg.python.org/cpython/file/2.7/Lib/random.py#l290) _also_ copies the entire population to be sampled into a list. So this is likely to be inefficient for large n and k. – senderle May 23 '12 at 23:12
  • Thanks for the information; I wasn't aware of that implementation detail. – hexparrot May 23 '12 at 23:31
  • But then, +1 -- there's no better answer than `random.sample` in some form, imo -- apart from [numpy.random.choice](http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html), new in Numpy 1.7 – senderle May 23 '12 at 23:32

You're already using O(N = R*C) memory, so you might as well use O(N) memory for your iterator. Copy all the elements and shuffle them, as you would for the one-dimensional case. This is only a reasonable thing to do if you are going to visit each element, which you say is your case.

(For the record: otherwise, memory isn't such a bad issue, because you only need to "remember" where you've already been. Thus you can keep a list of indices that you've already visited. This is bad if you do plan to visit each element, because rejection sampling can take a very long time when implemented with a growing blacklist. (You can also implement it with a shrinking whitelist, which is equivalent to the first solution.))
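A minimal sketch of the shuffle-everything approach from the first paragraph, applied to flat indices (assuming NumPy is available; the permutation array itself is the O(N) cost):

import numpy as np

x_count, y_count = 1000, 2000      # smaller than the question's grid, for a quick demo

# Shuffle every flat index once, then walk the shuffled order, converting each
# flat index back to a cell as you go.
order = np.random.permutation(x_count * y_count)
for flat in order[:25]:
    x, y = divmod(int(flat), y_count)
    print x, y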

ninjagecko

You've said that your test is likely to fail for most of the cells in your grid. If that's the case, randomly sampling the cells may not be a good idea, since you'll have a hard time keeping track of the cells you've already checked without using a lot of memory.

Instead, you might be better off applying your test to the whole grid and selecting a random element from those that pass it.

This function returns a random element that passes the test (or None if they all fail). It uses very little memory because it keeps only the current candidate and a count of how many elements have passed so far; this is reservoir sampling with a reservoir of size one.

import random

def findRandomElementThatPasses(iterable, testFunc):
    value = None
    passed = 0

    for element in iterable:
        if testFunc(element):
            passed += 1
            # keep the i-th passing element with probability 1/i, so every
            # passing element is equally likely to be the one returned
            if random.random() < 1.0/passed:
                value = element

    return value

You'd call it with something like this (the extra parentheses are needed because a generator expression that isn't the sole argument must be parenthesized):

e = findRandomElementThatPasses(((x, y) for x in xrange(X_SIZE)
                                        for y in xrange(Y_SIZE)),
                                someFunctionTakingAnXYTuple)

If you're using Python 3, use range instead of xrange.
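Since the question actually needs sample_count cells rather than one, a sketch of the same reservoir idea extended to keep up to k passing elements in one pass (a standard generalization, not part of the answer above) might look like:

import random

def findRandomElementsThatPass(iterable, testFunc, k):
    # Reservoir sampling: fill the reservoir with the first k passing
    # elements, then let the i-th passing element replace a random slot with
    # probability k / i.  Every passing element is kept with equal probability.
    reservoir = []
    passed = 0
    for element in iterable:
        if testFunc(element):
            passed += 1
            if len(reservoir) < k:
                reservoir.append(element)
            else:
                j = random.randrange(passed)     # uniform integer in [0, passed)
                if j < k:
                    reservoir[j] = element
    return reservoir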

Blckknght

In Perl you could do something like this:

# how many you have and how many you need
$xmax = 10000000;
$ymax = 10000000;
$sampleSize = 10000;

# build a hash (a dictionary in Python)
$cells = {};
$count = 0;
while ($count < $sampleSize) {
  $x = int(rand($xmax));   # rand() returns a float, so truncate to an integer index
  $y = int(rand($ymax));
  if (!$cells->{$x}->{$y}) {
    $cells->{$x}->{$y}++;
    $count++;
  }
}
# now grab the stuff
foreach my $x (keys %{$cells}) {
  foreach my $y (keys %{$cells->{$x}}) {
    getSample($x, $y);
  }
}

No duplicates, pretty random, not too hard on memory.

zortacon
  • To quote the OP: *Unfortunately I often will have to sample a very large proportion of all the cells, as many as all.* – NPE May 23 '12 at 19:39
  • "Pretty random" is a *very* dangerous statement. Do you have any metric of whether your sampling distribution is indeed flat? – ev-br May 23 '12 at 19:43
  • @Zhenya: As long as `rand()` is uniform, I don't think there is anything wrong with the distributional properties of this solution. Do you have any reasons to think otherwise? – NPE May 23 '12 at 19:46
  • -0. Not sure what the point is of a perl answer when it's a python question. – Steven Rumbalski May 23 '12 at 19:50
  • @aix: Yes, I have. I can't comment on the Perl code; I simply don't know the language. That's why it's a question: are you sure? There are numerous examples of simple hacks or tricks that completely ruin even a good random number generator, often in subtle ways. Hence I'd advocate a rule of thumb: if you're doing anything non-trivial with rand(), at the very least have a look at the distribution. And yes, default rand()s are very often very bad. – ev-br May 23 '12 at 19:57