I feel like this one should be easy but after numerous searches and attempts I can't figure an answer out. Basically I have a very large number of items that I want to sample in a random order without replacement. In this case they are cells in a 2D array. The solution that I would use for a smaller array doesn't translate because it requires shuffling an in memory array. If the number I had to sample was small I could also just randomly sample items and keep a list of the values I'd tried. Unfortunately I often will have to sample a very large proportion of all the cells, as many as all.
What I'd like to create is an iterator using some combination of itertools, numpy and/or random that yields the next random cell (x and y indices). Another possible solution would be to create an iterator that would yield the next random number (without replacement) between 0 and (x_count * y_count) which I could map back to a cell location. Neither of which seems easily accomplished.
Thanks for any sugestions!
Here's my current solution.
import numpy as np
import itertools as itr
import random as rdm
#works great
x_count = 10
y_count = 5
#good luck!
#x_count = 10000
#y_count = 20000
x_indices = np.arange(x_count)
y_indices = np.arange(y_count)
cell_indices = itr.product(x_indices, y_indices)
list_cell_indices = list(cell_indices)
rdm.shuffle(list_cell_indices)
for i in range(25):
print list_cell_indices[i]
So based on the current responses and my attempt to translate perl which I know nothing about, I'm understanding that the best I can do is the following:
import numpy as np
import itertools as itr
import random as rdm
x_count = 10000
y_count = 5000
sample_count = 10000
keep_probability = 0.01
tried_cells = set()
kept_cells = set()
while len(kept_cells) < sample_count:
x = rdm.randint(0, x_count)
y = rdm.randint(0, y_count)
if (x, y) in tried_cells:
pass
else:
tried_cells.add((x, y))
keep = rdm.random() < keep_probability
if keep:
kept_cells.add((x,y))
print "worked"
In most cases the processing time and memory used isn't that bad. Maybe I could do a check of the average cell keep_probability and sample_count and throw an error for difficult cases.