Given an array of n
word-frequency pairs:
[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]
where wi
is a word, fi
is an integer frequencey, and the sum of the frequencies ∑fi = m
,
I want to use a pseudo-random number generator (pRNG) to select p
words wj0, wj1, ..., wjp-1
such that
the probability of selecting any word is proportional to its frequency:
P(wi = wjk) = P(i = jk) = fi / m
(Note, this is selection with replacement, so the same word could be chosen every time).
I've come up with three algorithms so far:
Create an array of size
m
, and populate it so the firstf0
entries arew0
, the nextf1
entries arew1
, and so on, so the lastfp-1
entries arewp-1
.[ w0, ..., w0, w1,..., w1, ..., wp-1, ..., wp-1 ]
Then use the pRNG to selectp
indices in the range0...m-1
, and report the words stored at those indices.
This takesO(n + m + p)
work, which isn't great, sincem
can be much much larger than n.Step through the input array once, computing
mi = ∑h≤ifh = mi-1 + fi
and after computingmi
, use the pRNG to generate a numberxk
in the range0...mi-1
for eachk
in0...p-1
and selectwi
forwjk
(possibly replacing the current value ofwjk
) ifxk < fi
.
This requiresO(n + np)
work.- Compute
mi
as in algorithm 2, and generate the following array on n word-frequency-partial-sum triples:[ (w0, f0, m0), (w1, f1, m1), ..., (wn-1, fn-1, mn-1) ]
and then, for each k in0...p-1
, use the pRNG to generate a numberxk
in the range0...m-1
then do binary search on the array of triples to find thei
s.t.mi-fi ≤ xk < mi
, and selectwi
forwjk
.
This requiresO(n + p log n)
work.
My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?
...
blocks (for inline) or blocks (for fullline). – rampion May 16 '09 at 15:34