I tried to implement a simple reservoir sampling in haskell following http://jeremykun.com/2013/07/05/reservoir-sampling/ (note that the algorithm shown is possibly semantically incorrect)
According to this: Iterative or Lazy Reservoir Sampling lazy reservoir sampling is impossible unless you know the population size ahead of time.
Even so, I'm not understanding why (operationally speaking) the below sampleReservoir
doesn't work on infinite lists. Just where exactly is laziness broken?
import System.Random (randomRIO)
-- equivalent to python's enumerate
enumerate :: (Num i, Enum i) => i -> [e] -> [(i, e)]
enumerate start = zip [start..]
sampleReservoir stream =
foldr
(\(i, e) reservoir -> do
r <- randomRIO (0.0, 1.0) :: IO Double
-- randomRIO gets confused about 0.0 and 1.0
if r < (1.0 / fromIntegral i) then
fmap (e:) reservoir
else
reservoir)
(return [])
(enumerate 1 stream)
The challenge and test is fmap (take 1) $ sampleReservoir [1..]
.
Furthermore, if reservoir sampling can't be lazy, what can take in a lazy list and produce a sampled lazy list?
I get the idea that there must be a way of making the above function lazy in the output as well, because I could change this:
if r < (1.0 / fromIntegral i) then
fmap (e:) reservoir
else
To:
if r < (1.0 / fromIntegral i) then
do
print e
fmap (e:) reservoir
This shows results as the function is iterating over the list. Using coroutine abstraction, perhaps instead of print e
there can be a yield e
, and the rest of the computation can be held as a continuation.