
I tried to implement simple reservoir sampling in Haskell, following http://jeremykun.com/2013/07/05/reservoir-sampling/ (note that the algorithm shown there may be semantically incorrect).

According to this question: Iterative or Lazy Reservoir Sampling, lazy reservoir sampling is impossible unless you know the population size ahead of time.

Even so, I'm not understanding why (operationally speaking) the below sampleReservoir doesn't work on infinite lists. Just where exactly is laziness broken?

import System.Random (randomRIO)

-- equivalent to Python's enumerate
enumerate :: (Num i, Enum i) => i -> [e] -> [(i, e)]
enumerate start = zip [start..]

sampleReservoir :: [e] -> IO [e]
sampleReservoir stream = 
    foldr 
        (\(i, e) reservoir -> do 
            r <- randomRIO (0.0, 1.0) :: IO Double -- annotation needed, otherwise randomRIO gets confused about 0.0 and 1.0
            if r < (1.0 / fromIntegral i) then
                fmap (e:) reservoir
            else 
                reservoir) 
        (return []) 
        (enumerate 1 stream)

The challenge and test is fmap (take 1) $ sampleReservoir [1..].

Furthermore, if reservoir sampling can't be lazy, what can take in a lazy list and produce a sampled lazy list?

I get the idea that there must be a way of making the above function lazy in the output as well, because I could change this:

if r < (1.0 / fromIntegral i) then
    fmap (e:) reservoir
else 
    reservoir

To:

if r < (1.0 / fromIntegral i) then
    do 
        print e
        fmap (e:) reservoir

This shows results as the function iterates over the list. Using a coroutine abstraction, perhaps instead of print e there could be a yield e, and the rest of the computation could be held as a continuation.
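
For what it's worth, here is a minimal sketch of that idea using the pipes library (the name sampleProducer and the choice of pipes are additions for illustration, not part of the question): each element that wins its coin flip is yielded immediately, so a downstream consumer can take a prefix without the whole list being traversed first.

import Control.Monad (when)
import Pipes
import qualified Pipes.Prelude as P
import System.Random (randomRIO)

-- Sketch only: emit each "winning" element as soon as its coin flip succeeds,
-- instead of collecting everything into one IO [e] result.
sampleProducer :: [e] -> Producer e IO ()
sampleProducer stream = mapM_ step (zip [1 :: Int ..] stream)
  where
    step (i, e) = do
        r <- lift (randomRIO (0.0, 1.0 :: Double))
        when (r < 1 / fromIntegral i) (yield e)

-- e.g. runEffect $ sampleProducer [1..] >-> P.take 5 >-> P.print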

  • well the algorithm holds one coin till it's at the end (and maybe exchanges it with some other) - the key point is: *till the end* - so you have to traverse the complete list ... and of course this algorithm depends on exactly this in order to give you a uniform distribution – Random Dev Nov 25 '15 at 16:11
  • From the way it's written above, it doesn't seem like anything is getting replaced. That's something that was confusing me, why Jeremy Kun's example didn't actually maintain any particular reservoir to be mutated. So I'm not understanding your comment. – CMCDragonkai Nov 25 '15 at 16:19
  • I am talking about the *reservoir-sampling* algorithm as described in your first link - it seems you removed the vital parts - but yours is not working because you try to materialize the complete list - why don't you try and do it without `IO` (pass the generator around with [`randomR`](https://hackage.haskell.org/package/random-1.1/docs/System-Random.html#v:randomR))? – Random Dev Nov 25 '15 at 16:32
  • Operationally, your `sampleReservoir` can be made to be suitably lazy; however, it has the wrong denotation. It does not produce an element drawn uniformly at random from the input list. Indeed, it produces a list and not an element; but even the list it produces does not contain uniformly distributed values. – Daniel Wagner Nov 25 '15 at 16:49

1 Answer


The problem is that the IO monad maintains a strict sequence between actions. Writing fmap (e:) reservoir will first execute all of the effects associated with reservoir, which will be infinite if the input list is infinite.
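
As a stripped-down illustration of this point (added here, not from the original post), even with no randomness at all, a foldr that only ever conses inside fmap can never deliver its first element, because each fmap must wait for the entire inner action to finish:

-- Loops forever on an infinite list: fmap (x:) acc has to run all of acc
-- before the outer list exists, so even take 1 never gets a cons cell.
diverges :: IO [Int]
diverges = foldr (\x acc -> fmap (x:) acc) (return []) [1 ..]

fmap (take 1) diverges hangs in exactly the same way as sampleReservoir [1..].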

I was able to fix this with liberal use of unsafeInterleaveIO, which allows you to break the semantics of IO:

import System.IO.Unsafe (unsafeInterleaveIO)

sampleReservoir2 :: [e] -> IO [e]
sampleReservoir2 stream = 
    foldr 
        (\(i, e) reservoir -> do 
            r <- unsafeInterleaveIO $ randomRIO (0.0, 1.0) :: IO Double -- randomRIO gets confused about 0.0 and 1.0
            if r < (1.0 / fromIntegral i) then unsafeInterleaveIO $ do
                rr <- reservoir
                return (e:rr)
            else 
                reservoir) 
        (return []) 
        (enumerate 1 stream)
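
With the interleaving in place, the original test terminates (a usage sketch, assuming enumerate and the imports from the question are also in scope):

-- The head is almost always the first element of the input, since the first
-- coin flip succeeds with probability 1 (see the aside at the end).
testInterleaved :: IO ()
testInterleaved = fmap (take 1) (sampleReservoir2 [1..]) >>= print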

Obviously, this will allow the interleaving of IO actions, but since all you're doing is generating random numbers it shouldn't matter. However, this solution isn't very satisfactory; the correct solution is to refactor your code somewhat. You should generate an infinite list of random numbers, then consume that infinite list (lazily) with foldr:

import Control.Monad.Random (MonadRandom, getRandomRs)

sampleReservoir3 :: MonadRandom m => [a] -> m [a]
sampleReservoir3 stream = do
  ws <- getRandomRs (0, 1 :: Double) 
  return $ foldr 
     (\(w, (i, e)) reservoir -> 
        (if w < (1 / fromIntegral i) then (e:) else id) reservoir
     ) 
     []
     (zip ws $ enumerate 1 stream)

This can also (equivalently) be written as

import System.Random (newStdGen, randomRs)

sampleReservoir4 :: [a] -> IO [a] 
sampleReservoir4 stream = do
  seed <- newStdGen 
  let ws = randomRs (0, 1 :: Double) seed 
  return $ foldr 
     (\(w, (i, e)) reservoir -> 
        (if w < (1 / fromIntegral i) then (e:) else id) reservoir
     ) 
     []
     (zip ws $ enumerate 1 stream)
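
Either version can then be consumed lazily, even on an infinite input. For example (a usage sketch; the first call relies on the MonadRandom instance for IO provided by the MonadRandom package):

testLazySampling :: IO ()
testLazySampling = do
  xs <- sampleReservoir3 [1 ..]   -- uses IO's MonadRandom instance
  ys <- sampleReservoir4 [1 ..]
  print (take 5 xs)               -- forces only as much of the fold as needed
  print (take 5 ys)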

As an aside, I'm not sure about the correctness of the algorithm, since it seems to always return the first element of the input list first. Not very random.

  • in this version the probability to return the first is extremely high (a random number in [0,1] < 1) ;) – Random Dev Nov 25 '15 at 16:36
  • That's cool! Yeah, I agree, I am a bit wary about this algorithm; it seems strange that the probability starts high and goes very low. – CMCDragonkai Nov 25 '15 at 16:50
  • I revisited this answer in more detail now. Does the refactored version work because, instead of fmapping into an infinitely nested IO structure (forcing a strict evaluation from the inside out), we instead have a list data structure that simply contains random values produced by an IO computation? And somehow this allows lazy consumption of IO values when deconstructing the list? – CMCDragonkai Nov 27 '15 at 06:46
  • In `sampleReservoir3` there is no lazy consumption of IO computation. When you write `ws <- getRandomRs (0, 1 :: Double)` you get a single IO computation which, when executed, produces an infinite random list of doubles between 0 and 1. Essentially what happens is that `getRandomRs` performs an IO action to get a new random seed, then *purely* and *lazily* returns an infinite list of random values using that seed. I've added another equivalent version of the function. – user2407038 Nov 27 '15 at 12:23
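
A quick way to check that last point (an illustration added here, not part of the thread): getRandomRs returns immediately, and only the prefix of its infinite result that you actually force ever gets computed.

import Control.Monad.Random (getRandomRs)

demoLazyRandoms :: IO ()
demoLazyRandoms = do
  ws <- getRandomRs (0, 1 :: Double)  -- one IO action to obtain a seed
  print (take 3 ws)                   -- only three values are ever generated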