Generate more evenly distributed random sequence

Question

I have read this question and thought that this algorithm is not optimal. For example, 'f 20 100' returns lists like [85,14,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]; I frequently get a long zero tail as the result.

Well I thought it is an interesting task and decided to create my own implementation :)

I decided to divide the number in the random proportion:

g 1 sum = return [sum]
g n sum = do
    prop <- randomRIO(0.0, 1.0)
    k1 <- g (round prop * n) (round( prop * sum))
    k2 <- g (n - (round prop * n)) (sum - (round prop * sum))
    return k1 ++ k2

But my code doesn't work:

   Couldn't match expected type `IO [a0]' with actual type `[a1]'
    In the expression: return k1 ++ k2
    In the expression:
      do { prop <- randomRIO (0.0, 1.0);
           k1 <- g (round prop * n) (round (prop * sum));
           k2 <- g (n - (round prop * n)) (sum - (round prop * sum));
             return k1 ++ k2 }
    In an equation for `g':
        g n sum
          = do { prop <- randomRIO (0.0, 1.0);
                 k1 <- g (round prop * n) (round (prop * sum));

                 k2 <- g (n - (round prop * n)) (sum - (round prop * sum));
                 .... }

As I see I cannot concatenate IO lists. How can I fix it ?

have you seen [my answer](http://stackoverflow.com/a/13193995/849891) at that question? what do you think of its proposed solution? — Will Ness, Nov 02 '12 at 12:17
You might like [Distributing cookies](http://mathlesstraveled.com/2009/04/01/distributing-cookies/) and [Distributing cookies: solutions](http://mathlesstraveled.com/2009/04/14/distributing-cookies-solutions/). Somewhere around here I've got some code I convinced Brent to write for me that actually chooses uniformly from among all possible lists, which I will post as an answer as soon as I find it. — Daniel Wagner, Nov 02 '12 at 16:20

score 6 · Accepted Answer · answered Nov 02 '12 at 10:41

The type error you ask about is caused by the fact that you should write

return (k1 ++ k2)

rather than

return k1 ++ k2

Note that return is just a function in Haskell, and function application binds stronger than any other infix operator, so your code reads to Haskell as if you had written

(return k1) ++ k2

Note, however, that there are further issues with your code.

AndrewC · Answer 2 · 2012-11-02T20:12:52.917

First let's do a few imports for later:

import Control.Applicative
import Control.Monad
import System.Random
import Data.List hiding (partition)

Code Fix

Always remember function application has higher precedence than infix operators: return k1 ++ k2 means (return k1) ++ k2 and round prop * n means (round prop) * n. You can use $ to separate a function from the expression you're applying it to, because f $ x = f x but $ has really low precedence. You can use return $ k1 ++ k2, for example.

You were getting your Ints and Doubles mixed up a bit (round prop * n) rounds the proportion before multiplying, but you want to multiply first, so you'd need to apply fromIntegral to the n. I made a seperate function for this

(.*) :: Double -> Int -> Int
d .* i = floor $ d * fromIntegral i

So now instead of (round prop * n) you can use (prop .* n). It cleans up the code a bit, and means if it's wrong, we can fix it in one function instead of all over the place.

I've provided a type signature to make error messages more informative, and also a second base case - it wasn't terminating because sometimes rounding caused it to ask for a list of length 0.

partition1 :: Int -> Int -> IO [Int]
partition1 0 total = return []
partition1 1 total = return [total]
partition1 n total = do
    prop <- randomRIO(0.0, 1.0)
    k1 <- partition1 (prop .* n) (prop .* total)
    k2 <- partition1 (n - (prop .* n)) (total - (prop .* total))
    return $ k1 ++ k2

I also took the liberty of giving it a more descriptive name.

Getting the right totals

Unfortunately, this compiles, but as Will Ness pointed out in comments, there's a glitch: it typically gives you numbers that total to less than the total. It turns out that's because you will call partition 0 n for non-zero n, asking for a list of length 0 that sums to something non-zero. Oops.

The idea behind your algorithm is to split the list and the total randomly, but keep the proportions the same for both, to keep the distribution from being one-sided, (the problem in the original question).

Let's use that idea, but prevent it from asking for a length zero - we'll need prop to be neither 0 nor 1.

partition2 :: Int -> Int -> IO [Int]
partition2 0 total = return []
partition2 1 total = return [total]
partition2 n total = do
    new_n <- randomRIO(1,n-1)
    let prop = fromIntegral new_n / fromIntegral n
    k1 <- partition2 new_n (prop .* total)
    k2 <- partition2 (n - new_n) (total - (prop .* total))
    return $ k1 ++ k2

Now it can never give us the wrong total. Hooray!

Random isn't the same as fair

But oops: partition2 18 10000 gives us

[555,555,555,555,556,556,555,556,556,556,555,555,556,556,555,556,556,556]

The problem is that fair isn't the same as random. This algorithm is very fair, but not very random. Let's let it choose the proportion seperately from the length:

partition3 :: Int -> Int -> IO [Int]
partition3 0 total = return []
partition3 1 total = return [total]
partition3 n total = do
    new_n   <- randomRIO(1,n-1)
    new_total <- randomRIO(0,total)  -- it's fine to have zeros.
    k1 <- partition3 new_n new_total
    k2 <- partition3 (n - new_n) (total - new_total)
    return $ k1 ++ k2

That looks better: partition3 15 20000 gave me

[1134,123,317,725,1031,3897,8089,2111,164,911,25,0,126,938,409]

Random isn't fair, but it isn't biased either

This is apparently much better, but essentially the binary partitioning we're doing is introducing bias.

You can test a lot of runs by looking at

check :: (Int -> Int -> IO [Int]) -> Int -> Int -> Int -> IO ()
check f n total times = mapM_ print =<< map average.transpose.map (righttotal total) <$> replicateM times (f n total)
   where average xs = fromIntegral (sum xs)/fromIntegral total

righttotal tot xs | sum xs == tot = xs
                  | otherwise = error $ "wrong total: " ++ show (sum xs)

which on one run of check partition3 11 10000 1000 gave me

Without going into loads of test data and analysis, interesting though it was, there's a disproportionate amount of 0 when n isn't a factor of total, and the distribution isn't even, it's cup shaped - the algorithm ends up cramming data at one end.

A way out

Instead of choosing bit by bit how much is going in sublists, let's generate all the places a subtotal will end all at once. Of course one of them has to be the total, and we'd better sort them once we generate them.

stopgaps :: Int -> Int -> IO [Int]
stopgaps parts total = sort.(total:) <$> replicateM (parts-1) (randomRIO (0,total))

Here I use replicateM :: Int -> m a -> m [a] to generate parts-1 random numbers in the correct range.

I'd like to plug an unsung hero:

mapAccumL :: (acc -> x -> (acc, y)) -> acc -> [x] -> (acc, [y])

for accumulating along a list, generating a new list.

gapsToLengths :: [Int] -> (Int,[Int])
gapsToLengths = mapAccumL between 0
   where between previous new = (new,new - previous)

partition4 :: Int -> Int -> IO [Int]
partition4 parts total = snd.gapsToLengths <$> stopgaps parts total

Does it work?

A few test runs of partition4 11 10000, pretty printed :

[ 786,   20,  607,  677, 1244, 1137,  990,   50, 1716,  813, 1960]
[ 406,  110, 2556,  126, 1289,  567,  348, 1230,  171,  613, 2584]
[ 368, 1794,  136, 1266,  583,   93, 1514,   66, 1594, 1685,  901]
[ 657, 1296, 1754,  411,  691, 1865,  531,  270, 1941,  286,  298]
[2905,  313,  842,  796,  698, 1104,   82, 1475,   22,  619, 1144]
[1411,  966,  530,  129,   81,  561, 1779, 1179,  301,  607, 2456]
[1143,  409,  903,   27,  855,  354,  887, 1898, 1880,  301, 1343]
[ 260,  643,   96,  323,  142,   74,  401,  977, 3685, 2690,  709]
[1350,  979,  377,  765,  137, 1295,  615,  592, 2099, 1088,  703]
[2411,  958,  330, 1433, 1355,  680, 1075,   41,  988,   81,  648]

That looks random. Let's check there's no bias:

check partition4 11 10000 1000
92.6425
93.4513
92.3544
90.8508
88.0297
91.7731
88.7939
86.5268
86.3502
95.2499
93.9774

At last!

I tried this code with `*Main>x <- partition 26 301`, and got a list `[9,11,14,11,6,10,10,10,13,18,7,16,6,19,7,14,11,17,14,10,6,6,18,14,13,6]` of 26 numbers back, but it summed up to 296. — Will Ness, Nov 02 '12 at 14:41
Here's the sum of the returned list, for 20 consecutive runs: 283,276,267,240,252,258,252,290,255,254,286,274,281,281,261,261,274,264,281,242. THe length is always 26. — Will Ness, Nov 02 '12 at 14:54
I wonder how does this compare with [the other answer](http://stackoverflow.com/a/13200592/849891)...? — Will Ness, Nov 03 '12 at 10:21
@WillNess The other answer gives very similar performance for time and space in ghci (although ghci doesn't measure very accurately) and its distribution is unbiased. That's because it's essentially the same algorithm. The offset zipWith is slicker than mapAccumL, but I do like mapAccumL. — AndrewC, Nov 03 '12 at 11:05
thanks; I'm still unclear about the distribution - distribution of what? it can't be of numbers, after all only 1 `sum` (at the most) can appear in any given list, but there can be more than 1 zeroes... so distribution of numbers *should* be skewed to the left it seems, just like [here](http://stackoverflow.com/questions/5622608/choosing-n-numbers-with-fixed-sum)...(?) And there must be more 5s, say, than 0s, because 0s alone can't get to the required sum total. Your `check` just checks the average *at a position* in the list... (btw you should delete by `times` - it's *transposed*). — Will Ness, Nov 03 '12 at 15:45
about `check`: I used `average xs = z - (fromIntegral (sum xs)/fromIntegral times) where {z::Float; z = fromIntegral total/fromIntegral n}`. — Will Ness, Nov 03 '12 at 15:47
(I see about the distribution - you indeed meant among the positions in the list - the original problem in the original code.) :) — Will Ness, Nov 03 '12 at 15:51

Daniel Wagner · Answer 3 · 2012-11-03T10:37:25.753

0

Here's a section of a module I have for easing my use of QuickCheck. The interesting bits of the code are written by the peerless Brent Yorgey, and uses a binomial number system as described in the blog posts linked in my comments above. The pickDistribution function is some sample glue code for generating lists of non-negative numbers with a particular weight (you can use resize to choose a particular weight).

{-# LANGUAGE MultiParamTypeClasses #-}
module QuickCheckUtils where

import Control.Monad.Reader
import Test.QuickCheck
import Test.QuickCheck.Gen

instance MonadReader Int Gen where
    ask = MkGen (\r n -> n)
    local f (MkGen g) = MkGen (\r -> g r . f)

-- pickDistribution n chooses uniformly at random from all lists of length n of
-- non-negative numbers that sum to the current weight
pickDistribution :: Int -> Gen [Int]
pickDistribution n = do
    m <- ask
    let j = fromIntegral (m+n-1)
        k = fromIntegral (n-1)
    i <- choose (1, binom j k)
    return . map fromIntegral . combToComposition $ toComb j k (i-1)

-- code from Brent {{{
-- Comb n cs represents a choice cs of distinct numbers between 0 and
-- (n-1) inclusive.
data Comb = Comb Integer [Integer] deriving Show
type Comp = [Integer]

-- Convert a choice of (n-1) out of (m+n-1) things into a composition
-- of m, that is, an ordered list of natural numbers with sum m.
combToComposition :: Comb -> Comp
combToComposition (Comb n cs) = map pred $ zipWith (-) cs' (tail cs')
    where cs' = [n] ++ cs ++ [-1]

-- Convert a number into "base binomial", i.e. generate the
-- ith combination in lexicographical order.  See TAOCP 7.2.1.3, Theorem L.
toComb :: Integer -- ^ Total number of things
       -> Integer -- ^ Number to select
       -> Integer -- ^ Index into the lexicographic ordering of combinations
       -> Comb    -- ^ Corresponding combination
toComb n k i = Comb n (toComb' k i (n-1) (binom (n-1) k))

binom _ 0 = 1
binom 0 _ = 0
binom n k = binom (n-1) (k-1) * n `div` k

toComb' 0 _ _ _ = []
toComb' k i j jCk
    | jCk > i   =     toComb' k     i         (j-1) (jCk * (j-k) `div` j)
    | otherwise = j : toComb' (k-1) (i - jCk) (j-1) (jCk *     k `div` j)
-- }}}

edited Nov 03 '12 at 10:37

answered Nov 02 '12 at 17:43

Daniel Wagner

145,880
9
220
380

I think that could do with a little unpacking, or at least a `usage` statement. – AndrewC Nov 02 '12 at 22:34
On the face of it, you're choosing without repetition, so the probability of any of your numbers being 0 is 0. – AndrewC Nov 02 '12 at 22:36
@AndrewC The comment on `pickDistribution` is the usage statement. The unpacking is available, as I mentioned in the answer, in the blog posts and TAoCP. While it's true that this particular glue code chooses all positive numbers, choosing all non-negative numbers is an easy modification: ask for lists of length `n` that add up to the desired sum plus `n`, then subtract one from each. – Daniel Wagner Nov 02 '12 at 23:32
Good workaround; perhaps that code would be worth adding to your answer? (But in a different code block!) – AndrewC Nov 02 '12 at 23:59
It's not a usage statement. `sample' (pickDistribution 2)` gave me `[[0,0],[1,1],[3,1],[6,0],[3,5],[5,5],[12,0],[9,5],[4,12],[3,15],[10,10]]`. "the current weight" appears to be `[0,2..20]` every time. How do you use this to calculate a list of length 5 that sums to 21? Or a list of length 20 that sums to 100? It's a lovely piece of maths and the blog was a great read, thanks, but please provide an explicit solution to the problem of replacing f or g. – AndrewC Nov 03 '12 at 00:36
@AndrewC You can use [resize](http://hackage.haskell.org/packages/archive/QuickCheck/2.5.1.1/doc/html/Test-QuickCheck.html#v:resize) to choose the weight. – Daniel Wagner Nov 03 '12 at 10:23
..and how do you choose the length? – AndrewC Nov 03 '12 at 11:18
@AndrewC As the "usage statement" already says, "`pickDistribution n` chooses uniformly at random from all lists of length `n`..." ;-) – Daniel Wagner Nov 03 '12 at 12:38
Sorry, I meant "How do you choose how many lists to return?" but I found it: `head <$> sample' (vectorOf samplesize (resize total (pickDistribution n)))` – AndrewC Nov 03 '12 at 13:56