SQL: generate N random non-negative integers adding up to a provided total M

Question

The problem of generating random integers adding up to a given total keeps popping up in different programming languages and there are SO solutions for R, Java and Python.

This question seeks a "vanilla" SQL solution restricted to a single SELECT statement with optional common table expressions [CTEs].

The input is a 1x2 table call inputs: a single row with columns M and N of int type. The result should be an Nx1 table with a single column i for the integers that add up to M.

Thanks for the information. But here you are supposed to ask a question — RiggsFolly, Jan 13 '17 at 18:22
+++ This is actually a really tough SQL question. I'll take the challenge! — Steven Moseley, Jan 13 '17 at 18:35
@RiggsFolly this question asks for a SQL solution using specific language capabilities. It specifies the input and the output. It highlights SO solutions for three other languages as an aid to answering the question. What exactly is missing, in your opinion? A question mark? — Sim, Jan 13 '17 at 19:09
@Strawberry why relates to choice sampling that is part of a complex big data analytics pipeline. The details are not relevant to the question. In fact, I've purposefully abstracted the problem statement to make it generic and useful for other contexts. — Sim, Jan 13 '17 at 19:37
sql is for the storage and retrieval of data, and nothing else. It is not a panacea. — Strawberry, Jan 13 '17 at 22:05
No, whats missing is any attempt to solve the problem for yourself! Here at SO we help you fix your attempts at solving a problem, we dont write code for you for free — RiggsFolly, Jan 13 '17 at 22:32
Is the zero value allowed? @StevenMoseley Under which RDBMS do you get this behaviour? — Fabian Pijcke, Jan 14 '17 at 10:28
@FabianPijcke 0 value must be allowed. I have updated the question with this information. — Sim, Jan 14 '17 at 21:08

Fabian Pijcke · Accepted Answer · 2017-01-14T10:14:49.573

I propose the following query (PostgreSQL):

WITH ZeroToOne (m, n, y) AS (
  SELECT m, n, random()
  FROM inputs
  CROSS JOIN generate_series(1, n)
), SumToM (m, n, y, x) AS (
  SELECT m, n, y, y * m / sum(y) OVER (PARTITION BY m, n)
  FROM zerotoone
), MissingToM (m, n, l) AS (
  SELECT m, n, m - sum(floor(x))
  FROM sumtom
  GROUP BY m, n
)
SELECT m, n, y, x, l,
  CASE
    WHEN row_number() OVER (PARTITION BY m, n ORDER BY x - floor(x) DESC) > l
    THEN floor(x)
    ELSE ceil(x)
  END AS v
FROM missingtom
NATURAL JOIN sumtom;

The only interesting values are m, n and v; I left the other values for explanation purposes.

I will step through the query with the following input cases as running examples:

SELECT * FROM inputs;
 m  | n 
----+---
 20 | 4
 30 | 4
 42 | 3
(3 rows)

The first CTE (ZeroToOne) computes n random values in the range [0, 1] for each input case and call these values y:

 m  | n |          y          
----+---+---------------------
 20 | 4 |   0.374425032641739 
 20 | 4 |   0.644279096741229 
 20 | 4 |   0.626386553514749 
 20 | 4 |   0.320786282420158 
 30 | 4 |   0.848764919675887 
 30 | 4 |   0.268079651053995 
 30 | 4 |   0.250213726423681 
 30 | 4 |   0.497460773680359 
 42 | 3 |   0.571454062592238 
 42 | 3 | 0.00338772451505065 
 42 | 3 |   0.139226260595024

The second CTE (SumToM) multiplies each y value by mand divides the result by the sum of values for the input case. As a result, summing up all of the x for an input couple (m, n) gives m:

 m  | n |          y          |         x         
----+---+---------------------+-------------------
 20 | 4 |   0.374425032641739 |  3.80924177094873 
 20 | 4 |   0.644279096741229 |  6.55462277759638 
 20 | 4 |   0.626386553514749 |  6.37259161753762 
 20 | 4 |   0.320786282420158 |  3.26354383391728 
 30 | 4 |   0.848764919675887 |  13.6565766414436 
 30 | 4 |   0.268079651053995 |   4.3133855037604 
 30 | 4 |   0.250213726423681 |  4.02592384820881 
 30 | 4 |   0.497460773680359 |  8.00411400658722 
 42 | 3 |   0.571454062592238 |  33.6117414945302 
 42 | 3 | 0.00338772451505065 | 0.199258922297338 
 42 | 3 |   0.139226260595024 |  8.18899958317244

It is obvious that m is greater than the sum of the integer parts of the x values. It is also easy to see that the difference between the two sums (sum of the x values and sum of the integer parts of the x values) is less than n. So the idea now is to count hom many numbers have to be rounded up and how many have to be rounded down. The l value of the third CTE (MissingToM) is the number of values to be rounded up:

 m  | n | l 
----+---+---
 20 | 4 | 2 
 30 | 4 | 1 
 42 | 3 | 1

To ensure that the distribution of numbers remains uniform, we round up the numbers which have the highest fractional part with the final query:

 m  | n |          y          |         x         | l | v  
----+---+---------------------+-------------------+---+----
 20 | 4 |   0.374425032641739 |  3.80924177094873 | 2 |  4
 20 | 4 |   0.644279096741229 |  6.55462277759638 | 2 |  7
 20 | 4 |   0.626386553514749 |  6.37259161753762 | 2 |  6
 20 | 4 |   0.320786282420158 |  3.26354383391728 | 2 |  3
 30 | 4 |   0.848764919675887 |  13.6565766414436 | 1 | 14
 30 | 4 |   0.268079651053995 |   4.3133855037604 | 1 |  4
 30 | 4 |   0.250213726423681 |  4.02592384820881 | 1 |  4
 30 | 4 |   0.497460773680359 |  8.00411400658722 | 1 |  8
 42 | 3 |   0.571454062592238 |  33.6117414945302 | 1 | 34
 42 | 3 | 0.00338772451505065 | 0.199258922297338 | 1 |  0
 42 | 3 |   0.139226260595024 |  8.18899958317244 | 1 |  8

As the query will fail if the same configuration (m, n) occurs several times in the inputs table, I add a primary key constraint on it:

ALTER TABLE inputs ADD PRIMARY KEY (m, n);

Nice! You've created a solution that goes beyond the question, solving the problem for a number of unique (m, n) pairs at the same time. The selective rounding up/down logic is a nice touch also. — Sim, Jan 14 '17 at 21:31

score 2 · Answer 2 · 2017-01-13T22:39:48.237

This seems to do it (Postgres):

with recursive inputs (n,m) as (
  values (10,100)
), worker (i, total, rn) as (

   select val, val as total, 1 as rn
   from ( 
     select floor(random() * (m/n - 1) + 1)
     from inputs
   ) as x (val)

   union all

   select c.val, p.total + c.val, p.rn + 1
   from worker p
     join lateral (
       select floor(random() * (i.m - p.total - 1) + 1)
       from inputs i
     ) c (val) on p.rn < (select n from inputs)
)
select *
from worker
order by rn;

However this might be considered cheating because most of the times the total sum of the values (100 in the above example) is already reached after 6 or 7 rows (sometimes earlier sometimes later). Which means that the "random" numbers at the end aren't that random any more.

One of the good results is:

  i | total | rn
----+-------+---
  3 |     3 |  1
  1 |     4 |  2
 40 |    44 |  3
 33 |    77 |  4
 11 |    88 |  5
  2 |    90 |  6
  4 |    94 |  7
  3 |    97 |  8
  2 |    99 |  9
  1 |   100 | 10

But sometimes it's as bad as:

  i | total | rn
----+-------+---
  7 |     7 |  1
 59 |    66 |  2
 23 |    89 |  3
 10 |    99 |  4
  1 |   100 |  5
  0 |   100 |  6
  0 |   100 |  7
  0 |   100 |  8
  0 |   100 |  9
  0 |   100 | 10

But I didn't see any requirement that the random value has to be unique.

Online Example: http://rextester.com/VRBV22166

Great use of a lateral join here. I like the parsimony of the solution. I gave the answer to Fabian's attempt because his solution does not exhibit the accumulation of zeros at the end. — Sim, Jan 14 '17 at 21:34

David דודו Markovitz · Answer 3 · 2017-01-17T16:37:00.083

create table inputs (m int,n int);
insert into inputs (m,n) values (100,10);

select      i-lag(i,1,0) over (order by i)  as i

from       (select      *
            from       (select      i
                        from        generate_series (1,(select m from inputs)) as gs(i)
                        order by    random()
                        limit       (select n from inputs)-1
                        ) t

            union all

            select      100
            ) t

 order by   i

Sample result

+----+
| i  |
+----+
| 1  |
+----+
| 1  |
+----+
| 2  |
+----+
| 3  |
+----+
| 4  |
+----+
| 12 |
+----+
| 13 |
+----+
| 16 |
+----+
| 18 |
+----+
| 30 |
+----+

Very limited access, so just the idea -

Generate numbers 1..n-1
Order by random
Choose first m-1 numbers UNION ALL n
Order the numbers
Compute the distance between each 2 following numbers (LAG - for the first number, use 0 as the previous number)

Finally was available to write the code. Please take a look. — David דודו Markovitz, Jan 17 '17 at 16:32

SQL: generate N random non-negative integers adding up to a provided total M

3 Answers3