12

I have a table of let's say 250 URLs :

create table url (
  id serial,
  url varchar(64)
)

These URLs correspond each to a website. Each of the websites have a different popularity. Let's say that the id=125 (the one centered on the gaussian) is the most popular, the ones at id=1 or id=250 are the least popular.

I want to populate a "log" table like the following one with a value of url among the ones provided in the "url" table, but taking into account that different URLS might appear more frequently (for exemple url whose id is 125 will be the most popular).

create table log (
  id serial,
  url_id integer
)

I want to avoid using random() since it is uniform and not very "real".

How can this be achieved with Postgresql ?

vyegorov
  • 21,787
  • 7
  • 59
  • 73
SCO
  • 1,832
  • 1
  • 24
  • 45
  • 2
    Why do you assume that popularity or ranking has a Gaussion distribution? – wildplasser Feb 24 '12 at 13:59
  • 2
    You can calculate any distribution using the PDF of that distribution using RAND (which produces values between 0 and 1, right?). For gaussian distro, that would be 1/2(1 + erf(x-mu)/sqrt(2sigma^2)) - see http://en.wikipedia.org/wiki/Normal_distribution – Roman Luštrik Feb 24 '12 at 14:05
  • @wildplasser : because that law seems pretty good for what I try to model. I admit it could have been any other ! – SCO Feb 24 '12 at 15:18

6 Answers6

16

The sum of 12 uniform distributions on the range [0, 1) is a good approximation to a Gaussian distribution bounded in the range [0, 12). This can then easily be re-scaled by multiplying by a constant and then adding/subtracting a constant.

select
    random() + 
    random() + 
    random() +
    random() + 
    random() + 
    random() +
    random() + 
    random() + 
    random() +
    random() + 
    random() + 
    random();

http://books.google.com/books?id=EKA-yeX2GVgC&pg=PA185&lpg=PA185&dq=%22sum+of+12+uniform+random+variables%22&source=bl&ots=YfwwE0fBB3&sig=HX9J9Oe6x316kVL8uamDU_GOsn4&hl=en&sa=X&ei=bJLZUur1GozaqwGHm4DQDQ&ved=0CEUQ6AEwAw#v=onepage&q=%22sum%20of%2012%20uniform%20random%20variables%22&f=false

user2179977
  • 656
  • 6
  • 14
  • 2
    I accepted this because I found it the easiest and most elegant way , whatever language is used. Thanks to all other contributors. – SCO Aug 21 '14 at 12:47
9

I was searching for a way to generate numbers according to a gaussian distribution and first found this post. This is why I share what I've found just after:

There is, since at least PostgreSQL 8.4, an additional module called tablefunc (http://www.postgresql.org/docs/current/static/tablefunc.html).

It proposes a function normal_rand(n, mean, stddev) generating n pseudo-random numbers using a gaussian distribution (so this function returns a set of values, typically used in the FROM clause). However, if you set n to be 1, it can be used as a function returning a value and not a set of values.

Considering a table nb10 containing 10 records, the two following queries return a set of 10 pseudo-random numbers following a standard gaussian distribution (mean = 0, stddev = 1)

SELECT normal_rand(1, 0, 1) FROM nb10;

and

SELECT * from normal_rand(10, 0, 1);

I hope this could help anyone in the future ... :-)

To answer your question specifically, you could use something like:

SELECT floor(random_rand(1, 0, 1) * 250 + 125);

Unfortunately, it is possible to obtain an answer not in the range [0, 249] using this query. You could for example:

  • use a recursive query, which I find a bit overkill, for discarding values not in the range [0, 249], or
  • do your select into a loop into your host language, accepting the value only if its in the range [0, 249], or
  • use the modulo operator to remain in the [0, 250[ range, I think this is the best solution, although it alternates slightly the gaussian curve. Here is the final query I suggest you use (the modulo/+/modulo tricks is because -x modulo y with x a positive number gives a negative number in PostgreSQL, which is not a bad thing :p):

    SELECT ((floor(normal_rand(1,0,1)*250 + 125)::int % 250) + 250) % 250 as v;
    
Fabian Pijcke
  • 2,920
  • 25
  • 29
2

The simple fact is you want to create your own function which wraps rand() in something that provides a gaussian distribution either implicitly or explicitly.

I don't have the statistical background to tell you how to transform a uniform distribution into a gaussian one, but you'd have to write a converter. Something like as provided at http://www.perlmonks.org/?node_id=26889 (if you don't like Perl you could probably rewrite this in pl/pgsql or even plain SQL).

CREATE OR REPLACE FUNCTION gaussian_rand() RETURNS numeric LANGUAGE PLPERL VOLATILE AS
$$
    my ($u1, $u2);  # uniformly distributed random numbers
    my $w;          # variance, then a weight
    my ($g1, $g2);  # gaussian-distributed numbers

    do {
        $u1 = 2 * rand() - 1;
        $u2 = 2 * rand() - 1;
        $w = $u1*$u1 + $u2*$u2;
    } while ( $w >= 1 );

    $w = sqrt( (-2 * log($w))  / $w );
    $g2 = $u1 * $w;
    $g1 = $u2 * $w;
    # return both if wanted, else just one
    return $g1;

$$;
Chris Travers
  • 25,424
  • 6
  • 65
  • 182
1

The tablefunc module provides a random function with a normal distribution. You can test if it's installed using:

SELECT normal_rand(1, 0, 1); -- generates 1 single value with mean 0 and a standard deviation of 1

The query above should generate a single value in a normal distribution

If you don't have it installed, try this:

CREATE EXTENSION "tablefunc";

Otherwise you will need to log in as a super user and install the module.

mgoldwasser
  • 14,558
  • 15
  • 79
  • 103
  • Oh, this is extremely interesting as well, and opens now horizons to pivot tables for example. Thank you very much ! – SCO Nov 26 '17 at 16:33
1

You can also implement this directly in the build-in language PL/PgSQL

create or replace
function random_gauss( avg real = 0, stddev real = 1 )
returns real language plpgsql as $$
declare x1 real; x2 real; w real;
begin
  loop
    x1 = 2.0 * random() - 1.0;
    x2 = 2.0 * random() - 1.0;
    w = x1*x1 + x2*x2;
    exit when w < 1.0;
  end loop;
  return avg + x1 * sqrt(-2.0*ln(w)/w) * stddev;
end; $$;

with data as (
  select t, random_gauss(100,15)::integer score from generate_series(1,1000000) t
)
select
  score,
  sum(1),
  repeat('=',sum(1)::integer/500) bar
from data
where score between 60 and 140
group by score
order by 1;

rollback;

This gives us something like this from a sample of 1 million numbers with an average of 100 and a standard deviation of 15.

 score |  sum  |                          bar
-------+-------+--------------------------------------------------------
    60 |   764 | =
    61 |   893 | =
    62 |  1059 | ==
    63 |  1269 | ==
    64 |  1524 | ===
    65 |  1740 | ===
    66 |  1990 | ===
    67 |  2346 | ====
    68 |  2741 | =====
    69 |  3160 | ======
    70 |  3546 | =======
    71 |  4109 | ========
    72 |  4633 | =========
    73 |  5252 | ==========
    74 |  5952 | ===========
    75 |  6536 | =============
    76 |  7429 | ==============
    77 |  8140 | ================
    78 |  9061 | ==================
    79 | 10063 | ====================
    80 | 10844 | =====================
    81 | 11911 | =======================
    82 | 13180 | ==========================
    83 | 13880 | ===========================
    84 | 15111 | ==============================
    85 | 16016 | ================================
    86 | 17310 | ==================================
    87 | 18262 | ====================================
    88 | 19615 | =======================================
    89 | 20400 | ========================================
    90 | 21186 | ==========================================
    91 | 22190 | ============================================
    92 | 23103 | ==============================================
    93 | 24150 | ================================================
    94 | 24327 | ================================================
    95 | 24992 | =================================================
    96 | 25505 | ===================================================
    97 | 25868 | ===================================================
    98 | 26146 | ====================================================
    99 | 26574 | =====================================================
   100 | 27104 | ======================================================
   101 | 26599 | =====================================================
   102 | 26345 | ====================================================
   103 | 25940 | ===================================================
   104 | 25485 | ==================================================
   105 | 25157 | ==================================================
   106 | 24827 | =================================================
   107 | 23844 | ===============================================
   108 | 23262 | ==============================================
   109 | 22211 | ============================================
   110 | 21326 | ==========================================
   111 | 20315 | ========================================
   112 | 19496 | ======================================
   113 | 18026 | ====================================
   114 | 17182 | ==================================
   115 | 16026 | ================================
   116 | 14979 | =============================
   117 | 13959 | ===========================
   118 | 12840 | =========================
   119 | 11718 | =======================
   120 | 11169 | ======================
   121 | 10037 | ====================
   122 |  9273 | ==================
   123 |  8041 | ================
   124 |  7402 | ==============
   125 |  6761 | =============
   126 |  5827 | ===========
   127 |  5257 | ==========
   128 |  4736 | =========
   129 |  4153 | ========
   130 |  3494 | ======
   131 |  3103 | ======
   132 |  2731 | =====
   133 |  2379 | ====
   134 |  2064 | ====
   135 |  1696 | ===
   136 |  1481 | ==
   137 |  1246 | ==
   138 |  1024 | ==
   139 |   910 | =
   140 |   788 | =
Kjetil S.
  • 3,468
  • 20
  • 22
0

For people like me who came here to sample from a true, continuous Gaussian distribution in PostgreSQL, here's how to do it:

SELECT cos(2*pi()*random()) * sqrt(-2*ln(random())) AS my_standard_normal_random_variable

Explanation: this consists of sampling the polar coordinates of a 2D Gaussian (the angle follows a Uniform Distribution over [0, 2π], and the CDF of the radius r is given by r -> 1 - exp(-0.5 * r^2) corresponding to an Inverse CDF of q -> sqrt(-2 * ln(1 - q))), then projecting to the x coordinate by using the cosine, which works because the x coordinate of a 2D standard Gaussian is a 1D standard Gaussian. This strategy is known as the Box-Muller transform.

Check: let us look at some aggregate statistics:

WITH my_gaussian_samples AS (
  SELECT cos(2*pi()*random()) * sqrt(-2*ln(random())) AS x
  FROM generate_series(1,1000000)
) SELECT 
  AVG(x) AS mean,
  AVG(x^2) AS standard_deviation, 
  COUNT((x > -1.96 AND x < 1.96) OR NULL)::float/COUNT(*) AS in_95_interval
FROM my_gaussian_samples;

          mean          | standard_deviation | in_95_interval 
------------------------+--------------------+----------------
 -0.0012901731683899918 | 0.9988458311780355 |       0.950221
Valentin Waeselynck
  • 5,950
  • 26
  • 43