Beta and lognorm distributions in PostgreSQL?

Question

I'm running a fairly large Monte Carlo simulation currently in code, and performance leaves something to be desired.

I'm wondering if there is a way to run it directly on the database, where I figure performance would be much better. I can generate random numbers, but I didn't see statistical distribution functions.

The first step that would already help me a lot is:

I have one table of parameters where each row is one beta distribution with all its parameters. I'd like to generate random values with these distribution parameters and store them in a seperate table (the Monte Carlo simulation table, one row per simulation run).

How do I go about this?

I think the only things Postgres provides are `random()` (i.e. U\[0,1)), and `normal_rand()` in the [`tablefunc`](https://www.postgresql.org/docs/current/tablefunc.html) module, and I couldn't find any extensions which provide beta or log-normal generators. There are various algorithms for transforming distributions, so you could implement one of these in a PL/pgSQL function, though performance might not be great. If you have admin access, you can also write functions in other languages (C, Perl, Python, PHP, R, JavaScript, Java...) and call some third-party library. — Nick Barnes, Dec 12 '18 at 10:59
Have you heard about [Inverse Transform Sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling)? — jlandercy, Dec 12 '18 at 19:47
It helps a lot. I won't have time to include it in the code until the xmas holidays, but it is very much what I was looking for. — Tom, Dec 12 '18 at 21:48
I have finalized my question. Unfortunately it seems not possible to use Inverse Transform Sampling for Beta distribution. Let me know if I have sufficiently addressed you question. — jlandercy, Dec 13 '18 at 11:33

jlandercy · Answer 1 · 2018-12-14T14:52:34.060

Methodology

As you pointed out, PostgreSQL is able to generate Uniform distribution using the random() function.

A general answer to this kind question is Inverse Transform Sampling. And the limitations of this methodology are:

The ability to construct explicitly the Quantile Function (aka PPF) which can be defined as the Inverse Function of an Improper Integral: PPF(u) = CDF^(-1)(u) | u = CDF(x) = int(PDF(x), x=(-infinty,x));
The existence of PostgreSQL mathematical functions required to contruct the Quantile Function.

That is, provided the Quantile Function is explicit and we are able to construct it with PostgreSQL mathematical functions then we can create a Pseudo Random Generator for a specific distribution using random() as Uniform PRG.

Simple example: Exponential

Inverse Transform Sampling works well for Exponential Distribution:

CREATE OR REPLACE FUNCTION expon(N INTEGER, l FLOAT = 1)
RETURNS SETOF FLOAT AS
$BODY$
SELECT
    -(1/l)*ln(1 - random())
FROM
    generate_series(1, N) AS i;
$BODY$
LANGUAGE SQL;

This function generates N samples drawn from an Exponential Distribution of parameter l.

Lognormal

For Lognormal distribution the Quantile Function relies on the Error Function which is not implemented in PostgreSQL. Thus we need to implement the missing function (which is an integral, not impossible using WINDOWING functions but probably not the best idea) or find another way.

Fortunately, we can generate Normal Distribution samples using the Box-Muller Transform:

CREATE OR REPLACE FUNCTION norm(N INTEGER, mu FLOAT = 0, sigma FLOAT = 1)
RETURNS SETOF FLOAT AS
$BODY$
SELECT
    sigma*sqrt(-2.*ln(random()))*cos(2*pi()*random()) + mu
FROM
    generate_series(1, N) AS i;
$BODY$
LANGUAGE SQL;

The following call:

SELECT norm(10000);

Gives:

And MLE returns (mu=0.021131501222537274, sigma=1.0042820700537662) not too bad, we may be on the good track.

Then we can take the exponential of this function:

CREATE OR REPLACE FUNCTION lognorm(N INTEGER, mu FLOAT = 0, sigma FLOAT = 1)
RETURNS SETOF FLOAT AS
$BODY$
SELECT
    exp(x)
FROM
    norm(N, mu, sigma) AS x;
$BODY$
LANGUAGE SQL;

And we have a PRG for Lognormal distribution.

The following call:

SELECT lognorm(10000);

Also gives acceptable result:

MLE returns (sigma=0.9996878296400589, loc=0.0, exp(mu)=1.0002728392916154).

Numerical Integration & Error Function

Although it might be not performant, it is quite easy to estimate the Error Function with PostgreSQL using Trapezoid Rule. Thought it is a naive implementation:

CREATE OR REPLACE FUNCTION erf(x FLOAT, dx NUMERIC = 1e-3)
RETURNS FLOAT AS
$BODY$
WITH
D AS (
SELECT
    y::FLOAT,
    exp(-((y::FLOAT)^2)) AS fx0,
    LEAD(exp(-((y::FLOAT)^2))) OVER(ORDER BY y) AS fx1
FROM
    generate_series(0, x::NUMERIC, dx) AS y
)
SELECT
    COALESCE((2/sqrt(pi()))*SUM((D.fx1 + D.fx0)*dx::FLOAT/2), 0.)
FROM D;
$BODY$
LANGUAGE SQL IMMUTABLE;

And if we compare the result with exact form (Python, scipy), it looks not too bad we get at least 6 significant figures:

      x      psql     scipy        errabs        errrel
0   0.0  0.000000  0.000000  0.000000e+00           NaN
5   0.5  0.520500  0.520500 -7.323189e-08 -1.406953e-07
10  1.0  0.842701  0.842701 -6.918458e-08 -8.209863e-08
15  1.5  0.966105  0.966105 -2.973257e-08 -3.077571e-08
20  2.0  0.995322  0.995322 -6.888995e-09 -6.921371e-09
25  2.5  0.999593  0.999593 -9.076190e-10 -9.079885e-10
30  3.0  0.999978  0.999978 -6.962642e-11 -6.962795e-11
35  3.5  0.999999  0.999999 -3.149592e-12 -3.149594e-12
40  4.0  1.000000  1.000000 -8.404388e-14 -8.404388e-14
45  4.5  1.000000  1.000000  1.110223e-16  1.110223e-16
50  5.0  1.000000  1.000000  2.442491e-15  2.442491e-15

So we could use erf function to perform Inverse Transform Sampling for Normal and Lognormal as we did for Exponential, but I might be a bad idea. It should perform poorly due to Algorithm Complexity and Integration Inaccuracies.

Beta

Unfortunately the Inverse Transform Sampling does not seem to suit for Beta Distribution because the Quantile Function is not expressible as a simple function: it requires to get the inverse of the Regularized Incomplete Beta Function. I don't know if it is possible: Wikipedia does not have Quantile Function referenced for Beta distribution.

For this case you may need to compile the function in some Programming Language (such as C/C++) and bind it to a PostgreSQL function as @Nick Barnes suggested in his comment.

Technical Considerations

As @Nick Barnes pointed out in his comments:

Function using random() are not IMMUTABLE (they are VOLATILE the default) because they change seed value of the PostgreSQL PRG;
Current implementations presented here are naive, they do not handle edge cases such as ln(0.);
Functions in LANGUAGE SQL usually perform well (although we must consider their complexities and their convergences);
Returning SETOF FLOAT is better than using FLOAT[] and avoid the need to unnest(), as I did in previous versions of the SQL Functions;
Limiting cast such as ::FLOAT whenever it is possible;
There is a function pi() no need to evaluate it with 2.*acos(0.).

Really nice answer! I didn't know there was a simple closed-form transformation (I remember doing this kind of thing with [rejection sampling](https://en.wikipedia.org/wiki/Rejection_sampling), so I was expecting some messy procedural loops; anything you can do in plain SQL should perform very well). — Nick Barnes, Dec 13 '18 at 10:51