Estimating the probability density of sum of uniform random variables in python

Question

I have two random variables X and Y which are uniformly distributed on the simplex:

I want to evaluate the density of their sum:

After evaluating the above integral, my final goal is to compute the following integral:

To compute the first integral, I am generating uniformly distributed points in simplex and then checking if they belong to the desired region in the above integral and taking the fraction of points to evaluate the above density.

Once I compute the above density I am following a similar procedure to compute the above logarithm integral to compute its value. However, this has been extremely inefficient and taking a lot of time such 3-4 hours. Can anyone suggest me an efficient way to solve this in Python? I am using Numpy package.

Here is the code

import numpy as np
import math
import random
import numpy.random as nprnd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
#This function checks if the point x lies the simplex and the negative simplex shifted by z
def InreqSumSimplex(x,z):
    dim=len(x)
    testShiftSimpl= all(z[i]-1 <= x[i] <= z[i] for i in range(0,dim)) and (sum(x) >= sum(z)-1)
    return int(testShiftSimpl)

def InreqDiffSimplex(x,z):
    dim=len(x)
    testShiftSimpl= all(z[i] <= x[i] <= z[i]+1 for i in range(0,dim)) and (sum(x) <= sum(z)+1)
    return int(testShiftSimpl)
#This is for the density X+Y
def DensityEvalSum(z,UniformCube):
    dim=len(z)
    Sum=0
    for gen in UniformCube:
        Exponential=[-math.log(i) for i in gen] #This is exponentially distributed
        x=[i/sum(Exponential) for i in Exponential[0:dim]] #x is now uniformly distributed on simplex

        Sum+=InreqSumSimplex(x,z)

    Sum=Sum/numsample

    FunVal=(math.factorial(dim))*Sum;
    if FunVal<0.00001:
        return 0.0
    else:
        return -math.log(FunVal)
#This is for the density X-Y
def DensityEvalDiff(z,UniformCube):
    dim=len(z)
    Sum=0
    for gen in UniformCube:
        Exponential=[-math.log(i) for i in gen]
        x=[i/sum(Exponential) for i in Exponential[0:dim]]

    Sum+=InreqDiffSimplex(x,z)

    Sum=Sum/numsample

    FunVal=(math.factorial(dim))*Sum;
    if FunVal<0.00001:
        return 0.0
    else:
        return -math.log(FunVal)
def EntropyRatio(dim):    
    UniformCube1=np.random.random((numsample,dim+1)); 
    UniformCube2=np.random.random((numsample,dim+1))

    IntegralSum=0; IntegralDiff=0

    for gen1,gen2 in zip(UniformCube1,UniformCube2):

        Expo1=[-math.log(i) for i in gen1];        Expo2=[-math.log(i) for i in gen2]

        Sumz=[ (i/sum(Expo1)) + j/sum(Expo2) for i,j in zip(Expo1[0:dim],Expo2[0:dim])] #Sumz is now disbtributed as X+Y

        Diffz=[ (i/sum(Expo1)) - j/sum(Expo2) for i,j in zip(Expo1[0:dim],Expo2[0:dim])] #Diffz is now distributed as X-Y

    UniformCube=np.random.random((numsample,dim+1))

    IntegralSum+=DensityEvalSum(Sumz,UniformCube) ; IntegralDiff+=DensityEvalDiff(Diffz,UniformCube)

    IntegralSum= IntegralSum/numsample; IntegralDiff=IntegralDiff/numsample

    return ( (IntegralDiff +math.log(math.factorial(dim)))/ ((IntegralSum +math.log(math.factorial(dim)))) )

Maxdim=11
dimlist=range(2,Maxdim)
Ratio=len(dimlist)*[0]
numsample=10000

for i in range(len(dimlist)):
    Ratio[i]=EntropyRatio(dimlist[i])

@MarkDickinson: I am actually interested in higher values of n, like upto 100,200 etc. But I need to graph all the values starting from n=2 till 200. That's why I want to make it efficient. — pikachuchameleon, Feb 14 '16 at 17:04
@MaxNoe: It's around 100 lines of python code. How do I upload code? — pikachuchameleon, Feb 14 '16 at 17:04
Did you profile the code? What actually is taking so long? You could use the `profilehooks` module for this. — MaxNoe, Feb 14 '16 at 17:10
You can just post the code in your question. It won't get too big. — MaxNoe, Feb 14 '16 at 17:11
@MaxNoe: I added the code. Let me know if I have to add further comments. — pikachuchameleon, Feb 14 '16 at 17:39
I'd expect that you should be able to evaluate $f_{X + Y}$ analytically. But that would be math question rather than a programming one. I'd also note that you seem to be generating samples in a uniform cube rather than directly in a simplex; that's going to make things very slow for large `n`. Why not generate points directly from the simplices instead? — Mark Dickinson, Feb 14 '16 at 18:10
@MarkDickinson: Generating points in simplices is directly related to generation in hyper cube. If U_1,...,U_n+1 are uniform points in the cube, then for E_i= -log(U_i), (E_1/E_1+...+E_n+1 , ...., E_n/E_1+...+E_n+1) be a uniform point in the n-simplex. Do you know any other method where I can generate points directly form simplex ? — pikachuchameleon, Feb 14 '16 at 18:26
Yes: generate `x1, ..., xn`, i.i.d. variables uniformly distributed in [0, 1], sort them so that `x1 <= x2 <= x3 <= ...`, then `x1, x2-x1, x3-x2, ..., xn-x(n-1)` is uniformly distributed on the n-dimensional simplex. I don't believe that the method you suggest generates a uniform distribution. (Try a scatter plot for `n=2`, and see what you get.) — Mark Dickinson, Feb 14 '16 at 18:45
Okay, the analytic solution I had in mind involves a computation of complexity `2^n`. Feasible for `n < 20` or so, not so much for `n = 100`. I'll butt out now. — Mark Dickinson, Feb 14 '16 at 19:39
@MarkDickinson yes, generating via `-log(U_i); normalize` is a valid method to generate simplex, coming from gamma-variate sampling for general Dirichlet distribution — Severin Pappadeux, Feb 14 '16 at 20:37
@SeverinPappadeux: Thanks. I learned something new today. :-) — Mark Dickinson, Feb 14 '16 at 20:38

score 2 · Answer 1 · edited May 23 '17 at 12:19

Not sure it is an answer to your question, but let's start

First, here is some code samples and discussion how to properly sample from Dirichlet(n) (a.k.a. simplex), via gammavariate() or via -log(U) as you did but with proper handle for potential corner case, link

Problem with your code as I can see is that, say, for sampling dimension=2 simplex you're getting three (!) uniform numbers, but skipping one when doing list comprehension for x. This is wrong. To sample n-dimension Dirichlet you should get exactly n U(0,1) and transform then (or n samples from gammavariate).

But, best solution might be just use numpy.random.dirichlet(), it is written in C and might be fastest of all, see link.

Last one, in my humble opinion, you're not properly estimating log(PDF(X+Z)). Ok, you find some are, but what is PDF(X+Z) at this point?

Does this

testShiftSimpl= all(z[i]-1 <= x[i] <= z[i] for i in range(0,dim)) and (sum(x) >= sum(z)-1)
return int(testShiftSimpl)

looks like PDF? How did you managed to get it?

Simple test: integration of PDF(X+Z) over whole X+Z area. Did it produce 1?

UPDATE

Looks like we might have different ideas what we call simplex, Dirichlet etc. I'm pretty much along with this definition, where in d dim space we have d points and d-1 simplex is convex hull connecting vertices. Simplex dimension is always one less than space due to relation between coordinates. In simplest case, d=2, 1-simplex is a segment connecting points (1,0) and (0,1), and from Dirichlet distribution I've got the picture

In the case of d=3 and 2-simplex we have triangle connecting points (1,0,0), (0,1,0) and (0,0,1)

Code, Python

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

import math
import random

def simplex_sampling(d):
    """
    Sample one d-dim point from Dirichet distribution
    """
    r = []
    sum = 0.0

    for k in range(0, d):
        x = random.random()
        if x == 0.0:
            return make_corner_sample(d, k)

        t = -math.log(x)
        r.append(t)
        sum += t

    norm = 1.0 / sum

    for k in range(0, d):
        r[k] *= norm

    return r

def make_corner_sample(d, k):
    """
    U(0,1) number k is zero, it is a corner point in simplex
    """
    r = []
    for i in range(0, d):
        if i == k:
            r.append(1.0)
        else:
            r.append(0.0)

    return r

N = 500 # numer of points to plot
d = 3   # dimension of the space, 2 or 3

x = []
y = []
z = []

for k in range(0, N):
    pt = simplex_sampling(d)

    x.append(pt[0])
    y.append(pt[1])
    if d > 2:
        z.append(pt[2])

if d == 2:
    plt.scatter(x, y, alpha=0.1)
else:
    fig = plt.figure()
    ax  = fig.add_subplot(111, projection='3d')
    ax.scatter(x, y, z, alpha=0.1)

    ax.set_xlabel('X Label')
    ax.set_ylabel('Y Label')
    ax.set_zlabel('Z Label')

plt.show()

The above condition makes sure that z-x lies in the simplex region which is what we require for density evaluation. So I'm counting fraction of points in simplex that satisfy above condition which is an estimate of the pdf. — pikachuchameleon, Feb 15 '16 at 17:26
Also for generation of points inside simplex, I am not using the Dirichlet distribution procedure as you pointed out. But my procedure is that if U1,...,U_n+1 are exponentially distributed with rate 1, then (U1/U_1+..U_n+1,....., U_n/U_1+....+U_n+1) is uniform on simplex. That's why I am skipping one during list comprehension. — pikachuchameleon, Feb 15 '16 at 17:30

Estimating the probability density of sum of uniform random variables in python

1 Answers1