Discretization of probability array in Python

Question

I have a numpy array (actually imported from a GIS raster map) which contains probability values of occurrence of a species like following example:

a = random.randint(1.0,20.0,1200).reshape(40,30)
b = (a*1.0)/sum(a)

Now I want to get a discrete version for that array again. Like if I have e.g. 100 individuals which are located on the area of that array (1200 cells) how are they distributed? Of course they should be distributed according to their probability, meaning lower values indicated lower probability of occurrence. However, as everything is statistics there is still the chance that a individual is located at a low probability cell. It should be possible that multiple individuals can occupy on cell...

It is like transforming a continuous distribution curve into a histogram again. Like many different histograms may result in a certain distribution curve it should also be the other way round. Accordingly applying the algorithm I am looking for will produce different discrete values each time.

...is there any algorithm in python which can do that? As I am not that familiar with discretization maybe someone can help.

"*there is still the chance that a individual is located at a low probability cell.*" - this means you want another random function. How random should it be? Without random function you get always the same result for the same given input. — eumiro, Jul 25 '12 at 13:07

ecatmur · Accepted Answer · 2012-07-25T13:24:18.870

3

Use random.choice with bincount:

np.bincount(np.random.choice(b.size, 100, p=b.flat),
            minlength=b.size).reshape(b.shape)

If you don't have NumPy 1.7, you can replace random.choice with:

np.searchsorted(np.cumsum(b), np.random.random(100))

giving:

np.bincount(np.searchsorted(np.cumsum(b), np.random.random(100)),
            minlength=b.size).reshape(b.shape)

edited Jul 25 '12 at 13:24

answered Jul 25 '12 at 13:18

ecatmur

152,476
27
293
366

I am not that familiar with python yet, but I can't get the module (random.choice) working: >>> import numpy >>> numpy.version.version '1.6.2' >>> numpy.random.choice() Traceback (most recent call last): File "", line 1, in numpy.random.choice() AttributeError: 'module' object has no attribute 'choice' – Johannes Jul 25 '12 at 13:38
@Johannes `random.choice` was added in NumPy 1.7; which version do you have? Try my alternative solution if you have 1.6 (check `np.version.version`). – ecatmur Jul 25 '12 at 13:46

score 2 · Answer 2 · answered Aug 06 '12 at 11:39

So far I think ecatmur's answer seems quite reasonable and simple.

I just want to add may a more "applied" example. Considering a dice with 6 faces (6 numbers). Each number/result has a probability of 1/6. Displaying the dice in form of an array could look like:

b = np.array([[1,1,1],[1,1,1]])/6.0

Thus rolling the dice 100 times (n=100) results in following simulation:

np.bincount(np.searchsorted(np.cumsum(b), np.random.random(n)),minlength=b.size).reshape(b.shape)

I think that can be an appropriate approach for such an application. Thus thank you ecatmur for your help!

/Johannes

I am still not sure if it is correct to use that procedure to simulate e.g. 100 dice rolls? — Johannes, Aug 08 '12 at 11:01

score 1 · Answer 3 · edited May 23 '17 at 12:14

this is similar to my question i had earlier this month.

import random
def RandFloats(Size):
    Scalar = 1.0
    VectorSize = Size
    RandomVector = [random.random() for i in range(VectorSize)]
    RandomVectorSum = sum(RandomVector)
    RandomVector = [Scalar*i/RandomVectorSum for i in RandomVector]
    return RandomVector

from numpy.random import multinomial
import math
def RandIntVec(ListSize, ListSumValue, Distribution='Normal'):
    """
    Inputs:
    ListSize = the size of the list to return
    ListSumValue = The sum of list values
    Distribution = can be 'uniform' for uniform distribution, 'normal' for a normal distribution ~ N(0,1) with +/- 5 sigma  (default), or a list of size 'ListSize' or 'ListSize - 1' for an empirical (arbitrary) distribution. Probabilities of each of the p different outcomes. These should sum to 1 (however, the last element is always assumed to account for the remaining probability, as long as sum(pvals[:-1]) <= 1).  
    Output:
    A list of random integers of length 'ListSize' whose sum is 'ListSumValue'.
    """
    if type(Distribution) == list:
        DistributionSize = len(Distribution)
        if ListSize == DistributionSize or (ListSize-1) == DistributionSize:
            Values = multinomial(ListSumValue,Distribution,size=1)
            OutputValue = Values[0]
    elif Distribution.lower() == 'uniform': #I do not recommend this!!!! I see that it is not as random (at least on my computer) as I had hoped
        UniformDistro = [1/ListSize for i in range(ListSize)]
        Values = multinomial(ListSumValue,UniformDistro,size=1)
        OutputValue = Values[0]
    elif Distribution.lower() == 'normal':
        """
            Normal Distribution Construction....It's very flexible and hideous
            Assume a +-3 sigma range.  Warning, this may or may not be a suitable range for your implementation!
            If one wishes to explore a different range, then changes the LowSigma and HighSigma values
            """
            LowSigma    = -3#-3 sigma
            HighSigma   = 3#+3 sigma
            StepSize    = 1/(float(ListSize) - 1)
            ZValues     = [(LowSigma * (1-i*StepSize) +(i*StepSize)*HighSigma) for i in range(int(ListSize))]
            #Construction parameters for N(Mean,Variance) - Default is N(0,1)
            Mean        = 0
            Var         = 1
            #NormalDistro= [self.NormalDistributionFunction(Mean, Var, x) for x in ZValues]
            NormalDistro= list()
            for i in range(len(ZValues)):
                if i==0:
                    ERFCVAL = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                    NormalDistro.append(ERFCVAL)
                elif i ==  len(ZValues) - 1:
                    ERFCVAL = NormalDistro[0]
                    NormalDistro.append(ERFCVAL)
                else:
                    ERFCVAL1 = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                    ERFCVAL2 = 0.5 * math.erfc(-ZValues[i-1]/math.sqrt(2))
                    ERFCVAL = ERFCVAL1 - ERFCVAL2
                    NormalDistro.append(ERFCVAL)  
            #print "Normal Distribution sum = %f"%sum(NormalDistro)
            Values = multinomial(ListSumValue,NormalDistro,size=1)
            OutputValue = Values[0]
        else:
            raise ValueError ('Cannot create desired vector')
        return OutputValue
    else:
        raise ValueError ('Cannot create desired vector')
    return OutputValue

ProbabilityDistibution = RandFloats(1200)#This is your probability distribution for your 1200 cell array
SizeDistribution = RandIntVec(1200,100,Distribution=ProbabilityDistribution)#for a 1200 cell array, whose sum is 100 with given probability distribution

The two main lines that are important are the last two lines in the code above

Discretization of probability array in Python

3 Answers3