
In reading and experimenting with numpy.random, I can't seem to find or create what I need: a 10-parameter Python pseudo-random value generator with count, min, max, mean, sd, 25th percentile, 50th percentile (median), 75th percentile, skew, and kurtosis.

From https://docs.python.org/3/library/random.html I see these distributions: uniform, normal (Gaussian), lognormal, negative exponential, gamma, and beta. However, I need to generate values directly from a distribution defined only by my 10 parameters, with no reference to a distribution family.

Is there documentation of, or an author of, a numpy.random.xxxxxx(n, min, max, mean, sd, 25%, 50%, 75%, skew, kurtosis), or failing that, what is the closest existing source code that I might modify to achieve this goal?

In a way, this would be the reverse of describe() extended with skew and kurtosis. I could run a loop or an optimization on randomly generated numbers until a criterion is met, though that could take an arbitrarily long time to satisfy all 10 parameters.

I have found optim in R, which generates such a data set, but I have so far been unable to extend the parameters in the R optim source code or to duplicate it with Python's scipy.optimize or similar; these still depend on optimization methods instead of directly pseudo-randomly creating a data set according to my 10 parameters, as I need:

m0 <- 20    # target mean
sd0 <- 5    # target standard deviation
min <- 1    # smallest allowed value
max <- 45   # largest allowed value
n <- 15     # sample size
set.seed(1)
mm <- min:max
x0 <- sample(mm, size=n, replace=TRUE)  # random starting data set
# Loss: squared distance from the target mean and sd
objfun <- function(x) {(mean(x)-m0)^2+(sd(x)-sd0)^2}
# Candidate move: replace one randomly chosen entry with a new draw
candfun <- function(x) {x[sample(n, size=1)] <- sample(mm, size=1)
    return(x)}
objfun(x0) ## INITIAL RESULT: 83.93495
o1 <- optim(par=x0, fn=objfun, gr=candfun, method="SANN", control=list(maxit=1e6))
mean(o1$par) ## INITIAL RESULT: 20
sd(o1$par) ## INITIAL RESULT: 5
plot(table(o1$par))
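
(For comparison, a rough Python analogue of this search might look like the sketch below; it uses a simple greedy accept rule rather than R's SANN schedule, and the variable names just mirror the R code above.)

import numpy

rng = numpy.random.default_rng(1)
m0, sd0, lo, hi, n = 20, 5, 1, 45, 15       # targets and bounds, as in the R code
x = rng.integers(lo, hi + 1, size=n)        # random starting data set

def objfun(x):
    # Squared distance from the target mean and (sample) sd
    return (numpy.mean(x) - m0) ** 2 + (numpy.std(x, ddof=1) - sd0) ** 2

best = objfun(x)
for _ in range(100000):
    cand = x.copy()
    cand[rng.integers(n)] = rng.integers(lo, hi + 1)  # mutate one entry
    c = objfun(cand)
    if c < best:                            # greedy acceptance
        x, best = cand, c

print(numpy.mean(x), numpy.std(x, ddof=1))  # close to 20 and 5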
  • Your 10 parameters do not define a probability distribution function. You have to first find a way to define the distribution. – rpoleski Apr 25 '20 at 22:44
  • Thank you Peter for the question improvement and rpoleski for a briefer problem summary. Maybe feedback on my assumptions, which could be faulty, would help here. Humans created and use distribution families (normal, etc.) so our limited capabilities can perceive and categorize PDFs. But Python runs on relatively unlimited machines which "should" be able to generate 10-parameter pseudo-random values without any distribution family references (Parametric Density Estimation). This is my current goal to code in Python. I.e., numpy.random.xxx(n=3,min=4,max=6,mu=5,,,,,,,) pseudo-randomly = [4,5,6]. Ideas? – Thomas Gehrmann Apr 26 '20 at 00:57
  • There is no such function in `numpy.random`. You need to define in some way the probability distribution. The title of your question is not very informative then. Please try to rephrase the question to get the answer. – rpoleski Apr 26 '20 at 07:02
  • Thank you rpoleski. Hopefully I have worded my title better now. I have modified the describe() source file generic.py into describeSK() so it includes skew and kurtosis. Similarly, I have so far unsuccessfully tried to modify an existing definition, or add a new one, in the numpy.random source file __init__.py, which would use my 10 parameters to define how to pseudo-randomly generate values. Any __init__.py or alternative suggestions are eagerly welcomed. – Thomas Gehrmann Apr 26 '20 at 15:54
  • Is there another language/tool that currently does have the ability to use n, min, max, mean, sd, 25%, 50%, 75%, skew, and kurtosis as parameters defining pseudo-random values of a probability density estimate/function? – Thomas Gehrmann Apr 26 '20 at 16:19

1 Answer

The most general way to generate a random number following a distribution is as follows:

  • Generate a uniform random number bounded by 0 and 1 (e.g., numpy.random.random()).
  • Take the inverse CDF (inverse cumulative distribution function) of that number.

The result is a number that follows the distribution.
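
For illustration, here is a minimal sketch of inverse transform sampling, using the exponential distribution only because its inverse CDF has a closed form (the rate parameter lam below is made up):

import numpy

# Inverse transform sampling sketch: the exponential distribution's
# inverse CDF is ICDF(u) = -ln(1 - u) / lam
lam = 2.0                                  # hypothetical rate parameter
u = numpy.random.random(size=100000)       # step 1: uniform values in [0, 1)
samples = -numpy.log(1.0 - u) / lam        # step 2: apply the inverse CDF
print(samples.mean())                      # should be close to 1/lam = 0.5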

In your case, the inverse CDF (ICDF(x)) is already pinned down at five points by five of your parameters: the minimum, maximum, and three percentiles, as follows:

  • ICDF(0) = minimum
  • ICDF(0.25) = 25th percentile
  • ICDF(0.5) = 50th percentile
  • ICDF(0.75) = 75th percentile
  • ICDF(1) = maximum

Thus, you already have some idea of what the inverse CDF looks like. All you have to do now is optimize the inverse CDF for the other four parameters (mean, standard deviation, skewness, and kurtosis). For example, you can "fill in" the inverse CDF at other percentiles and see how well the result matches the parameters you're after. A good starting guess is a linear interpolation of the percentiles just mentioned, as sketched below. Another thing to keep in mind is that an inverse CDF can never decrease.
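
As a minimal sketch of that starting guess (using the five percentile values from the example near the end of this answer; the variable names are mine):

import numpy
from scipy.interpolate import interp1d

# The inverse CDF must pass through these (probability, value) pairs:
probs = [0.0, 0.25, 0.5, 0.75, 1.0]
vals = [39, 116, 147, 186, 401]    # min, 25th, 50th, 75th, max

# Linear interpolation as a starting guess; it is monotonic
# ("never goes down") because the percentile values are sorted.
icdf_guess = interp1d(probs, vals, kind="linear")
print(icdf_guess(numpy.linspace(0, 1, 5)))   # evaluate the guess on a grid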


The following code shows one solution. It performs the following steps:

  • It calculates an initial guess for the inverse CDF by interpolating the five known percentiles (cubic interpolation in the code below). The initial guess consists of the values of that function at chunks + 1 evenly spaced points (101 with the default chunks=100), including the five percentiles mentioned above.
  • It sets up the bounds of the optimization. The optimization is bounded by the minimum and maximum values everywhere except at the five percentiles.
  • It sets up the other four parameters.
  • It then passes the objective function (_lossfunc), the initial guess, the bounds and the other parameters to SciPy's scipy.optimize.minimize method for optimization.
  • Once the optimization finishes, the code checks whether it succeeded; the driver function retries with a finer mesh (up to 10 times) and prints a warning if no attempt converges.
  • If the optimization succeeds, the code post-processes the result (adjust) and re-optimizes, then builds an interpolated inverse CDF from the final values.
  • It generates N uniform random values.
  • It transforms those values with the inverse CDF and returns them.

import scipy.stats.mstats as mst
from scipy.optimize import minimize
from scipy.interpolate import interp1d
import numpy

# Define the loss function, which compares the calculated
# and ideal parameters
def _lossfunc(x, *args):
    # chunks is passed through with the other arguments but not used here
    mean, stdev, skew, kurt, chunks = args
    st = (
        (numpy.mean(x) - mean) ** 2
        + (numpy.sqrt(numpy.var(x)) - stdev) ** 2
        + ((mst.skew(x) - skew)) ** 2
        + ((mst.kurtosis(x) - kurt)) ** 2
    )
    return st

def adjust(rx, percentiles):
    """Post-process the optimized values: restore monotonicity and
    re-pin the five known percentiles."""
    eps = (max(rx) - min(rx)) / (3.0 * len(rx))
    # Make the result monotonic
    for i in range(1, len(rx)):
        if (
            i - 2 >= 0
            and rx[i - 2] < rx[i - 1]
            and rx[i - 1] >= rx[i]
            and rx[i - 2] < rx[i]
        ):
            rx[i - 1] = (rx[i - 2] + rx[i]) / 2.0
        elif rx[i - 1] >= rx[i]:
            rx[i] = rx[i - 1] + eps
    # Constrain the values between each pair of known percentiles
    for pi in range(1, len(percentiles)):
        previ = percentiles[pi - 1][0]
        prev = rx[previ]
        curr = rx[percentiles[pi][0]]
        prevideal = percentiles[pi - 1][1]
        currideal = percentiles[pi][1]
        realrange = max(eps, curr - prev)
        idealrange = max(eps, currideal - prevideal)
        for i in range(previ + 1, percentiles[pi][0]):
            if rx[i] >= currideal or rx[i] <= prevideal:
                # Rescale out-of-range values into the ideal range
                rx[i] = (
                    prevideal
                    + max(eps * (i - previ + 2), rx[i] - prev)
                    * idealrange / realrange
                )
        rx[percentiles[pi][0]] = currideal
    # Make monotonic again, leaving the percentile values fixed
    for pi in range(1, len(percentiles)):
        previ = percentiles[pi - 1][0]
        curri = percentiles[pi][0]
        for i in range(previ + 1, curri + 1):
            if (
                i - 2 >= 0
                and rx[i - 2] < rx[i - 1]
                and rx[i - 1] >= rx[i]
                and rx[i - 2] < rx[i]
                and i - 1 != previ
                and i - 1 != curri
            ):
                rx[i - 1] = (rx[i - 2] + rx[i]) / 2.0
            elif rx[i - 1] >= rx[i] and i != curri:
                rx[i] = rx[i - 1] + eps
    return rx

# Calculates an inverse CDF constrained by the given nine parameters.
def _get_inverse_cdf(mn, p25, p50, p75, mx, mean, stdev, skew, kurt, chunks=100):
    if chunks < 0:
        raise ValueError("chunks must be nonnegative")
    # Minimum of 16 chunks
    chunks = max(16, chunks)
    # Round chunks up to the closest multiple of 4 so the quartile
    # indices fall exactly on grid points
    if chunks % 4 != 0:
        chunks += 4 - (chunks % 4)
    # Initial guess for the inverse CDF: a (cubic) interpolation
    # through the known percentiles, evaluated on an even grid
    interp = interp1d([0, 0.25, 0.5, 0.75, 1.0], [mn, p25, p50, p75, mx], kind="cubic")
    x = interp(numpy.linspace(0, 1, chunks + 1))
    # Bounds, taking percentiles into account
    bounds = [(mn, mx) for i in range(chunks + 1)]
    percentiles = [
        [0, mn],
        [int(chunks * 1 / 4), p25],
        [int(chunks * 2 / 4), p50],
        [int(chunks * 3 / 4), p75],
        [int(chunks), mx],
    ]
    for p in percentiles:
        bounds[p[0]] = (p[1], p[1])
    # Other parameters
    otherParams = (mean, stdev, skew, kurt, chunks)
    # Optimize the result for the given parameters
    # using the initial guess and the bounds
    result = minimize(
        _lossfunc,  # Loss function
        x,  # Initial guess
        otherParams,  # Arguments
        bounds=bounds,
    )
    rx = result.x
    if result.success:
        adjust(rx, percentiles)
        # Minimize again
        result = minimize(
            _lossfunc,  # Loss function
            rx,  # Initial guess
            otherParams,  # Arguments
            bounds=bounds,
        )
        rx = result.x
        adjust(rx, percentiles)
        # Minimize again
        result = minimize(
            _lossfunc,  # Loss function
            rx,  # Initial guess
            otherParams,  # Arguments
            bounds=bounds,
        )
        rx = result.x
    # Build an interpolating function from the optimized values
    ls = numpy.linspace(0, 1, chunks + 1)
    success = result.success
    icdf = interp1d(ls, rx, kind="linear")
    # == Set to True to check the quality of the result
    if False:
        print(numpy.mean(rx) - mean)
        print(numpy.sqrt(numpy.var(rx)) - stdev)
        print(mst.skew(rx) - skew)
        print(mst.kurtosis(rx) - kurt)
        print(icdf(0) - percentiles[0][1])
        print(icdf(0.25) - percentiles[1][1])
        print(icdf(0.5) - percentiles[2][1])
        print(icdf(0.75) - percentiles[3][1])
        print(icdf(1) - percentiles[4][1])
    return (icdf, success)

def random_10params(n, mn, p25, p50, p75, mx, mean, stdev, skew, kurt):
    """Generate n pseudo-random values matching the 10 parameters.

    Note: Kurtosis as used here is Fisher's kurtosis, or kurtosis
    excess. Stdev is the square root of numpy.var()."""
    # Try up to 10 times to get a converging inverse CDF,
    # increasing the mesh each time
    icdf, success = (None, False)
    tries = 0
    chunks = 500
    while tries < 10:
        icdf, success = _get_inverse_cdf(
            mn, p25, p50, p75, mx, mean, stdev, skew, kurt, chunks=chunks
        )
        tries += 1
        chunks += 100
        if success:
            break
    if not success:
        print("Warning: Estimation failed and may be inaccurate")
    # Generate uniform random values
    npr = numpy.random.random(size=n)
    # Transform them with the inverse CDF
    return icdf(npr)

Example:

print(random_10params(n=1000, mn=39, p25=116, p50=147, p75=186, mx=401, mean=154.1207, stdev=52.3257, skew=.7083, kurt=.5383))
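
A quick way to sanity-check one sample against the 10 targets (a sketch; it simply recomputes the summary statistics):

import numpy
import scipy.stats.mstats as mst

vals = random_10params(n=1000, mn=39, p25=116, p50=147, p75=186, mx=401,
                       mean=154.1207, stdev=52.3257, skew=.7083, kurt=.5383)
# Compare the sample's five-number summary and moments to the targets
print(numpy.percentile(vals, [0, 25, 50, 75, 100]))
print(numpy.mean(vals), numpy.sqrt(numpy.var(vals)),
      mst.skew(vals), mst.kurtosis(vals))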

One last note: if you have access to the underlying data points, rather than just their statistics, there are other methods you can use to sample from the distribution those data points form. Examples include kernel density estimation, histograms, and regression models (particularly for time series data). See also Generate random data based on existing data.
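
For instance, if the raw data points were available, a kernel density estimate could be sampled directly; a minimal sketch with a made-up data array:

import numpy
from scipy.stats import gaussian_kde

data = numpy.array([41.0, 96.0, 120.0, 147.0, 150.0, 183.0, 210.0, 350.0])
kde = gaussian_kde(data)           # fit a Gaussian KDE to the raw points
samples = kde.resample(1000)[0]    # draw 1000 new values from the estimate
print(samples.mean(), samples.std())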

Peter O.
  • Thank you again Peter. I had not considered combining a known distribution family tool, then optimizing only the remaining parameters. I will try it out now. – Thomas Gehrmann Apr 26 '20 at 20:18
  • This approach still requires lots of work, and I don't see how one can fulfil all 10 constraints. I don't know R, but it's hard to see how the code in the question can do what's asked. In particular, `objfun` uses only 2 of the 10 parameters. – rpoleski Apr 27 '20 at 06:20
  • Still no complete success, though I remembered some SIMIO code for a discrete distribution that, when entered in cumulative distribution form for SIMIO as "Random.Discrete(1, 0.25, 2, 0.50, 3, 0.75, 4, 1.00)", similar to Peter's ICDF proposal above, can define "a shape" for each quartile. Additionally, I found two articles by Jason Brownlee using ECDF(), though still not quite using all 10 parameters. I will keep trying, though any further suggestions would be much appreciated. https://machinelearningmastery.com/?s=%22parametric+density+estimation%22&post_type=post&submit=Search – Thomas Gehrmann May 01 '20 at 22:12
  • Did you try optimizing for skewness, kurtosis, mean, and standard deviation, keeping the other parameters fixed? (As mentioned before, the code in your question does not appear to do that.) If so, can you edit your question showing what results you're getting when optimizing this way, together with updated code? – Peter O. May 01 '20 at 23:35
  • Also, can you edit your question showing examples of "10-parameter sets" you expect to use? – Peter O. May 01 '20 at 23:42
  • Thank you again Peter. I have not had much success coding CDF(ICDF(x)) above, though I am a novice. This is what I have: x=numpy.random.random(), which output .34624. When I then enter my second line norm.ppf(x), I get an output of -0.39548, though I think I should be expecting a positive number each time. I am able to get all positives and nice plotting when using ecdfdata = ECDF(data) with each print('P(x<25): %.5f' % ecdfdata(25%))......etc as you indicated to do with ICDF above, and I have tried using percents at every 10%+ which may promise closer to target parameters. – Thomas Gehrmann May 03 '20 at 02:41
  • As for optimizing mean, std, skew, and kurtosis I am unsure how to incorporate them into a for loop; for n in range(0,n): nn = float(decimal.Decimal(randrange(min,max))/1) data.append(nn) print(data) Math wise I have; mean = (sum(data))/n, std = (max - min)/4 # roughly, skew = 3*(mean - five)/std, kurt = (4th moment; https://www.macroption.com/kurtosis-formula/) / (((std)^2)^2) Though, even if these are correct, I am unsure how to code this into defining the random values in the for loop, or how to use an opt while loop with the for loop until these 4 params are met. – Thomas Gehrmann May 03 '20 at 02:42
  • I can't thank you enough Peter O. This task and the skill required are above my current level, and my attempts have come nowhere close. The code you suggest seems to work well and fast, so I am busy studying it and practicing with it to better understand how it works, especially mstats and interpolate, which are new to me. As you suggest, I do have access to actual data, which has these actual summary stats: random_10params(n=19876, mn=39, p25=116, p50=147, p75=186, mx=401, mean=154.1207, stdev=52.3257, skew=.7083, kurt=.5383), though I keep hitting "value errors" when I use it so far. ty, T – Thomas Gehrmann May 04 '20 at 04:19
  • Some progress, maybe. I added some libraries and 'numpy.random.seed(1)'. Then I ran the example 10 parameters above repeatedly with 'rand10nums = random_10params(...)', followed by 'rands = pd.Series(rand10nums)' and 'rands.describe(),rands.skew(),rands.kurtosis()'. I get very different describe() output each time, though no errors. Running my actual 10 summary stats as parameters from the above comment, I get a ValueError. I understand now how mstats and interpolate are involved. Any ideas how to get more consistent output with my 10 actual (lognormal) data parameters? Still no distribution families, though. – Thomas Gehrmann May 05 '20 at 02:50
  • ValueError is produced in the code when the optimization fails. Removing or commenting out the line "if not result.success: raise ValueError" will still return a result even if the optimization fails, but the result may be erroneous. – Peter O. May 05 '20 at 03:01
  • Thank you Peter. I will research adjusting the criteria that create the ValueError. – Thomas Gehrmann May 05 '20 at 14:09
  • Taking another look at getting valid results with this code again. I can avoid the ValueError if I only target very small negative value results, though I still can't seem to produce output in the range of my 10 actual target output summary stats; (n=19876, mn=39, p25=116, p50=147, p75=186, mx=401, mean=154.1207, stdev=52.3257, skew=.7083, kurt=.5383). It seems I need to "scale" the code to accommodate this range of 10 target parameters "randomly", though each change I try "breaks" the code. How please must I adapt this code to achieve my 10 target output summary stats above? Thank you. T – Thomas Gehrmann May 27 '20 at 07:35
  • Can you explain why you need a random generation function with precisely those 10 parameters? Can you get away with using fewer parameters, such as just the percentiles and mean? If you have access to the actual data points, then there are better ways to generate random variables that follow that data, such as kernel density estimation, histograms, maximum likelihood estimation, etc. Did you try using these or other ways to generate random data points that follow the underlying data? Many of them may describe the underlying data better than a random generator based on "your 10 parameters". – Peter O. May 28 '20 at 08:33
  • Good questions, which get at the point of my intended test. KDE, histograms, and MLE all depend on a human "guessing" a distribution family first, then training it optimally on the sample. What I ask myself is: why does a machine actually need a distribution family to define a model for simulation, other than that humans tell the machine to use it? Distribution families are human concepts, not machine ones. A machine should be able to create a more accurate model faster, given enough parameters to accurately define the model, without any distribution family references used in the definition. 10 parameters seemed to encompass a complete definition. – Thomas Gehrmann May 29 '20 at 19:19
  • I would actually like for the tool (code/method) to eventually be able to handle positive, negative, small, and large sample and output values, which is why I started testing with my 10 target parameters above. They came from describing a glucose-level data sample by traditional distribution family fitting, training, and testing, then simulation experiments to get even closer by changing one parameter at a time until my random model's 10 describe() summary stats matched my actual sample's 10 summary stats as exactly as I could. Laborsome, which made me think: why do we use distribution families? Human perception. – Thomas Gehrmann May 29 '20 at 19:24
  • I've found the following paper which may be relevant to your question: Hall, P., Presnell, B., "Density Estimation Under Constraints", Journal of Computational and Graphical Statistics 8(2), 1999. Unfortunately, the paper's method seems to be based on the raw data points, while your question appears to be concerned with generating random numbers without reference to data. – Peter O. May 29 '20 at 20:02
  • I have edited the code in this answer to improve the solution. – Peter O. May 29 '20 at 21:44
  • Incredible. I wish I could make coding look this easy! More practice I guess. I will study the difference in the code tonight so I can understand what each new line is doing, then continue testing this weekend to share my results, though some initial quick runs look promising. Thank you very very much, Thomas – Thomas Gehrmann May 29 '20 at 22:55
  • With slight modifying, I compared sum differences from this direct method to traditional distribution fitting over a 100 count loop. This direct method was more accurate for Min, 25%, 50%, and StDev, though not as much for 75%, Max, Mean, Skew, and Kurtosis. Also, "while not success and tries < 10" never uses more than the 1st try. My chunk coding skill is improving, though my modifications seem to have the opposite impact than expected. What please are ways I might allow more tries or "space" for 75%, Max, Mean, Skew, and Kurtosis optimization, without increasing randomness subjectivity? T – Thomas Gehrmann Jun 01 '20 at 05:24