6

Consider the set of non-decreasing surjective (onto) functions from (-inf,inf) to [0,1]. (Typical CDFs satisfy this property.) In other words, for any real number x, 0 <= f(x) <= 1. The logistic function is perhaps the most well-known example.

We are now given some constraints in the form of a list of x-values and for each x-value, a pair of y-values that the function must lie between. We can represent that as a list of {x,ymin,ymax} triples such as

constraints = {{0, 0, 0}, {1, 0.00311936, 0.00416369}, {2, 0.0847077, 0.109064}, 
 {3, 0.272142, 0.354692}, {4, 0.53198, 0.646113}, {5, 0.623413, 0.743102}, 
 {6, 0.744714, 0.905966}}

Graphically that looks like this:

constraints on a cdf
(source: yootles.com)

We now seek a curve that respects those constraints. For example:

fitted cdf
(source: yootles.com)

Let's first try a simple interpolation through the midpoints of the constraints:

mids = ({#1, Mean[{#2,#3}]}&) @@@ constraints
f = Interpolation[mids, InterpolationOrder->0]

Plotted, f looks like this:

interpolated cdf
(source: yootles.com)

That function is not surjective. Also, we'd like it to be smoother. We can increase the interpolation order but now it violates the constraint that its range is [0,1]:

interpolated cdf with higher interpolation order
(source: yootles.com)

The goal, then, is to find the smoothest function that satisfies the constraints:

  1. Non-decreasing.
  2. Tends to 0 as x approaches negative infinity and tends to 1 as x approaches infinity.
  3. Passes through a given list of y-error-bars.

The first example I plotted above seems to be a good candidate but I did that with Mathematica's FindFit function assuming a lognormal CDF. That works well in this specific example but in general there need not be a lognormal CDF that satisfies the constraints.

Glorfindel
  • 21,988
  • 13
  • 81
  • 109
dreeves
  • 26,430
  • 45
  • 154
  • 229
  • 1
    I guess a CDF is a Cumulative Distribution Function. – brainjam Apr 24 '10 at 00:36
  • Yes, thank you. Though the question is more general -- a CDF is just a function that has range [0,1] and is non-decreasing. – dreeves Apr 24 '10 at 00:54
  • I cleaned up the question a bunch; thanks everyone! – dreeves Apr 24 '10 at 04:30
  • 2
    Because you have ranges I feel there is a way to do this with a finite sum of arctan functions (which would of course be infinitely differentiable). If someone else wants to run with this answer feel free; otherwise I'll see if inspiration strikes tomorrow. While such an answer would strictly speaking be smoother than the cubic splines, it may satisfy the word of smoothness and not the spirit. That is it will look crassly stepwise. – Steven Noble Apr 24 '10 at 06:45
  • 1
    I think you are missing an important point. The error bars are indicators of certainty and by just taking the mean value A as the point to be fit you are throwing away a lot of information. The fitting algorithm will treat deviation from A and B (another mean value) on equal basis, even if the error for B is a much larger than for A. E.g. your points at x=0,1,2 should be given much higher priority to be "on the curve" than the other points. Most Mathematica fitting functions have the option Weights which I would use to assign the inverse of the error interval to each mean point as weight. – Timo Apr 24 '10 at 10:04
  • 1
    Damns! Read your post again and they are not error bars but constraints. – Timo Apr 24 '10 at 10:07
  • Don't feel too bad @Timo, part of a solution might be to incorporate the constraints as weights, going from 0 at the mid-point to some large number (possibly infinity) at the end-points. – High Performance Mark Apr 24 '10 at 11:49
  • 3
    I think I'll pass on providing a full write up of my arctan solution simply because I quite like ~unutbu's interpolation solution (particularly with the transform). You may want to check out http://blog.noblemail.ca/2009/03/improved-smooth-curves-in-javascript.html where I've implemented some locally monotonic splines like this in js. Perhaps it will inspire some mathematica code. – Steven Noble Apr 24 '10 at 21:17
  • Steven, that looks really well done. Thanks for the pointer. I agree about unutbu's solution. I posted a solution of my own in the meantime -- probably not as good as unutbu's. – dreeves Apr 24 '10 at 22:08

3 Answers3

5

I don't think you've specified enough criteria to make the desired CDF unique.

If the only criteria that must hold is:

  1. CDF must be "fairly smooth" (see below)
  2. CDF must be non-decreasing
  3. CDF must pass through the "error bar" y-intervals
  4. CDF must tend toward 0 as x --> -Infinity
  5. CDF must tend toward 1 as x --> Infinity.

then perhaps you could use Monotone Cubic Interpolation. This will give you a C^2 (twice continously differentiable) function which, unlike cubic splines, is guaranteed to be monotone when given monotone data.

This leaves open the question, exactly what data should you use to generate the monotone cubic interpolation. If you take the center point (mean) of each error bar, are you guaranteed that the resulting data points are monotonically increasing? If not, you might as well make some arbitrary choice to guarantee that the points you select are monotonically increasing (because the criteria does not force our solution to be unique).

Now what to do about the last data point? Is there an X which is guaranteed to be larger than any x in the constraints data set? Perhaps you can again make an arbitrary choice of convenience and pick some very large X and put (X,1) as the final data point.

Comment 1: Your problem can be broken into 2 sub-problems:

  1. Given exact points (x_i,y_i) through which the CDF must pass, how do you generate CDF? I suspect there are infinitely many possible solutions, even with the infinite-smoothness constraint.

  2. Given y-errorbars, how should you pick (x_i,y_i)? Again, there infinitely many possible solutions. Some additional criteria may need to be added to force a unique choice. Additional criteria would also probably make the problem even harder than it currently is.

Comment 2: Here is a way to use monotonic cubic interpolation, and satisfy criteria 4 and 5:

The monotonic cubic interpolation (let's call it f) maps R --> R.

Let CDF(x) = exp(-exp(f(x))). Then CDF: R --> (0,1). If we could find the appropriate f, then by defining CDF this way, we could satisfy criteria 4 and 5.

To find f, transform the CDF constraints (x_0,y_0),...,(x_n,y_n) using the transformation xhat_i = x_i, yhat_i = log(-log(y_i)). This is the inverse of the CDF transformation. If the y_i's were increasing, then the yhat_i's are decreasing.

Now apply monotone cubic interpolation to the (x_hat,y_hat) data points to generate f. Then finally, define CDF(x) = exp(-exp(f(x))). This will be a monotonically increasing function from R --> (0,1), which passes through the points (x_i,y_i).

This, I think, satisfies all the criteria 2--5. Criteria 1 is somewhat satisfied, though there certainly could exist smoother solutions.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Thanks for this answer! I didn't know about Monotone Cubic Interpolation. If anyone knows of a Mathematica implementation of that, please post it as an answer. You're right that the criteria do not uniquely determine a CDF; thanks for clarifying what the criteria are! (Btw, your #4 is actually subsumed by #3 -- note the first degenerate error bar in my list of example constraints.) – dreeves Apr 24 '10 at 03:38
  • As for your proposed hack for criterion 5: I tried that with Mathematica's normal cubic interpolation (not enforcing monotonicity, though that didn't seem to be an issue) and didn't find it satisfying. I don't want the interpolating function to exceed 1 for any X. True, X can be made big enough that it doesn't matter in practice, but then the function doesn't go to 1 quickly enough. – dreeves Apr 24 '10 at 03:38
  • What if we stated the problem as "find the smoothest possible non-decreasing function from (-inf,inf) to [0,1] such that it passes through a given list of error bars"? Maybe I'll edit the question to put it that way. – dreeves Apr 24 '10 at 03:45
  • I've now reworked the question, thanks in part to your answer. – dreeves Apr 24 '10 at 04:31
  • Wow, your transformation idea in Comment 2 is really clever. I hadn't seen that when I wrote my solution (now posted as an answer and perhaps sufficient for my immediate purposes). Thanks again for all the help on this! – dreeves Apr 24 '10 at 21:38
4

I have found a solution that gives reasonable results for a variety of inputs. I start by fitting a model -- once to the low ends of the constraints, and again to the high ends. I'll refer to the mean of these two fitted functions as the "ideal function". I use this ideal function to extrapolate to the left and to the right of where the constraints end, as well as to interpolate between any gaps in the constraints. I compute values for the ideal function at regular intervals, including all the constraints, from where the function is nearly zero on the left to where it's nearly one on the right. At the constraints, I clip these values as necessary to satisfy the constraints. Finally, I construct an interpolating function that goes through these values.

My Mathematica implementation follows.
First, a couple helper functions:

(* Distance from x to the nearest member of list l. *)
listdist[x_, l_List] := Min[Abs[x - #] & /@ l]

(* Return a value x for the variable var such that expr/.var->x is at least (or
   at most, if dir is -1) t. *)
invertish[expr_, var_, t_, dir_:1] := Module[{x = dir},
  While[dir*(expr /. var -> x) < dir*t, x *= 2];
  x]

And here's the main function:

(* Return a non-decreasing interpolating function that maps from the
   reals to [0,1] and that is as close as possible to expr[var] without
   violating the given constraints (a list of {x,ymin,ymax} triples).
   The model, expr, will have free parameters, params, so first do a
   model fit to choose the parameters to satisfy the constraints as well
   as possible. *)
cfit[constraints_, expr_, params_, var_] := 
Block[{xlist,bots,tops,loparams,hiparams,lofit,hifit,xmin,xmax,gap,aug,bests},
  xlist = First /@ constraints;
  bots = Most /@ constraints; (* bottom points of the constraints *)
  tops = constraints /. {x_, _, ymax_} -> {x, ymax};
  (* fit a model to the lower bounds of the constraints, and 
     to the upper bounds *)
  loparams = FindFit[bots, expr, params, var];
  hiparams = FindFit[tops, expr, params, var];
  lofit[z_] = (expr /. loparams /. var -> z);
  hifit[z_] = (expr /. hiparams /. var -> z);
  (* find x-values where the fitted function is very close to 0 and to 1 *)
  {xmin, xmax} = {
    Min@Append[xlist, invertish[expr /. hiparams, var, 10^-6, -1]],
    Max@Append[xlist, invertish[expr /. loparams, var, 1-10^-6]]};
  (* the smallest gap between x-values in constraints *)
  gap = Min[(#2 - #1 &) @@@ Partition[Sort[xlist], 2, 1]];
  (* augment the constraints to fill in any gaps and extrapolate so there are 
     constraints everywhere from where the function is almost 0 to where it's 
     almost 1 *)
  aug = SortBy[Join[constraints, Select[Table[{x, lofit[x], hifit[x]}, 
                                              {x, xmin,xmax, gap}], 
                                        listdist[#[[1]],xlist]>gap&]], First];
  (* pick a y-value from each constraint that is as close as possible to 
     the mean of lofit and hifit *)
  bests = ({#1, Clip[(lofit[#1] + hifit[#1])/2, {#2, #3}]} &) @@@ aug;
  Interpolation[bests, InterpolationOrder -> 3]]

For example, we can fit to a lognormal, normal, or logistic function:

g1 = cfit[constraints, CDF[LogNormalDistribution[mu,sigma], z], {mu,sigma}, z]
g2 = cfit[constraints, CDF[NormalDistribution[mu,sigma], z], {mu,sigma}, z]
g3 = cfit[constraints, 1/(1 + c*Exp[-k*z]), {c,k}, z]

Here's what those look like for my original list of example constraints:

constrained fit to lognormal, normal, and logistic function
(source: yootles.com)

The normal and logistic are nearly on top of each other and the lognormal is the blue curve.

These are not quite perfect. In particular, they aren't quite monotone. Here's a plot of the derivatives:

Plot[{g1'[x], g2'[x], g3'[x]}, {x, 0, 10}]

the derivatives of the fitted functions
(source: yootles.com)

That reveals some lack of smoothness as well as the slight non-monotonicity near zero. I welcome improvements on this solution!

Community
  • 1
  • 1
dreeves
  • 26,430
  • 45
  • 154
  • 229
  • I noticed the obvious smoothness problems go away when I use the option Method->"Spline" in the Interpolation call. I don't yet know how to fix the monotonicity problem though. – dreeves Apr 24 '10 at 21:58
  • Actually, just setting InterpolationOrder->1 (linear interpolation) in the Interpolation call solves the monotonicity problem but of course makes the curve much less smooth. It's still roughly the same shape though and could be improved to an arbitrary degree by increasing the sampling granularity of the fitted function before interpolating. – dreeves Apr 24 '10 at 22:36
0

You can try to fit a Bezier curve through the midpoints. Specifically I think you want a C2 continuous curve.

i_am_jorf
  • 53,608
  • 15
  • 131
  • 222
  • Thanks jeff. I think the tricky part is what I've now listed as criterion 2 in my revised version of the question, or what unutbu lists as criterion 5: the function should tend to 1 as x goes to infinity. – dreeves Apr 24 '10 at 04:42