
I have created an approxfun-style function with the binsmooth package, which estimates means from binned data.

library(binsmooth)

# upper bin edges; NA marks the open-ended top bin
binedges <- c(10000,15000,20000,25000,30000,35000,40000,45000,
              50000,60000,75000,100000,125000,150000,200000,NA)
bincounts <- c(157532,97369,102673,100888,90835,94191,87688,90481,
               79816,153581,195430,240948,155139,9452,92166,103217)
# third argument is the known mean of the distribution
splb <- splinebins(binedges, bincounts, 76091)

Calling splb$splineCDF(x) returns the cumulative probability y for a given x, but I want the inverse: the x at which the CDF equals 0.5, i.e. the median.

I understand that the approach in the question below is supposed to achieve this, but it doesn't appear to work for functions created with the binsmooth package:

get x-value given y-value: general root finding for linear / non-linear interpolation function

I've put together a simple loop that finds an approximate value, but it is unsatisfying and computationally expensive:


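# the CDF evaluated at a single income value
splb$splineCDF(50000)

# step through incomes in increments of 10 until the CDF reaches 0.5
probability <- 0
income <- 0
while(probability < 0.5){
  # step first, then evaluate, so income is the value that crossed 0.5
  income <- income + 10
  probability <- splb$splineCDF(income)
}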

Any ideas?

Cettt

1 Answer


I'd first try a numerical optimiser to find the median and see whether it works well enough. Validation is easy in this case: check how close splb$splineCDF(solution) is to 0.5. You could add a test, e.g. if abs(splb$splineCDF(solution) - 0.5) > 0.001, stop the script and debug.
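The validation test described above could look something like this. It is only a sketch: check_median is a hypothetical helper (not part of binsmooth), and toy_cdf is a stand-in normal CDF used so the example runs without the package. With the question's data you would pass splb$splineCDF instead.

```r
# Hypothetical helper: stop if `solution` is not close enough to the
# median of the supplied CDF, otherwise return it invisibly.
check_median <- function(cdf, solution, tol = 1e-3) {
  err <- abs(cdf(solution) - 0.5)
  if (err > tol) {
    stop(sprintf("CDF at solution is %.4f away from 0.5", err))
  }
  invisible(solution)
}

# Stand-in CDF for illustration only (assumption: any function that maps
# x to a cumulative probability works here, including splb$splineCDF).
toy_cdf <- function(x) pnorm(x, mean = 50000, sd = 20000)

check_median(toy_cdf, 50000)                                # passes silently
res <- try(check_median(toy_cdf, 80000), silent = TRUE)     # fails the check
inherits(res, "try-error")
```

The tolerance of 0.001 is the one suggested above; tighten it if you need the median to more decimal places of probability.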

The solution below uses optimize() from the stats package, which ships with base R.

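# manual step version: walk up in steps of 10 until the CDF reaches 0.5
manual_version <- function(splb){
  probability <- 0
  income <- 0
  while(probability < 0.5){
    # step first, then evaluate, so the returned income is the one that crossed 0.5
    income <- income + 10
    probability <- splb$splineCDF(income)
  }
  return(income)
}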

# try using a one dimensional optimiser - see ?optimize
# note: relies on the global `binedges` vector for the search interval
optim_version <- function(splb, plot=TRUE){
  # requires a continuous function to optimise, with the minimum at the median:
  # the squared distance between the CDF and 0.5 is zero exactly at the median
  objfun <- function(x){
    (0.5 - splb$splineCDF(x))^2
  }

  # visualise the objective function
  if(isTRUE(plot)){
    x_range <- seq(min(binedges, na.rm=TRUE), max(binedges, na.rm=TRUE), length.out = 100)
    z <- objfun(x_range)
    plot(x_range, z, type="l", main="objective function to minimise")
  }

  # one dimensional optimisation to get the point where the CDF is closest to 0.5
  out <- optimize(f=objfun, interval = range(binedges, na.rm=TRUE))

  return(out$minimum)
}

# test them out
v1 <- manual_version(splb)
v2 <- optim_version(splb, plot=TRUE)
splb$splineCDF(v1)
splb$splineCDF(v2)

# time them
library(microbenchmark)
microbenchmark("manual"={
  manual_version(splb)
}, "optim"={
  optim_version(splb, plot=FALSE)
}, times=50)
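
As an alternative to minimising a squared distance, the same problem can be posed as root finding: solve CDF(x) - 0.5 = 0 with uniroot() from the stats package. This is a minimal sketch, not the method used above. median_from_cdf is a hypothetical helper name, and toy_cdf is a stand-in exponential CDF so the example runs without binsmooth. It assumes the CDF is continuous and that lower and upper bracket the median.

```r
# Hypothetical helper: find x with cdf(x) = 0.5 by root finding.
# Assumes cdf(lower) < 0.5 < cdf(upper), otherwise uniroot() raises an error.
median_from_cdf <- function(cdf, lower, upper, tol = 1e-8) {
  uniroot(function(x) cdf(x) - 0.5,
          lower = lower, upper = upper, tol = tol)$root
}

# Stand-in CDF for illustration only: exponential with mean 60000,
# whose true median is 60000 * log(2).
toy_cdf <- function(x) pexp(x, rate = 1 / 60000)

m <- median_from_cdf(toy_cdf, lower = 0, upper = 200000)
abs(toy_cdf(m) - 0.5) < 1e-6     # validation check on the result
```

With the data from the question, the call would presumably be median_from_cdf(splb$splineCDF, 0, 200000), assuming the fitted CDF is below 0.5 at 0 and above 0.5 at 200000.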
Jonny Phelps