optim() for multi variable returns results on the starting position in R

Question

I would like to use function optim() in R to minimise the target function. The two optimised parameters both have constrains.

I have created a test sampel data. Flow is a random series data separated by NAs. The function NAins() can be seen at the end of this question.

    flow = c(rep(NA,10),NAins(as.data.frame(runif(5000)), .1)$runif)
    rain = runif (length(flow))
    event = with(rle(!is.na(flow )),cbind(length=lengths[values],position=cumsum(c(1,lengths))[values]));

This function is to calculate r2.

    test_function = function(ndays, event, flow, rain,upboundary){
      flowvolume = rainvolume = raininweek = raininmonth =NULL;
      for (i in 1:(length(event)/2)){
        if (upboundary < event[,'position'][i]){
          flowvolume[i] = sum(flow[(event[,'position'][i]):(event[,'position'][i]+event[,'length'][i]-1)], na.rm = TRUE) # total flow during the non NA period
          rainvolume[i] = sum(rain[(event[,'position'][i]):(event[,'position'][i]+event[,'length'][i]-1)], na.rm = TRUE) # total rain during the non NA period
          raininweek[i] = sum(rain[(event[,'position'][i]-ndays[1]):(event[,'position'][i]-1)], na.rm = TRUE) #total rain imediate before NA with a constrained period of nday[1]
          raininmonth[i] = sum(rain[(event[,'position'][i]-ndays[2]-ndays[1]):(event[,'position'][i]-ndays[1]-1)], na.rm = TRUE) #total rain iprior to nday[1] 
        } else {next}
      }
      -summary(lm(flowvolume ~ rainvolume + raininweek + raininmonth))$r.squared # to minimise R2
    }

This is the optimisation with constrains.

    results= optim(par=c(2,20), lower=c(1,10), upper=c(10,30),method="L-BFGS-B",test_function, event=event, rain=rain, flow=flow,upboundary=30)

In this simulation, Results always converge to the staring position. If optim() is not a good choose in this question, could you recommend some other packages or function to use?

Here is the function used to create sample flow data with random NA in between.

    ################################################################
    # RANDOMLY INSERT A CERTAIN PROPORTION OF NAs INTO A DATAFRAME #
    ################################################################
    NAins <-  NAinsert <- function(df, prop){
      n <- nrow(df)
      m <- ncol(df)
      num.to.na <- ceiling(prop*n*m)
      id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
      rows <- id %/% m + 1
      cols <- id %% m + 1
      sapply(seq(num.to.na), function(x){
        df[rows[x], cols[x]] <<- NA
      }
      )
      return(df)
    }

Are you sure this example runs as you expect? When I copy/paste the code i get: "Error in event[, "position"] : incorrect number of dimensions" (R version 3.1.0). I think you want to use named parameters for event/flow/rain in the optim call and also you have some parenthesis problems with your r.squared extraction i think. — MrFlick, Nov 26 '14 at 06:01
Thanks, I tried to make it short and easy to read so deleted it, Wasn't knowing it gonna cause problem. — Yu Deng, Nov 26 '14 at 06:17
The flowvolume, rainvolume, raininweek and raininmonth are all vectors of size 4 for any value of ndays. This essentially means the lm function has 4 observations to estimate 4 parameters and hence, always returns an R squared value of 1. Hence, the function test_function always returns -1 irrespective of the value of ndays. Please run the function step-by-step with a specific value of ndays and see whether each step performs as expected. — Ravi, Nov 26 '14 at 06:49
My code actually have over 200 observation, so it would have enough data points to estimate for these. I have repeated the random flow data for 40 times. — Yu Deng, Nov 26 '14 at 07:00
Your `test_function` does not contain free variables, since you supply all of them by yourself. You are trying to optimize a constant essentially. It seems you'd like to optimize two variables in ranges 1-10 and 10-30, but what do they correspond to? — tonytonov, Dec 01 '14 at 07:29
@ tonytonov: I don't really get it. Isn't "ndays" a two valued vector? If I get rid of those three arguments (lower, upper, method), this line of code actually can produce pretty good estimation but I need those boundaries. — Yu Deng, Dec 01 '14 at 07:34

score 0 · Answer 1 · edited May 23 '17 at 12:27

It seems the optimization is never moving far enough away from the starting point because these parameters are implicitly integer. However optim does not know this. It just sees a flat gradient.

If your parameter space for ndays is small, as you indicate in your question, try enumerating over all these combinations instead. Here is a handy function. Hat tip to How to optimize for integer parameters (and other discontinuous parameter space) in R?.

library(NMOF)
grid<- gridSearch(test_function, list(ndays1=seq(1,10), ndays2=seq(10,22)),
                  event=event, rain=rain, flow=flow, upboundary=30)

grid$minfun
grid$minlevels

Notice I had to cut off part of the search space for ndays[2], because it resulted in a negative subscript error. You will need to add some checks in your function to test for negative subscripts.

Thanks I tried this method bfr, but since my actual search area is 5000, each run takes at least 20mins+, It's not really feasible. — Yu Deng, Dec 02 '14 at 23:24
It's just I have 800+ case to optimise, 20mins each case seem sa little bit too much. I am currently using function nmkb(), it's really fast but occasionally have stating position singular value issue. — Yu Deng, Dec 04 '14 at 00:30

score 0 · Answer 2 · edited May 23 '17 at 12:13

I think enumeration is the best option especially if you have few variables and a very nonlinear function. Nelder Mead or Hooke Jeeves are bound to give you only local solutions. The function here appears to be fairly nonlinear and quite flat in certain areas.

You could get some speedup using parallel packages like foreach and doParallel from Revolution Analytics. In the below example, I do a pure parallel implementation of exhaustive search. I have modified the test_function to also return x variable.

test_function2 = function(ndays, event, flow, rain,upboundary){
  flowvolume = rainvolume = raininweek = raininmonth =NULL;
  for (i in 1:(length(event)/2)){
    if (upboundary < event[,'position'][i]){
      flowvolume[i] = sum(flow[(event[,'position'][i]):(event[,'position'][i]+event[,'length'][i]-1)], na.rm = TRUE) # total flow during the non NA period
      rainvolume[i] = sum(rain[(event[,'position'][i]):(event[,'position'][i]+event[,'length'][i]-1)], na.rm = TRUE) # total rain during the non NA period
      raininweek[i] = sum(rain[(event[,'position'][i]-ndays[1]):(event[,'position'][i]-1)], na.rm = TRUE) #total rain imediate before NA with a constrained period of nday[1]
      raininmonth[i] = sum(rain[(event[,'position'][i]-ndays[2]-ndays[1]):(event[,'position'][i]-ndays[1]-1)], na.rm = TRUE) #total rain iprior to nday[1] 
    } else {next}
  }
  rsq=-summary(lm(flowvolume ~ rainvolume + raininweek + raininmonth))$r.squared # to minimise R2
  return(c(ndays,rsq))
}

x1<-c(1:10)
x2<-c(10:30)
a<-expand.grid(x1,x2)

library(foreach)
library(doParallel)

cl <- makePSOCKcluster(4)
registerDoParallel(cl)

mymin <-function(z1,z2) {
  if (z1[[3]]<=z2[[3]]) {
    return(z1)
  } else {
    return(z2)
  }
}

ptm<-proc.time()
#c<-matrix(foreach(i=1:210) %dopar% test_function(as.numeric(a[i,]),event,flow,rain,30),10)
c<-foreach(i=1:210,.combine=mymin) %dopar% test_function2(as.numeric(a[i,]),event,flow,rain,30)
proc.time()-ptm

stopCluster(cl)

The run time for this was around 4.6s

> ptm<-proc.time()
> #c<-matrix(foreach(i=1:210) %dopar% test_function(as.numeric(a[i,]),event,flow,rain,30),10)
  > c<-foreach(i=1:210,.combine=mymin) %dopar% test_function2(as.numeric(a[i,]),event,flow,rain,30)
> proc.time()-ptm
user  system elapsed 
0.211   0.030   4.596 
> c
[1]  1.0000000 11.0000000 -0.9363349

For the NMOF implementation it was 11s

> ptm<-proc.time()
> grid<- gridSearch(test_function, list(ndays1=seq(1,10), ndays2=seq(10,30)),
                    +                   event=event, rain=rain, flow=flow, upboundary=30)
2 variables with 10, 21 levels: 210 function evaluations required.
> proc.time()-ptm
user  system elapsed 
10.963   0.004  10.974
> grid$minfun
[1] -0.9363349
> grid$minlevels
[1]  1 11

I hope this helps. Please see documentation of foreach, if you plan to take this approach.

Another option for speedup is to use faster ways to solve the lm so you can cut down on the individual function call evaluation time. I see a few options in the below link:

How to compute minimal but fast linear regressions on each column of a response matrix?

Excuse me, but what is a[i,] in your foreach statement? – Yu Deng Dec 10 '14 at 07:25 — Yu Deng, Dec 10 '14 at 07:25
x1<-c(1:10); x2<-c(10:30); a<-expand.grid(x1,x2) – Dhanesh Dec 23 '14 at 05:30 — Dhanesh, Dec 23 '14 at 05:30

optim() for multi variable returns results on the starting position in R

2 Answers2