
An example dataset can be found here. I had no choice but to put the data online because I could not generate representative data.

And here is the code:

myfunction <- function(df, parameter1, parameter2){
  # do the stuff you want to do
}

I have two parameters to be optimised in this function.

nik
  • You don't need `with` when using `subset`. Try removing it. – lmo May 13 '16 at 14:00
  • Don't use `subset`, use `[]`. Here is an example. Get the index: `which(c(0:1) > 1)`; get the value: `c(0,1)[which(c(0:1) > 1)]`. `which` needs a logical expression as input. – lmo May 13 '16 at 14:17
  • Use `str` to better understand the structure of your object. Then extract the list item that you are interested in. – lmo May 13 '16 at 14:23
  • I think you want to work on res@listData. – lmo May 13 '16 at 14:41
  • When you say you want to optimize it based on the fitted residuals, could you be a bit more specific? As far as I can see, there are many of them. Do you want to minimize the sum of squares, absolute deviation, or something else? – coffeinjunky May 20 '16 at 15:00
  • @coffeinjunky Based on any of them would be great, but normally the SSE (sum of squared errors) is used. – nik May 20 '16 at 15:01
  • Alright, but help me understand: the best fit will be the one without any smoothing. So, just not doing any filtering at all would give the best outcome in terms of the difference between input and fitted. Or, if you have to do some smoothing, choose the smallest filter length possible. But this surely can't be your objective. Am I overlooking something? – coffeinjunky May 20 '16 at 15:33
  • @coffeinjunky I think the best output will be the one with the lowest residual. – nik May 20 '16 at 16:16

2 Answers


Before I start the main part of the answer, a clarification: in the question, you say you want to minimize the "fitted residuals (the differences between the input signal and the smoothed signal)." It almost never makes sense to minimise the plain sum of residuals, because this sum can be negative; attempting to minimise it therefore rewards the largest negative sum of residuals that can be found (and typically will not even converge, as the residuals can become arbitrarily negative). What is almost always done instead is to minimise the sum of squares of the residuals, and that is what I do here.
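A toy illustration of the point (the numbers here are invented for the example, not taken from your data):

resids <- c(-3, 1, -2)  # hypothetical residuals
sum(resids)             # -4: minimising this rewards large negative residuals
sum(resids^2)           # 14: the sum of squares is a sensible objective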

So let us start with a payoff function that returns the sum of squares of the residuals:

payoff <- function(fl, forder) {
  # smooth the first column of df with a Savitzky-Golay filter
  M <- sav.gol(df[,1], fl = fl, forder = forder)
  # squared residuals between the smoothed and the raw signal
  resid2 <- (M - df[,1])^2
  sum(resid2)
}

Note that I do not pass df to this function as a parameter; the function simply accesses it from the parent scope. This avoids unnecessary copying of the data frame each time the function is called, reducing memory and time overhead.
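If you prefer to avoid the dependence on a global df, one alternative (my own variation, not part of the original answer) is to capture the data in a closure once:

# build a payoff function that captures the signal in a closure,
# so no global df is needed and the data is still not copied per call
make_payoff <- function(signal) {
  function(fl, forder) {
    M <- sav.gol(signal, fl = fl, forder = forder)
    sum((M - signal)^2)
  }
}
payoff <- make_payoff(df[, 1])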

Now on to the main question: how do we minimise this function over integer values 1 < fl < NROW(df) and forder in c(2, 4)?

The main difficulty here is that the optimisation parameters fl and forder are integers. Most standard optimisation methods (such as those used by optimize, optim, or nlm) are designed for continuous functions of continuous parameters. Discrete optimisation is another matter entirely, and approaches to it include methods such as genetic algorithms. See, for example, these SO posts here and here for some approaches to integer or discrete optimisation.

There are no perfect solutions for discrete optimisation, particularly if you are looking for a global rather than a local minimum. In your case, the sum of squares of residuals (payoff) function is not very well behaved and oscillates, so we really should be looking for a global minimum rather than one of the many local minima. The only certain way to find a global minimum is by brute force, and indeed we might take that approach here, given that a brute-force solution is computable in a reasonable length of time. The following code will calculate the payoff (objective) function at all values of fl and forder:

resord2 <- sapply(1:NROW(df),  FUN= function(x) payoff(x, 2))
resord4 <- sapply(1:NROW(df),  FUN= function(x) payoff(x, 4))

The time to calculate these functions increases only linearly with the number of rows of data. Even so, with your 43k rows this would take around a day and a half on my not particularly fast laptop.

Fortunately, however, we don't need to invest that much computation time. The following figure shows the sum of squares of the residuals for forder=2 (blue line) and forder=4 (red line), for values of fl up to 40.

resord2 <- sapply(1:40,  FUN= function(x) payoff(x, 2))
resord4 <- sapply(1:40,  FUN= function(x) payoff(x, 4))
plot(1:40, resord2, log="y", type="l", col="blue", xlab="fl", ylab="residual sum of squares")
lines(1:40, resord4, col="red")

[Figure: residual sum of squares (log scale) against fl, for forder = 2 (blue) and forder = 4 (red)]

It is clear that high values of fl lead to a high residual sum of squares. We can therefore limit the optimisation search to fl < max.fl. Here I use a max.fl of 40.

resord2 <- sapply(1:40,  FUN= function(x) payoff(x, 2))
resord4 <- sapply(1:40,  FUN= function(x) payoff(x, 4))
which.min(resord2)
# 3
which.min(resord4)
# 3
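Both orders find their minimum at fl = 3 here. To search both parameters jointly and report the single best (fl, forder) pair in one step, here is a small variation that is not in the original code above:

# joint exhaustive search over fl = 1:40 and both orders
grid <- expand.grid(fl = 1:40, forder = c(2, 4))
grid$sse <- mapply(payoff, grid$fl, grid$forder)
grid[which.min(grid$sse), ]  # best pair and its residual sum of squares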

If we want to convince ourselves that the residual sum of squares really does increase with fl over a larger range, we can do a quick reality check using a larger range of values for fl, incrementing in larger steps:

low_fl <- 10
high_fl <- 100
step_size <- 10
fl <- seq(low_fl, high_fl, by=step_size)
resord2 <- sapply(fl,  FUN= function(x) payoff(x, 2))
plot(fl, resord2)

[Figure: residual sum of squares against fl, for fl = 10, 20, …, 100]

dww
  • If only you'd linked this article in your OP. The article's method is very different to the question in the OP. What the authors do is not minimize the residuals (which makes no sense, because that equals no smoothing at all). As the authors note, "balance must be found between noise removal and signal distortion". So instead they minimize the difference between the autocorrelations of the residuals and of a detector blank. To do this, "the only parameter that must be known in advance is the auto-correlation function of the noise of the detector." You do not provide this detector noise autocorrelation. – dww May 20 '16 at 17:21
  • I did not provide that because I did not want to do exactly what they did. I prefer my own approach of minimising the error, because I will then use it on replicate samples, where it makes sense. The blank sample will have white noise, meaning no spikes at all, so from that point of view their method does not provide a great solution. Am I right? – nik May 20 '16 at 17:29
  • If you are looking for programming help, this is the place to be. My answer shows you how to code a solution to the question you asked. Whether the underlying statistics of what you are attempting make sense is not really an SO question, but something that you should discuss perhaps on cross validated, or perhaps with a statistician at your institution. Without more information, I can't be certain if your approach is good, although my hunch is that it does not sound quite right and you may be better sticking with a published methodology. – dww May 20 '16 at 18:09
  • Please add something else and I'll accept your solution: an evaluation stepping through the window sizes by 10, 50, and 100. What you showed was 1:40, one by one; I want it to step by 10, by 50, and by 100. – nik May 20 '16 at 19:03
  • I accepted it. May I ask you a favour? Now I have only one varied parameter during the optimisation; what if I had two? Here, one changes from 1 to the length of my sample and the other is only 2 or 4. I mean, what if I had a minimum and maximum for both parameters? – nik May 20 '16 at 21:01
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/112545/discussion-between-dww-and-nik). – dww May 20 '16 at 21:09

If I understand you correctly, you want to minimize the difference between T2 and your input, T.

First, edit sav.gol so that its last line is return(T2) (I thought this wasn't necessary, but it appears to be).
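That is, the function should end like this (a sketch only; the body of sav.gol comes from the question and is elided here):

sav.gol <- function(T, fl, forder = 4, dorder = 0) {
  # ... the Savitzky-Golay code from the question, which computes T2 ...
  return(T2)  # explicitly return the smoothed series
}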

The code below is not going to give you what I presume you would like the answer to be, since it will pick a filter length close to 0, but it is an efficient way of finding minima.

# make your residual function
resid <- function(fl, T, fo){
  fl <- round(fl)  # force the filter length to an integer
  T2 <- sav.gol(T, fl, forder = fo, dorder = 0)
  sum((T2 - T) ^ 2)  # sum of squared residuals
}

# optimize for order 2
results_2 <- optimize(f = resid, interval = c(1, length(df[, 1])),
                      T = df[, 1], fo = 2)

# optimize for order 4
results_4 <- optimize(f = resid, interval = c(1, length(df[, 1])),
                      T = df[, 1], fo = 4)
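optimize returns a list holding the minimising argument (minimum) and the objective value at that point (objective), so the results can be inspected like this:

round(results_2$minimum); results_2$objective  # best fl and its SSE for order 2
round(results_4$minimum); results_4$objective  # best fl and its SSE for order 4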

Edit:

I plotted the values of the residuals for order 2 as fl increases (up to 10k); optimize appears to find the first local minimum and may need some tweaking. I'd recommend that you specify some lower and upper bound for your level of filtering. Then you could use optimize with a different interval.
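For example (lb and ub below are hypothetical bounds that you would choose from domain knowledge):

lb <- 5    # hypothetical lower bound on fl
ub <- 101  # hypothetical upper bound on fl
results_2b <- optimize(f = resid, interval = c(lb, ub), T = df[, 1], fo = 2)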

library(ggplot2)
resids <- sapply(1:10000, resid, T = df[1:10000, 1], fo = 2)
qplot(seq_along(resids), resids / 10^10,
      xlab = 'fl', ylab = 'Residual', geom='line')

[Figure: residuals against fl for forder = 2]

By the way, the above sapply works as an alternative to optimize. You can run it on the full dataset:

resids <- sapply(1:length(df[, 1]), resid, T = df[, 1], fo = 2)

It looks like it will take about 10 hours on my laptop, but once it's done you can plot it as above and choose your filtering level by eye (or do which.min(resids[lb:ub]), where lb and ub are your lower and upper bounds, respectively).
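One detail to keep in mind with that last step: which.min on a slice returns an index relative to the slice, so shift it back to recover the actual fl (lb and ub are again hypothetical bounds):

lb <- 5; ub <- 500                            # hypothetical bounds
best_fl <- lb + which.min(resids[lb:ub]) - 1  # shift back to the absolute fl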

Soul Donut
  • Your function should return a single scalar value that you want to minimize based on these parameters and other data, so your function should take a matrix of your parameters as its first argument; then you can take them out of the matrix inside your function. So say the two arguments you want to estimate are vectors called `arg1` and `arg2`. Supposing they are of the same length, you set some sensible initial values for them, then put them in a matrix like this: `m <- cbind(arg1, arg2)`. In your function you'd pull them back out. – Soul Donut May 20 '16 at 14:09
  • Look at my original data; it is a matrix of 3 columns and many rows. The function is `sav.gol <- function(df, fl, forder=4, dorder=0)`, so I have my df; then fl, which takes many values from 1 up to the number of rows of my data; and forder, which is either 2 or 4, while dorder is always 0. I tried to use your approach but it gave me a lot of errors, because you are giving me an example for another function that has nothing to do with mine. – nik May 20 '16 at 14:18
  • 1
    @HoHo, I think the input arguments over which to maximize need to be discrete. This is somewhat unclear from the OP, but I think `fl` denotes a number of rows. – coffeinjunky May 20 '16 at 14:57
  • @HoHo I liked your answer and I appreciate your effort, but I don't want to use continuous optimisation, rather discrete. – nik May 20 '16 at 17:38
  • That's why I've included the `round` -- it ultimately does perform a discrete optimization. Alternatively, `sapply` is truly discrete; it will just take longer (although you can use the `doParallel` package and `foreach` to speed things up). – Soul Donut May 20 '16 at 17:56
  • pretty sure the discrete version with sapply is the same as the answer I already provided – dww May 20 '16 at 18:05
  • @dww, I had `sapply` as a comment on my second answer (which I deleted and added to this one) prior to your answer. `optim` is still faster and remains discrete in my above implementation. – Soul Donut May 20 '16 at 18:45
  • @HoHo Now only one variable varies over 1:length(df); what if I have two parameters? Can I still use it? – nik May 20 '16 at 20:49
  • Since one of your parameters only can take two values, I would just run it twice, then look at which of the two is smaller. You could condense it further: `lapply(c(2, 4), optimize, f = resid, interval = c(1, length(df[, 1])), T = df[, 1])`, or nest the `sapply` instead of the `optimize`. – Soul Donut May 20 '16 at 21:01