3

I need to solve a problem which entails comparing two matrices with the same number of columns. One of these is manipulated until the best match is obtained. The way I score the differences between the two matrices is quite convoluted and I still have to finalize it. What I'm really interested at the moment in is finding a search/optimization algorithm that works with positive integers only. I've created a simple example with a simple function to maximise. Let's say I have a dataset D.

 D <- data.frame(rbind(c(1,1,1),
                       c(1,1,0),c(1,1,0),c(1,1,0),c(1,0,0),
                       c(0,0,0),c(1,0,0),c(1,0,0),c(1,1,0),
                       c(1,0,0),c(1,1,1),c(1,1,0),c(1,0,0),
                       c(1,0,0),c(1,0,1)))

I want to find which re-arrangement of Dx gives me the lowest absolute difference.

Dx<-data.frame(rbind(c(1,1,0),c(1,0,0),c(0,0,0),c(1,1,0)))

So I could go through all the possible permutations using the function below

    library(combinat)
    SPACE <- t(as.data.frame(list(permn(1:3))))
    f <- function(x){
      if(anyDuplicated(x)>0){return(0)}
      Dist<-NA
      for (i in 1:nrow(D)){
        Dist[i]<-sum(abs(Dx[,x]-t(D[i,])))} 
    return(sum(Dist))}
apply(SPACE,1,f)

and get the right result.However this has 2 disadvantages for the data I'm actually using:

  1. I have to specify SPACE- all the possible column orders and
  2. apply goes through each possible permutations and calculates my error score.

Both A and B become computationally difficult as the number of columns in my matrix increases. I think even keeping all the possible permutations of the numbers 1 to 14 in one R session is impossible on most computers.

An optimization algorithm I found is grid search. This starts to address A. It means that I don't have to specify SPACE (i.e. all the possible permuatations), so it's one step in the right direction, as I want to look at much larger datasets.

library(NMOF)
gridSearch(f, rep(list(seq(1,ncol(D))),ncol(D)))

But obviously this does not address B, as it goes through each possible iteration. What if my dataset was very large, let's say 15 or even more columns?

Keeping in mind that my parameters can only be positive integers (i.e. they are column numbers), is there an R algorithm that would allow me to find the best column order (or at least a good approximation) within a reasonable amount of time (e.g. 1-2 days), when I'm dealing with much larger datasets? This may look like a silly example, but it emulates very well the problem I'm trying to solve. I've tried optim() with method="SANN", but got nowhere. Unfortunately I have very little experience so do let me know if you think this is an unworkable problem. Just to start with an easier dataset (few rows but lots of columns) problem, do you think it's possible to find the best column order as shown above for D2 by using some kind of clever optimization?

   #D2
D<-cbind(D,D,D,D,D)
ncol(D)
Dx<-cbind(Dx,Dx,Dx,Dx,Dx)
#examples 
f(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
f(c(13,2,4,3,5,6,7,8,9,10,11,12,1,14,15))

EDIT: My main interest is in understanding how to use optimization algorithms that use a series of unique positive integrals (basically ranks) in the search process, rather than solving this particular problem. I've used a simple example in this case so that it's easy to replicate, but the two datasets I'm comparing often differ in number of rows and other aspects which I'm not detailing here....the distance function I'm building handles this well so understanding how to apply an optimization algorithm (e.g. Genetic Algorithm was suggested below) to the function f above using D2 is therefore my main problem at the moment.

Mercelo
  • 231
  • 1
  • 5
  • 14
  • 2
    Maybe a Genetic Algorithm approach can help you? – Fernando Feb 25 '14 at 18:17
  • Fernando, thank you very much. Would you be able to show me how to apply such method to the f function above? – Mercelo Feb 26 '14 at 09:41
  • 1
    Check out the `GAPerm` function from the `gaoptim` package, it performs permutation optimization (TSP-like problems). – Fernando Feb 26 '14 at 14:57
  • Thanks. I've had a look and I don't quite understand how I would tell the function that I need positive integers... – Mercelo Feb 26 '14 at 15:58
  • I'll take a look and see what i can come up with, then i'll post here again. – Fernando Feb 26 '14 at 16:08
  • Fernando, I've looked into it. I think it sounds like a good algorithm. How would you do what I did with apply above (let's stick to the 3 column dataset) using GAPerm? I can't understand how to specify which numbers it should select from (e.g. c(1:3) ) Feel free to post it as answer...the bounty expires in 3 hours...Thanks – Mercelo Mar 06 '14 at 16:04
  • @Mercelo, I can propose an integer programming approach that works for the cost function `f` you provide in the edited problem, but it sounds like this is not your real cost function. The form of your cost function is critically important to solving this problem optimally. If your cost function can be integrated into an integer optimization model, then you may be able to efficiently obtain optimal solutions to your problem. If your cost function `f` is truly a black box, then you will need to resort to approximate solutions like a GA. – josliber Mar 06 '14 at 16:59

2 Answers2

3

If your objective function f must truly be seen as a black box, then we'll need to resort to approximate approaches such as a genetic algorithm. Here is a solution using the gaoptim package, which is maximizing f(p) among all permutations p of the columns of Dx:

library(gaoptim)
myGA = GAPerm(f, ncol(Dx), popSize=10)
myGA$evolve(10)
myGA
# Results for 10 Generations:
# Mean Fitness:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    95.0   107.4   115.6   112.4   118.3   120.6 
# 
# Best Fitness:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#     125     125     125     125     125     125
# 
# Best individual:
# [1] 3 1 2
# 
# Best fitness value:
# [1] 125

In this case it found the best possible solution, with objective value 125, though in general there are no guarantees about the quality of the solution that will be returned by a genetic algorithm.

josliber
  • 43,891
  • 12
  • 98
  • 133
2

As I understand it, you are searching for the best assignments from a set of candidate columns to a set of target columns, and there is some cost associated with matching a candidate column to a target column. You are searching for the one-to-one matching that minimizes overall cost.

This is called the assignment problem, which is a classical problem in operations research. Your approach of grid search will have an exponential runtime (you need to search all possible assignments), but there are much more efficient approaches to this problem, many of which rely on linear programming.

You can solve your problem in R using the lp.assign function from the lpSolve package, providing your pairwise distances between the columns of your matrix:

# Build cost matrix
costs <- as.matrix(dist(t(D), method="manhattan"))
costs
#    X1 X2 X3
# X1  0  7 11
# X2  7  0  6
# X3 11  6  0

# Solve assignment problem
library(lpSolve)
solution <- lp.assign(costs)$solution
apply(solution > 0.999, 2, which)
# [1] 1 2 3

This means we selected permutation 1, 2, 3 as the most promising.

josliber
  • 43,891
  • 12
  • 98
  • 133
  • Thank you very much for your answer. Unfortunately my distance algorithm looks at distance between rows of different datasets and is quite complicated so I can't create a neat cost matrix. Do you think there's a way to use the f function above in some kind of optimization engine? Apologies if this sounds a bit strange, and thanks again for trying to find a simpler solution, but I need to rely on a distance function that returns a number for a given permutation (a bit like f above). – Mercelo Feb 26 '14 at 12:27
  • 1
    The tools you would use would really depend on the cost function itself. Normally I would use linear or integer programming for a problem like this, but whether that's appropriate depends on the function. I suggest editing the question to give more details, since your minimal reproducible example seems to have been "too minimal." – josliber Feb 26 '14 at 15:10
  • Thanks for your comments. I've tried to give a better example. – Mercelo Feb 26 '14 at 16:52
  • I wonder if the technique you outlined could be used for what I need. Let's say we pick Dx as reference and then establish the distance between Dx and 200 column-rearrangements of itself: let's call these DxR1 to DxR200 where DxR200 is the rearrangement of Dx which is most different from the original Dx. We can then run function f on the first 200 solutions (possibly using optim?)- if DxR200 is the "best" column re-arrangements according to f then chances are we need to repeat the process but this time taking DxR200 as the reference. We then repeat the process until we reach a minimum.... – Mercelo Mar 06 '14 at 15:18
  • Just out of curiosity, looking at your solution for the assignment problem (which I know very little about) and not considering my initial question...is it possible to rank in ascending order the cost of all the combinations? Above you found the best candidate (1 2 3) using lp.assign which minimises the costs. This is great but let's say I wanted to look the cost of all the possible combinations...is there a formula to obtain this? And by extension if you used the much larger Dx (15 cols) how would I get the first 500 best candidates (or a good approximation of the best 500)? – Mercelo Mar 06 '14 at 21:34
  • Yes, you could formulate the assignment problem and solve to optimality. Then, you could add a constraint eliminating just the optimal solution and re-solve to optimality. You could repeat this any number of times. – josliber Mar 06 '14 at 21:39
  • Yes makes sense, but how do you add this constraint? For example in the example you gave above how would you ask lp.assign to not consider 1,2,3? Thanks again – Mercelo Mar 06 '14 at 21:46
  • 1
    You would need to formulate it as an optimization problem and solve using the `lp` function from the `lpSolve` package instead of the `lp.assign` convenience function. You could probably post it as a new question and get some help from the community. – josliber Mar 06 '14 at 22:50