I need to solve a problem that entails comparing two matrices with the same number of columns. One of them is rearranged until the best match with the other is obtained. The way I score the differences between the two matrices is quite convoluted and I still have to finalize it. What I'm really interested in at the moment is finding a search/optimization algorithm that works with positive integers only. I've created a simple example with a simple objective function to minimise. Let's say I have a dataset D.
D <- data.frame(rbind(c(1,1,1), c(1,1,0), c(1,1,0), c(1,1,0), c(1,0,0),
                      c(0,0,0), c(1,0,0), c(1,0,0), c(1,1,0), c(1,0,0),
                      c(1,1,1), c(1,1,0), c(1,0,0), c(1,0,0), c(1,0,1)))
I want to find which rearrangement of the columns of Dx gives me the lowest total absolute difference from D.
Dx <- data.frame(rbind(c(1,1,0), c(1,0,0), c(0,0,0), c(1,1,0)))
So I could go through all the possible column permutations using the function below
library(combinat)

## all 3! = 6 permutations of the column indices, one per row
SPACE <- do.call(rbind, permn(1:3))

f <- function(x){
  if (anyDuplicated(x) > 0) return(Inf)  # not a valid permutation: penalise it
  Dxp <- as.matrix(Dx)[, x]              # Dx with its columns reordered
  Dist <- numeric(nrow(D))
  for (i in 1:nrow(D)) {
    ## subtract row i of D from every row of the permuted Dx; sweep() lines the
    ## row up with the columns (a plain Dx[, x] - t(D[i, ]) would recycle the
    ## values down the columns and mis-align them, since nrow(Dx) is not a
    ## multiple of ncol(D))
    Dist[i] <- sum(abs(sweep(Dxp, 2, unlist(D[i, ]))))
  }
  sum(Dist)
}

apply(SPACE, 1, f)
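To read off the winning order directly, I can then do:

SPACE[which.min(apply(SPACE, 1, f)), ]  # the column order with the lowest score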
This gives the right result. However, it has two disadvantages for the data I'm actually using:

- (A) I have to specify SPACE, i.e. every possible column order, up front; and
- (B) apply then goes through each of those permutations and calculates my error score for each one.

Both A and B become computationally intractable as the number of columns in my matrix increases: I think even keeping all the possible permutations of the numbers 1 to 14 in one R session is impossible on most computers.
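To put numbers on that:

factorial(14)           # 87,178,291,200 possible orders of 14 columns
14 * factorial(14) * 4  # ~4.9e12 bytes (almost 5 TB) just to store them as integers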
An optimization algorithm I found is grid search, which starts to address A: I don't have to specify SPACE (i.e. all the possible permutations) myself, so it's one step in the right direction, as I want to look at much larger datasets.
library(NMOF)
## each of the ncol(D) parameters may take any column index from 1 to ncol(D)
gridSearch(f, rep(list(seq(1, ncol(D))), ncol(D)))
But obviously this does not address B, as it still visits every point of the grid. What if my dataset were very large, say 15 or even more columns?
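And the grid is far bigger than the set of permutations, since gridSearch visits the full Cartesian product:

15^15          # ~4.4e17 points in the full grid for 15 columns
factorial(15)  # only ~1.3e12 of these are valid permutations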
Keeping in mind that my parameters can only be positive integers (they are column numbers), is there an R algorithm that would let me find the best column order (or at least a good approximation to it) within a reasonable amount of time (e.g. 1-2 days) when I'm dealing with much larger datasets? This may look like a silly example, but it emulates the problem I'm trying to solve very well. I've tried optim() with method = "SANN" but got nowhere (a sketch of my attempt follows the D2 example below). Unfortunately I have very little experience, so do let me know if you think this is an unworkable problem. Just to start with an easier problem (few rows but lots of columns), do you think it's possible to find the best column order, as shown above, for D2 by using some kind of clever optimization?
#D2
D <- cbind(D, D, D, D, D)        # five copies of D side by side
ncol(D)                          # now 15 columns
Dx <- cbind(Dx, Dx, Dx, Dx, Dx)  # and likewise for Dx
#examples
f(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))  # the identity order
f(c(13,2,4,3,5,6,7,8,9,10,11,12,1,14,15))  # a shuffled order
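From the travelling-salesman example in ?optim I gather that method = "SANN" only makes sense here if I supply a custom gr function that generates each new candidate by swapping two positions, so that every candidate stays a valid permutation. This is a minimal sketch of what I mean, untested beyond this toy data (the control settings are guesses, not tuned values):

swap2 <- function(x, ...) {  # propose a neighbour: exchange two positions
  idx <- sample(seq_along(x), 2)
  x[idx] <- x[rev(idx)]
  x
}
set.seed(1)
res <- optim(par = seq_len(ncol(D)), fn = f, gr = swap2, method = "SANN",
             control = list(maxit = 20000, temp = 10))
res$par    # best column order found
res$value  # its total absolute difference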
EDIT: My main interest is in understanding how to use optimization algorithms that search over a series of unique positive integers (basically ranks), rather than in solving this particular problem. I've used a simple example here so that it's easy to replicate, but the two datasets I'm comparing often differ in the number of rows and in other aspects I'm not detailing here... the distance function I'm building handles this well, so my main problem at the moment is understanding how to apply an optimization algorithm (e.g. the genetic algorithm suggested below) to the function f above using D2.
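For concreteness, and assuming the suggestion refers to the GA package: its permutation encoding would keep every candidate a valid ordering, and since ga() maximises, f would have to be negated. A minimal sketch under those assumptions, not a tested solution:

library(GA)
res <- ga(type = "permutation",
          fitness = function(x) -f(x),  # ga() maximises; f is a distance
          lower = 1, upper = ncol(D),   # search over orderings of 1..15
          popSize = 50, maxiter = 500, run = 100)
res@solution       # best column order(s) found
-res@fitnessValue  # the corresponding total absolute difference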