I am using the GA Package to minimize a function. Below are few stages that I have implemented.
0. Libraries and dataset
library(clusterSim) ## for index.DB()
library(GA) ## for ga()
data("data_ratio")
dataset2 <- data_ratio
set.seed(555)
1. Binary encoding and generate initial population.
initial_population <- function(object) {
## generate a population where for each individual, there will be number of 1's fixed between three to six
population <- t(replicate(object@popSize, {i <- sample(3:6, 1); sample(c(rep(1, i), rep(0, object@nBits - i)))}))
return(population)
}
2. Fitness Function Minimizes Davies-Bouldin (DB) Index.
DBI2 <- function(x) {
## number of 1's will represent the initial selected centroids and hence the number of clusters
cl <- kmeans(dataset2, dataset2[x == 1, ])
dbi <- index.DB(dataset2, cl=cl$cluster, centrotypes = "centroids")
score <- -dbi$DB
return(score)
}
3. User defined crossover operator. This method of crossover will avoid situations where no clusters are 'turned-on'. The pseudocode can be found here.
pairwise_crossover <- function(object, parents){
fitness <- object@fitness[parents]
parents <- object@population[parents, , drop = FALSE]
n <- ncol(parents)
children <- matrix(as.double(NA), nrow = 2, ncol = n)
fitnessChildren <- rep(NA, 2)
## finding the min no. of 1's between 2 parents
m <- min(sum(parents[1, ] == 1), sum(parents[2, ] == 1))
## generate a random int from range(1,m)
random_int <- sample(1:m, 1)
## randomly select 'random_int' gene positions with 1's in parent[1, ]
random_a <- sample(1:length(parents[1, ]), random_int)
## randomly select 'random_int' gene positions with 1's in parent[1, ]
random_b <- sample(1:length(parents[2, ]), random_int)
## union them
all <- sort(union(random_a, random_b))
## determine the union positions
temp_a <- parents[1, ][all]
temp_b <- parents[2, ][all]
## crossover
parents[1, ][all] <- temp_b
children[1, ] <- parents[1, ]
parents[2, ][all] <- temp_a
children[2, ] <- parents[2, ]
out <- list(children = children, fitness = fitnessChildren)
return(out)
}
4. Mutation.
k_min <- 2
k_max <- ceiling(sqrt(75))
my_mutation <- function(object, parent){
pop <- parent <- as.vector(object@population[parent, ])
for(i in 1:length(pop)){
if((sum(pop == 1) < k_max) && pop[i] == 0 | (sum(pop == 1) > k_min && pop[i] == 1)) {
pop[i] <- abs(pop[i] - 1)
return(pop)
}
}
}
5. Putting the pieces together. Using roulette-wheel selection, crossover prob. = 0.8, mutation prob. = 0.1
g2<- ga(type = "binary",
population = initial_population,
fitness = DBI2,
selection = ga_rwSelection,
crossover = pairwise_crossover,
mutation = my_mutation,
pcrossover = 0.8,
pmutation = 0.1,
popSize = 100,
nBits = nrow(dataset2))
I have created my initial population in a way that for each individual in the population, there will be number of 1's
fixed between three to six. The crossover and mutation operator is designed to ensure the solution do not end up having too many clusters (1's
) being 'turned-on'. I have tried out my crossover and mutation functions separately before integrating them, and they seem to be working fine.
Ideally, the final solution will have number of 1's
+-=1 from the initial population,i.e, if an individual has three 1's
in its chromosome, it will end up randomly having either two, three or four 1's
. But I got this solution instead, which shows 12 clusters (1's
) being 'turned-on', which means the crossover and mutation operators did out went well.
> sum(g2@solution==1)
[1] 12
The problem here is reproducible by copying all the code. Anyone familiar with the GA Package can help me out here?
[EDITED]
Trying with a different dataset iris
, got me into the following error. (Changed only the data, the rest of the settings remained)
0. Libraries and dataset
library(clusterSim) ## for index.DB()
library(GA) ## for ga()
## removed last column since it is a categorical data
dataset2 <- iris[-5]
set.seed(555)
> Error in kmeans(dataset2, centers = dataset2[x == 1, ]) :
initial centers are not distinct
I tried looking into the code, and found out this error was caused by if(any(duplicated(centers)))
. What would it possibly mean?