3

I am using the GA Package to minimize a function. Below are few stages that I have implemented.

0. Libraries and dataset

library(clusterSim)      ## for index.DB()
library(GA)              ## for ga() 
data("data_ratio")
dataset2 <- data_ratio
set.seed(555)

1. Binary encoding and generate initial population.

initial_population <- function(object) {
    ## generate a population where for each individual, there will be number of 1's fixed between three to six
    population <- t(replicate(object@popSize, {i <- sample(3:6, 1); sample(c(rep(1, i), rep(0, object@nBits - i)))}))
    return(population)
}

2. Fitness Function Minimizes Davies-Bouldin (DB) Index.

DBI2 <- function(x) {
    ## number of 1's will represent the initial selected centroids and hence the number of clusters
    cl <- kmeans(dataset2, dataset2[x == 1, ])
    dbi <- index.DB(dataset2, cl=cl$cluster, centrotypes = "centroids")
    score <- -dbi$DB

    return(score)
}

3. User defined crossover operator. This method of crossover will avoid situations where no clusters are 'turned-on'. The pseudocode can be found here.

pairwise_crossover <- function(object, parents){
    fitness <- object@fitness[parents]
    parents <- object@population[parents, , drop = FALSE]
    n <- ncol(parents)
    children <- matrix(as.double(NA), nrow = 2, ncol = n)
    fitnessChildren <- rep(NA, 2)

    ## finding the min no. of 1's between 2 parents
    m <- min(sum(parents[1, ] == 1), sum(parents[2, ] == 1))
    ## generate a random int from range(1,m)
    random_int <- sample(1:m, 1)
    ## randomly select 'random_int' gene positions with 1's in parent[1, ]
    random_a <- sample(1:length(parents[1, ]), random_int)
    ## randomly select 'random_int' gene positions with 1's in parent[1, ]
    random_b <- sample(1:length(parents[2, ]), random_int)
    ## union them
    all <- sort(union(random_a, random_b))
    ## determine the union positions
    temp_a <- parents[1, ][all]
    temp_b <- parents[2, ][all]

    ## crossover
    parents[1, ][all] <- temp_b
    children[1, ] <- parents[1, ]
    parents[2, ][all] <- temp_a
    children[2, ] <- parents[2, ]

    out <- list(children = children, fitness = fitnessChildren)
    return(out)
}

4. Mutation.

k_min <- 2
k_max <- ceiling(sqrt(75))

my_mutation <- function(object, parent){
    pop <- parent <- as.vector(object@population[parent, ])
    for(i in 1:length(pop)){
            if((sum(pop == 1) < k_max) && pop[i] == 0 | (sum(pop == 1) > k_min && pop[i] == 1)) {
                    pop[i] <- abs(pop[i] - 1)
                    return(pop) 
            }

    }

}

5. Putting the pieces together. Using roulette-wheel selection, crossover prob. = 0.8, mutation prob. = 0.1

g2<- ga(type = "binary", 
    population = initial_population, 
    fitness = DBI2, 
    selection = ga_rwSelection,
    crossover = pairwise_crossover,
    mutation = my_mutation,
    pcrossover = 0.8,
    pmutation = 0.1,
    popSize = 100, 
    nBits = nrow(dataset2))

I have created my initial population in a way that for each individual in the population, there will be number of 1's fixed between three to six. The crossover and mutation operator is designed to ensure the solution do not end up having too many clusters (1's) being 'turned-on'. I have tried out my crossover and mutation functions separately before integrating them, and they seem to be working fine.

Ideally, the final solution will have number of 1's +-=1 from the initial population,i.e, if an individual has three 1's in its chromosome, it will end up randomly having either two, three or four 1's. But I got this solution instead, which shows 12 clusters (1's) being 'turned-on', which means the crossover and mutation operators did out went well.

> sum(g2@solution==1)
[1] 12

The problem here is reproducible by copying all the code. Anyone familiar with the GA Package can help me out here?

[EDITED]

Trying with a different dataset iris, got me into the following error. (Changed only the data, the rest of the settings remained)

0. Libraries and dataset

library(clusterSim)      ## for index.DB()
library(GA)              ## for ga() 
## removed last column since it is a categorical data
dataset2 <- iris[-5]
set.seed(555)

> Error in kmeans(dataset2, centers = dataset2[x == 1, ]) : 
  initial centers are not distinct

I tried looking into the code, and found out this error was caused by if(any(duplicated(centers))). What would it possibly mean?

Community
  • 1
  • 1
jacky_learns_to_code
  • 824
  • 3
  • 11
  • 29

1 Answers1

2

A few points worth mentioning:

  1. In crossover, in order to randomly select 'random_int' gene positions with 1's in parent[1, ] you change the following line of code from

    random_a <- sample(1:length(parents[1, ]), random_int)

    to

    random_a <- sample(which(parents[1, ]==1), random_int)

    and similarly for the other parernt.

    However, this crossover strategy I think guarantees that any offspring can have total number of cluster bits turned on at most as the maximum number of 1 bits of its parents (which can be 6 in this case from the initial population, should not it be 4 if you want just 1 bit difference in the solution gene?).

    The following figure shows 3 randomly selected positions where at least one of the parent genes have 1 bit, while crossover and the offspring generated.

enter image description here

  1. In the mutation function, I think, to be more explicit, we should change this line of code

    if((sum(pop == 1) < k_max) && pop[i] == 0 | (sum(pop == 1) > k_min && pop[i] == 1))

    by

    if((sum(pop == 1) < k_max && pop[i] == 0) | (sum(pop == 1) > k_min && pop[i] == 1))

    with proper parenthesis.

  2. Also, it seems that your fitness function (Davies-Bouldin's index measuring cluster separation) favors more clusters to be turned on.

Finally I think it's the mutation that's the culprit, if you change k_max to a low value (e.g., 3) and pmutation to a low value (e.g., pmutation = 0.01), you will find in the final solutions all the genes have 4 bits turned on.

[EDITED]

set.seed(1234)
k_min = 2
k_max = 3 #ceiling(sqrt(75))
#5. Putting the pieces together. Using roulette-wheel selection, crossover prob. = 0.8, mutation prob. = 0.1
g2<- ga(type = "binary", 
        population = initial_population, 
        fitness = DBI2, 
        selection = ga_rwSelection,
        crossover = pairwise_crossover,
        mutation = my_mutation,
        pcrossover = 0.8,
        pmutation = 0.01,
        popSize = 100, 
        nBits = nrow(dataset2))    

g2@solution # there are 6 solution genes
    x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37
[1,]  0  0  0  0  0  0  1  0  1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[2,]  0  0  0  0  0  0  1  0  0   0   1   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[3,]  0  0  0  0  0  0  1  0  0   0   1   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[4,]  0  0  0  0  0  0  1  1  0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[5,]  0  0  0  1  0  0  0  0  0   0   1   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[6,]  0  0  0  1  0  0  0  0  0   0   1   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58 x59 x60 x61 x62 x63 x64 x65 x66 x67 x68 x69 x70 x71 x72
[1,]   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[2,]   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[3,]   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[4,]   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[5,]   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[6,]   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     x73 x74 x75
[1,]   0   0   0
[2,]   0   0   0
[3,]   0   0   0
[4,]   0   0   0
[5,]   0   0   0
[6,]   0   0   0

rowSums(g2@solution) # all of them have 4 bits on
#[1] 4 4 4 4 4 4

[EDIT2]

  1. Actually, the crossover strategy guarantees that there is no extra bit turned on combining the parents, i.e., total number of 1 bits in the children = total number of 1 bits in the parents, but any individual child can have more bits turned on. Example where an individual offspring can have more bits turned on than any of the parents is shown below:

enter image description here

Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63
  • 1
    thank you sir for pointing out my mistake! Yes, I do agree on your point on the `crossover` operator, my intention is to test different initial centroids for the clustering algorithm. And regarding `mutation`, I have tried with `k_max = 3` and `pmutation = 0.01`, while checking the best solution `g2@solution`, I found out that there are 12 bits (clusters) being turned on. Why is it so? Would it be somehow related to point #3 that you pointed out? – jacky_learns_to_code Feb 23 '17 at 08:53
  • 1
    @jacky_learns_to_code Yes it may be due to that, added some reproducible code, where the final solutions contain only 4 bits set. – Sandipan Dey Feb 23 '17 at 08:57
  • hi Sir, it is true that the `crossover` guarantees the offspring to have at most the maximum number of 1 bits of its parents. But my concern is, after checking `rowSums(g2@solution)`, with mutation turned off (`pmutation = 0`), I noticed the number of 1 bits is always greater than the max. number of 1 bits of its parent, why is it the case? – jacky_learns_to_code Feb 24 '17 at 02:31
  • 1
    @ jacky_learns_to_code good catch. Actually, the `crossover` strategy guarantees that there is no extra bit turned on combining the parents, i.e., total number of 1 bits in the children = total number of 1 bits in the parents, but any individual child can have more bits turned on, added an example, check it out. – Sandipan Dey Feb 24 '17 at 05:29
  • 1
    totally make sense with the example, thanks Sir! At the end, the `crossover`'s behavior heavily depend on the random indices – jacky_learns_to_code Feb 24 '17 at 05:53
  • the solution does not work on all datasets, I have tried with `iris` dataset and went into an error, added an example. – jacky_learns_to_code Feb 26 '17 at 02:44