
I'm trying to run k-means on a data frame with 69 columns and 1000 rows. First, I need to choose the optimal number of clusters using the Davies-Bouldin index. The algorithm requires its input as a matrix, so I first converted the data frame:

totalm <- data.matrix(total)

Then I ran the following code (Davies-Bouldin index):

clusternumber <- 0
max_cluster_number <- 30
#Davies Bouldin algorithm
library(clusterCrit)
smallest <- 99999
for(b in 2:max_cluster_number){
    a <- 99999
    for(i in 1:200){
        cl <- kmeans(totalm,b)
        cl <- as.numeric(cl)
        intCriteria(totalm,cl$cluster,c("dav"))
        if(intCriteria(totalm,cl$cluster,c("dav"))$davies_bouldin < a){
            a <- intCriteria(totalm,cl$cluster,c("dav"))$davies_bouldin }
    }
    if(a < smallest){
        smallest <- a
        clusternumber <- b
    }
}
print("##clusternumber##")
print(clusternumber)
print("##smallest##")
print(smallest)

I keep getting this error: (list) object cannot be coerced to type 'double'. How can I solve this?

Reproducible example:

a <- c(0,0,1,0,1,0,0)
b <- c(0,0,1,0,0,0,0)
c <- c(1,1,0,0,0,0,1)
d <- c(1,1,0,0,0,0,0)

total <- cbind(a,b,c,d)

1 Answer


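The error comes from cl<-as.numeric(cl). A call to kmeans returns an object of class "kmeans", which is a list holding information about the fitted model (cluster assignments, centers, sums of squares and so on), and as.numeric cannot coerce such a list to a double vector. Drop that line and use cl$cluster directly.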

Run ?kmeans to see the full list of components.
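
To see why the coercion fails, it can help to inspect the returned object. This is a minimal sketch on made-up data; the matrix x and the seed are illustrative and not taken from the question.

```r
set.seed(42)                        # illustrative seed
x <- matrix(rnorm(40), ncol = 2)    # made-up 20 x 2 double matrix

cl <- kmeans(x, centers = 2)

is.list(cl)    # the fitted object is a list with class "kmeans"
names(cl)      # components such as cluster, centers, tot.withinss

# as.numeric() on the whole list fails; capture the message instead of stopping
msg <- tryCatch(as.numeric(cl), error = function(e) conditionMessage(e))
print(msg)

# The cluster assignments are already a plain vector, so use them directly
labels <- cl$cluster
```

Only cl$cluster is needed for intCriteria, so no conversion of cl itself is required.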

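I would also recommend adding nstart = 20 to your kmeans call. k-means starts from randomly chosen centers, so different runs can give different results. With nstart = 20, the algorithm is run from 20 random starts for each number of centers, and the fit with the lowest total within-cluster sum of squares is kept.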
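
As a rough illustration of what nstart does, the sketch below compares 20 separate single-start runs with one call using nstart = 20. The data, seed and number of centers are made up for the example.

```r
set.seed(1)                          # illustrative seed
x <- matrix(rnorm(200), ncol = 2)    # made-up 100 x 2 matrix

# 20 independent single-start runs, recording each total within-cluster SS
single <- vapply(
    1:20,
    function(i) kmeans(x, centers = 3, nstart = 1)$tot.withinss,
    numeric(1)
)

# One call with nstart = 20 restarts internally and keeps the best fit
multi <- kmeans(x, centers = 3, nstart = 20)

range(single)         # spread of the objective across random starts
multi$tot.withinss    # objective of the solution kept by nstart = 20
```

If the range of single is wide, a single random start is unreliable for that data and a larger nstart is worth the extra time.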

for(b in 2:max_cluster_number){
    a <- 99999
    for(i in 1:200){
        # kmeans returns a list; do not coerce it with as.numeric()
        cl <- kmeans(totalm, centers = b, nstart = 20)
        # compute the Davies-Bouldin index once per fit and reuse it
        db <- intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin
        if(db < a){
            a <- db
        }
    }
    # keep the cluster count with the lowest index seen so far
    if(a < smallest){
        smallest <- a
        clusternumber <- b
    }
}

This gave me

[1] "##clusternumber##"   
[1] 4
[1] "##smallest##"
[1] 0.138675

(temporarily changing max_cluster_number to 4, as the reproducible data is a small set)

EDIT: Integer error

I was able to reproduce your error using

a <- as.integer(c(0,0,1,0,1,0,0))
b <- as.integer(c(0,0,1,0,0,0,0))
c <- as.integer(c(1,1,0,0,0,0,1))
d <- as.integer(c(1,1,0,0,0,0,0))

totalm <- cbind(a,b,c,d)

This creates an integer matrix.

I was then able to remove the error by using

storage.mode(totalm) <- "double"
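
The same conversion can be checked on a small made-up matrix. The sketch below also shows that data.matrix returns an integer matrix when every column of the data frame is integer, which may be what happens with a real dataset; the object names m and df are illustrative.

```r
# Made-up integer matrix standing in for totalm
m <- cbind(a = c(0L, 0L, 1L), b = c(1L, 0L, 0L))
before <- typeof(m)            # "integer"

storage.mode(m) <- "double"    # converts in place, keeps dim and dimnames
after <- typeof(m)             # "double"

# data.matrix() on an all-integer data frame also yields integer storage
df <- data.frame(a = 1:3, b = 4:6)
dm <- data.matrix(df)
typeof(dm)                     # "integer"
storage.mode(dm) <- "double"   # same fix applies
```

Checking typeof(totalm) before calling intCriteria is a quick way to confirm whether this conversion is needed.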

Note that

total <- cbind(a,b,c,d)
totalm <- data.matrix(total)

is unnecessary for the data in this example, because cbind already returns a matrix:

> identical(total,totalm)
[1] TRUE
  • Thanks! There is one more issue when I try this on my real dataset. I get: Error in intCriteria(totalm, cl$cluster, c("dav")) : REAL() can only be applied to a 'numeric', not a 'integer' – cdvnmus Apr 29 '17 at 08:40
  • Can you edit your example to include a sample of your dataset? See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example, e.g. dput(head(totalm,20)) if it's not too much data – Jeremy Voisey Apr 29 '17 at 09:08