I have a mixed-type data set, so I wanted to try KAMILA clustering. Applying it is easy, but I would like a plot, similar to a knee plot, to help decide the number of clusters.

library(kamila)

data <- read.csv("binarymat.csv", header = FALSE, sep = ";")
# Column 9 is the only continuous variable; standardize it
conInd <- c(9)
conVars <- data[, conInd]
conVars <- data.frame(scale(conVars))
# Columns 1-8 are binary; convert them to factors
catVarsFac <- data[, c(1, 2, 3, 4, 5, 6, 7, 8)]
catVarsFac[] <- lapply(catVarsFac, factor)
catVarsDum <- dummyCodeFactorDf(catVarsFac)
kamRes <- kamila(conVars, catVarsFac, numClust = 5, numInit = 10,
            calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
summary(kamRes)

It says that the best number of clusters is 5. How does it decide that and can I see a plot indicating this?

kangaroo_cliff

1 Answer

From the kamila package documentation:

Setting calcNumClust to ’ps’ uses the prediction strength method of Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005). There is no perfect method for estimating the number of clusters; PS tends to give a smaller number than, say, BIC based methods for large sample sizes.

In your call, you have passed only a single value to numClust. So you are not actually selecting the number of clusters; you have already picked one.

To select the number of clusters, you have to specify the range you are interested in, for example numClust = 2:7, along with the method for selecting the number of clusters.

Something like the following might work.

kamRes <- kamila(conVars, catVarsFac, numClust = 2 : 7, numInit = 10, 
          calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)

Information on the selection of the number of clusters is now present in kamRes$nClust, and plot(2:7, kamRes$nClust$psValues) could be what you are after.

kangaroo_cliff
  • Thank you very much Suren. I am trying the code you provided but it gives this error. _"Error in kamila(conVar = conVar[testInd, ], catFactor = catFactor[testInd, : Input datasets must be dataframes"_ – Emrah BILGIC May 25 '18 at 22:25
  • The error says `Input datasets must be dataframes`... Since I do not know what data you are using, it is difficult to say exactly what happens. – kangaroo_cliff May 25 '18 at 22:29
  • Looking at your code, `catVarsFac` could be a list, since you are getting it from an `lapply`. – kangaroo_cliff May 25 '18 at 22:30
  • Thank you I am trying to fix the problem. – Emrah BILGIC May 25 '18 at 22:39
  • I am using a dataset that contains 8 binary variables and just one continuous variable. As you can see in the code, conVars is scaled and is a data frame. I also made catVarsFac a data frame with `catVarsFac[] <- data.frame(lapply(catVarsFac, factor))`, and it still gives the same error. I could not fix it. Note that when I write just `kamRes <- kamila(conVars, catVarsFac, numClust=5,numInit=10)`, it works. – Emrah BILGIC May 25 '18 at 23:27
  • What you are doing doesn't make sense. Try this `catVarsFac <- sapply(catVarsFac, factor); catVarsFac <- as.data.frame(catVarsFac)`. – kangaroo_cliff May 25 '18 at 23:32
  • If that doesn't work, read this on [how to produce a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – kangaroo_cliff May 25 '18 at 23:33
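
A minimal base R sketch for checking the data-frame question raised in the comments. The `toy` data frame and `testInd` are made up for illustration and stand in for the asker's file, which is not available. The error message quotes `conVar[testInd, ]`, and in base R that kind of row subset on a one-column data frame returns a plain vector. Whether this is what trips kamila's input check here is an assumption; the thread does not confirm it.

```r
# Hypothetical toy data standing in for binarymat.csv:
# two binary columns and one continuous column.
toy <- data.frame(
  b1 = c(0, 1, 1, 0, 1, 0),
  b2 = c(1, 1, 0, 0, 1, 0),
  x  = c(2.3, 4.1, 3.3, 5.0, 1.2, 2.8)
)

# Selecting a single column with [ , j] drops to a plain vector ...
conVec <- toy[, 3]
# ... unless drop = FALSE is given, which keeps a one-column data frame.
conVars <- toy[, 3, drop = FALSE]
conVars[] <- lapply(conVars, function(v) as.numeric(scale(v)))

# Assigning into catVarsFac[] keeps the data frame around lapply's list.
catVarsFac <- toy[, 1:2]
catVarsFac[] <- lapply(catVarsFac, factor)

# Row-subsetting a one-column data frame also drops to a vector by default.
testInd <- 1:3
rowSubset <- conVars[testInd, ]
rowSubsetKept <- conVars[testInd, , drop = FALSE]

# Report which objects are still data frames.
print(c(
  conVec        = is.data.frame(conVec),
  conVars       = is.data.frame(conVars),
  catVarsFac    = is.data.frame(catVarsFac),
  rowSubset     = is.data.frame(rowSubset),
  rowSubsetKept = is.data.frame(rowSubsetKept)
))
```

Running `is.data.frame()` on `conVars` and `catVarsFac` right before the `kamila()` call is a quick way to rule out the list explanation; the `lapply` assignment with `catVarsFac[]` keeps the data frame.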