I am plotting observations that have two or more modes separated by a large gap, and I would like a plot that skips the gap automatically. A simplified example of the observations:

obs= c(rnorm(100, 0, 1), rnorm(100, sample(c(-1e6, 1e6), 1), 1))

I noticed that gap.plot() from the plotrix package can do something similar, but is there a way to do this in plotly/ggplot without manually specifying the gap range? The position of the gap is random, because the mean of the extreme mode is sampled at random.

Davide Passaretti
Daves

1 Answer

How about just splitting the data set into clusters and plotting each one separately, like this:

obs= c(rnorm(100, 0, 1), rnorm(100, sample(c(-1e6, 1e6), 1), 1))

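histN <- function(x, n){
  # Assign each observation to one of n clusters
  clustRes <- kmeans(x, centers = n)
  # One panel per cluster; restore the previous layout on exit
  op <- par(mfrow = c(1, n))
  on.exit(par(op))
  for(i in seq_len(n)){
    hist(x[clustRes$cluster == i], xlab = "", main = sprintf("subset = %d", i))
  }
}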
dev.new()
histN(obs,2)

Or, as a prettier but less flexible alternative:

hist2 <- function(x){
  clustRes <- kmeans(x,centers = 2)
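  # Save only the settable graphics parameters and restore them on exit
  parDefault <- par(no.readonly = TRUE)
  on.exit(par(parDefault))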
  layout(matrix(c(1,1,1,2,3,3,3),nrow=1))
  idx1 <- which.min(clustRes$centers)
  idx2 <- which.max(clustRes$centers)
  h1 <- hist(x[clustRes$cluster==idx1],plot=FALSE)
  h2 <- hist(x[clustRes$cluster==idx2],plot=FALSE)
  yRange <- c(0, max(c(h1$counts,h2$counts)))  # bars start at 0, so the shared y-axis should too
  par(mai = c(parDefault$mai[1:3],0))
  par(cex = parDefault$cex)
  hist(x[clustRes$cluster==idx1],ylim= yRange, xlab="", ylab ="", main = sprintf("subset = %d",1),axes=FALSE)
  axis(1)
  axis(2)
  par(mai = c(parDefault$mai[1],0,parDefault$mai[3],0))
  par(cex = 3)
  plot(1:10,type="n", axes=FALSE, xlab="...", ylab ="")
  text(x= 5, y= 5, "...")
  axis(1, tick=FALSE,labels = FALSE)
  par(mai = c(parDefault$mai[1],0,parDefault$mai[3:4]))
  par(cex = parDefault$cex)
  hist(x[clustRes$cluster==idx2],ylim= yRange, xlab="", ylab ="", main = sprintf("subset = %d",2),axes=FALSE)
  axis(1)
  axis(4)
}
dev.new()
hist2(obs)
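
Since the question asks about ggplot, here is a rough sketch of the same idea with facets. It assumes the ggplot2 package is installed; the fixed seed, the data frame `df`, and the choice of 20 bins are only there for illustration. Free x scales let each panel zoom in on its own mode, so the gap never has to be specified:

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible
obs <- c(rnorm(100, 0, 1), rnorm(100, sample(c(-1e6, 1e6), 1), 1))

# Cluster into two groups, then relabel so that subset 1 is the lower mode
clustRes <- kmeans(obs, centers = 2)
ord <- unname(rank(clustRes$centers[, 1]))   # rank of each cluster centre
df <- data.frame(
  x = obs,
  subset = factor(ord[clustRes$cluster])
)

# One panel per cluster, each with its own x-axis range
p <- ggplot(df, aes(x = x)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ subset, scales = "free_x")
print(p)
```

The panels share a y-axis by default, which plays the same role as the common `yRange` in `hist2`.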
HolgerBarlt
  • I really like your approach of using unsupervised learning techniques. Really nice thought! If you don't mind, let me make the question harder: what if you don't know the number of modes/clusters in advance? How would you modify it in that case? – Daves Mar 05 '19 at 18:53
  • Finding the right number of clusters is a question of its own and goes far beyond the scope of this question; see e.g. https://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-Clusters . So I guess in this case you should first solve the task of finding the right clusters and then deal with the visualisation in a second step. In my experience, with 5 or more clusters, histograms can get pretty messy... – HolgerBarlt Mar 06 '19 at 12:27
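
For modes as widely separated as the ones in the question, one rough way around fixing the number of clusters is to sort the data and cut wherever two neighbouring values are unusually far apart. This is not a general clustering method, only a sketch: the helper `splitAtGaps` and its threshold `frac` (a fraction of the total data range) are hypothetical and would need tuning for real data.

```r
# Hypothetical helper: split x wherever two neighbouring sorted values are
# further apart than `frac` times the total range of the data.
splitAtGaps <- function(x, frac = 0.1) {
  xs <- sort(x)
  gaps <- diff(xs)
  thr <- frac * diff(range(xs))
  # Group id = 1 + number of large gaps to the left of each sorted value
  grp <- cumsum(c(1, gaps > thr))
  split(xs, grp)
}

set.seed(42)  # assumption: fixed seed, only to make the sketch reproducible
obs3 <- c(rnorm(100, 0, 1), rnorm(100, -1e6, 1), rnorm(100, 1e6, 1))
parts <- splitAtGaps(obs3)

# One histogram per detected group, ordered from lowest to highest mode
op <- par(mfrow = c(1, length(parts)))
for (i in seq_along(parts)) {
  hist(parts[[i]], xlab = "", main = sprintf("subset = %d", i))
}
par(op)
```

The number of panels then follows from the data rather than from an argument, at the cost of having to pick a gap threshold.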