
I am a beginner with R, but an expert using Esri's ArcGIS.

I would like to use R to run exploratory grouping/cluster analyses like the Grouping Analysis tool in ArcGIS 10.1.

The end product must be a map visualization. I found a thread on hierarchical cluster analysis here on SO. Is this the same type of data grouping analysis as Esri's? The ArcGIS tool offers a lot of flexibility in its parameters, and I hope to replicate this functionality in R.

Again, I am an R beginner. Any info, suggestions, or advice is much appreciated.

thanks, mike

mikeLdub
  • Hi Mike, welcome to SO. I think you will need to provide some more detail. You stand a much greater chance of someone helping you if you show what you have *tried* to do already, i.e. post some code, and a greater chance still if you make your example [reproducible](http://stackoverflow.com/q/5963269/1478381). Questions that could loosely be phrased as *how do I write some code to do this* tend not to do so well here. Just a heads up. – Simon O'Hanlon Mar 12 '13 at 15:55
  • `When NO_SPATIAL_CONSTRAINT is specified, the Grouping Analysis tool uses a K Means algorithm.` From the ArcGIS Help. So take a look at k-means clustering. – EDi Mar 12 '13 at 16:04
  • Thanks for the heads up Simon. I am currently working on the problem at hand and will try to post something reproducible soon. But since I'm such an R novice, I was hoping to get some kind of validation that the cluster analysis might be able to reproduce the GIS I'm trying to mimic. – mikeLdub Mar 12 '13 at 16:05
  • Thanks EDi! I'll dive in with k-means-clustering. – mikeLdub Mar 12 '13 at 16:06

1 Answer


As best I can tell, this is a simple KNN analysis. The alternative "no distance matrix" component that the ESRI help describes seems quite undesirable: basically, they are using K-means clustering with a region-growing approach and random seeding. This seems very unstable and could return highly variable results. They also appear to do a bit of maneuvering to avoid issues such as disconnected regions, so it may take some doing to recreate their results exactly.

You can approximate the "spatially constrained" option in spdep. Here is a brief example of a distance analysis that will give you a starting point. Keep in mind that in order to assign "classes" you will need to set up some type of looping structure.

require(sp)
require(spdep)

data(meuse) 
coordinates(meuse) <- ~x+y

# Build a neighbour list: for each point, all other points within 0.0001-1000 map units
meuse.dist <- dnearneigh(coordinates(meuse), 0.0001, 1000)

# Get, for each observation, the vector of distances to its neighbours (a list)
dist.list <- nbdists(meuse.dist, coordinates(meuse))

# Add a column holding the distance from each observation to its nearest neighbour
meuse@data <- data.frame(meuse@data, NNDist=sapply(dist.list, min))

# Plot results
spplot(meuse, "NNDist", col.regions=colorRampPalette(c("blue","yellow","red"),
       interpolate="spline")(10)  )
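
One alternative to an explicit loop for assigning "classes" is to bin the nearest-neighbour distances with `cut()`. The following is only a minimal sketch: it computes the distances with base R's `dist()` rather than spdep (the result should match `NNDist` as long as every point has a neighbour inside the 1000-unit range), and the column name `NNClass`, the three equal-width bins, and the labels are illustrative assumptions, not part of the ArcGIS tool.

```r
library(sp)

data(meuse)
coordinates(meuse) <- ~x+y

# Nearest-neighbour distance for every point, taken from the full distance matrix
d <- as.matrix(dist(coordinates(meuse)))
diag(d) <- Inf                 # ignore each point's zero distance to itself
nn.dist <- apply(d, 1, min)

# Bin the distances into three equal-width classes (arbitrary, illustrative choice)
meuse$NNClass <- cut(nn.dist, breaks = 3, labels = c("near", "mid", "far"))

# Map the classes
spplot(meuse, "NNClass", main = "Nearest-neighbour distance class")
```

Quantile breaks or breaks chosen from domain knowledge could be passed to `breaks` instead if equal-width bins are not meaningful for your data.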

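You may also want to explore hierarchical clustering. Note, however, that hclust needs a triangular distance matrix, which becomes a problem for larger data sets, whereas dnearneigh does not. Here is an example using constrained hierarchical clustering. Be aware that chclust only merges samples that are adjacent in row order, so this is an approximation of a spatial constraint rather than a true contiguity constraint.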

# SPATIALLY CONSTRAINED CLUSTERING
require(sp)
require(rioja)

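data(meuse)
coordinates(meuse) <- ~x+y

# Coordinates as a data frame, keeping the original row names
cdat <- data.frame(x=coordinates(meuse)[,1], y=coordinates(meuse)[,2])
rownames(cdat) <- rownames(meuse@data)

# Constrained single-linkage clustering on the coordinate distances
chc <- chclust(dist(cdat), method="conslink")

# Cut the tree into k = 3 groups
chc.n3 <- cutree(chc, k=3)

# Cut the tree at a height (distance) of 200
chc.d200 <- cutree(chc, h=200)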

meuse@data <- data.frame(meuse@data, KNN=as.factor(chc.n3), DClust=chc.d200)

opar <- par(no.readonly=TRUE)   # save the current graphics settings
par(mfcol=c(1,2))

# Assign one colour per group from the k = 3 cut
cols <- topo.colors(length(unique(meuse@data$KNN)))
color <- rep("xx", nrow(meuse@data))
for(i in seq_along(unique(meuse@data$KNN))) {
  v <- unique(meuse@data$KNN)[i]
  color[meuse@data$KNN == v] <- cols[i]
}
plot(meuse, col=color, pch=19, main="KNN Clustering")

# Assign one colour per group from the h = 200 cut
cols <- topo.colors(length(unique(meuse@data$DClust)))
color <- rep("xx", nrow(meuse@data))
for(i in seq_along(unique(meuse@data$DClust))) {
  v <- unique(meuse@data$DClust)[i]
  color[meuse@data$DClust == v] <- cols[i]
}
plot(meuse, col=color, pch=19, main="Distance Clustering")

par(opar)   # restore the saved graphics settings
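
For the NO_SPATIAL_CONSTRAINT case, which the ArcGIS help quoted in the comments describes as K-means, base R's `kmeans()` is a reasonable starting point. This is only a sketch under assumptions: the choice of the four metal concentrations as analysis fields, the log transform, `k = 3`, the seed, and the column name `KMeans` are all illustrative, and nothing here is claimed to reproduce Esri's implementation.

```r
library(sp)

data(meuse)

# Analysis fields (assumed): log metal concentrations, standardised so that
# no single variable dominates the Euclidean distances used by k-means
vars <- scale(log(meuse[, c("cadmium", "copper", "lead", "zinc")]))

# k-means starts from random centres, so fix the seed and use several
# random starts to get a more stable solution
set.seed(42)
km <- kmeans(vars, centers = 3, nstart = 25)

# Attach the group labels and map them
meuse$KMeans <- factor(km$cluster)
coordinates(meuse) <- ~x+y
spplot(meuse, "KMeans", main = "K-means groups (k = 3, attributes only)")
```

Because the groups are formed from attribute values alone, neighbouring points can end up in different groups; that is the expected behaviour without a spatial constraint.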
Jeffrey Evans
  • this is super helpful, thanks for the help Jeff. I'll get back to you and let you know how it goes. – mikeLdub Mar 12 '13 at 18:48
  • @mikeLdub, I would not use what ESRI provides as a benchmark. I have been very unsatisfied with the "Spatial Statistics" toolbox. Many of the methodological implementations are less than optimal or downright flawed. Looking towards R for this type of analysis is a very good idea. – Jeffrey Evans Mar 12 '13 at 18:52