
I need to calculate the centroids of a set of spatial zones based on a separate population grid dataset. Grateful for a steer on how to achieve this for the example below.

Thanks in advance.

require(raster)
require(spdep)
require(maptools)

dat <- raster(volcano)   # simulated population data
polys <- readShapePoly(system.file("etc/shapes/columbus.shp",package="spdep")[1])

# set consistent coordinate ref. systems and bounding boxes
proj4string(dat) <- proj4string(polys) <- CRS("+proj=longlat +datum=NAD27")
extent(dat) <- extent(polys)

# illustration plot
plot(dat, asp = TRUE)
plot(polys, add = TRUE)

[Figure: the volcano raster plotted with the columbus polygons overlaid]

geotheory
  • So for each polygon you want the coordinate that is the weighted (by the raster cell) average location of the cells in that polygon, yup? The centre of mass, where the raster is the mass? – Spacedman May 12 '14 at 16:11
  • Yes, basically the population weighted centroid of each polygon, which I guess would be the 2d equivalent of a solid object's centre of mass. The [closest google match](https://stat.ethz.ch/pipermail/r-sig-geo/2012-March/014426.html) doesn't resolve it. I'd presume a function already exists for this in some package. – geotheory May 12 '14 at 16:26
  • 1
    Great reproducible example, by the way. I'll try to remember to link to this in the future as an example of what those look like! – Josh O'Brien May 13 '14 at 15:25
  • Thanks for the feedback Josh. It's also a useful process to go through in itself - I often solve problems I plan to post in SO while trying to reproduce them as toy models. – geotheory May 13 '14 at 15:44

4 Answers


Three steps:

First, find all the cells in each polygon, returning a list of two-column matrices with the cell number and the value:

require(plyr) # for llply, laply in a bit...
cell_value = extract(dat, polys, cellnumbers = TRUE)
head(cell_value[[1]])
     cell value
[1,]   31   108
[2,]   32   108
[3,]   33   110
[4,]   92   110
[5,]   93   110
[6,]   94   111

Second, turn into a list of similar matrices but add the x and y coords:

cell_value_xy = llply(cell_value, function(x) cbind(x, xyFromCell(dat, x[,"cell"])))
head(cell_value_xy[[1]])
     cell value        x        y
[1,]   31   108 8.581164 14.71973
[2,]   32   108 8.669893 14.71973
[3,]   33   110 8.758623 14.71973
[4,]   92   110 8.581164 14.67428
[5,]   93   110 8.669893 14.67428
[6,]   94   111 8.758623 14.67428

Third, compute the weighted mean coordinate. This neglects any edge effects and assumes all grid cells are the same size:

centr = laply(cell_value_xy, function(m){
  c(weighted.mean(m[,3], m[,2]), weighted.mean(m[,4], m[,2]))
})
head(centr)
            1        2
[1,] 8.816277 14.35309
[2,] 8.327463 14.02354
[3,] 8.993655 13.82518
[4,] 8.467312 13.71929
[5,] 9.011808 13.28719
[6,] 9.745000 13.47444

Now centr is a 2-column matrix. In your example it's very close to coordinates(polys), so I'd make a contrived example with some extreme weights to make sure it's working as expected; one way to run such a check is sketched below.
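A possible check (my sketch, not part of the original answer): flatten the raster, give a single cell an extreme weight, and confirm that only the containing polygon's centroid moves. The point (8.5, 14.7) is an assumed location near the top polygons, chosen only for illustration.

dat2 <- dat
values(dat2) <- 1                            # flat raster: weighted = geometric centroid
dat2[cellFromXY(dat2, c(8.5, 14.7))] <- 1e6  # one extreme weight near the top polygons
cv2 <- extract(dat2, polys, cellnumbers = TRUE)
cv2_xy <- llply(cv2, function(x) cbind(x, xyFromCell(dat2, x[, "cell"])))
centr2 <- laply(cv2_xy, function(m){
  c(weighted.mean(m[, 3], m[, 2]), weighted.mean(m[, 4], m[, 2]))
})
# rows of centr2 should match coordinates(polys) except for the polygon
# containing (8.5, 14.7), whose centroid is pulled towards that cell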

Spacedman
  • Thanks Spaced, but I have my doubts this is working as intended. For example, look at where the points plot (i.e. `points(centr, pch=16, cex=1, col='red')`). If we increase the contrast in the data by using `dat = raster((volcano-92)^1.5/10)` you'd expect to see the centroids much closer to green, but they still sit a bit too centrally in the polygons in my view. Look at the top polygon, for instance. – geotheory May 12 '14 at 17:54
  • @geotheory -- Maybe check a few of the computations "by hand". A quick glance at the range of values within each polygon (in many cases less than +/- 5%) makes *me* suspect that the weighting isn't going to have much effect on the centroid locations. I'd have to dig a little bit deeper to make sure, but Spacedman's code sure looks good to me, and anyway, that's your job, right? ;) (Also, of course, +1). – Josh O'Brien May 12 '14 at 18:02
  • To test, first use a flat raster to see if my points equal the centroids (mod some edge effects). Then make a polygon which is south-half 0 and north-half 1 raster cells, and check my centroid is in the north-half... – Spacedman May 12 '14 at 18:16
  • Yes, checking computations it seems I was guesstimating badly - perhaps different data would make a better example. Both your methods give precisely the same points, for which I am most grateful. I like the elegance of rasterize/zonal, but Spaced probably wins for 'Ronseal' clarity. – geotheory May 12 '14 at 19:41

Another alternative.

I like it for its compactness, but it will likely only make sense if you're fairly familiar with the full family of raster functions:

## Convert polygons to a raster layer
z <- rasterize(polys, dat)

## Compute weighted x and y coordinates within each rasterized region
xx <- zonal(init(dat, v="x") * dat, z) / zonal(dat, z)
yy <- zonal(init(dat, v="y") * dat, z) / zonal(dat, z)

## Combine results in a matrix
res <- cbind(xx[,2], yy[,2])
head(res)
#          [,1]     [,2]
# [1,] 8.816277 14.35309
# [2,] 8.327463 14.02354
# [3,] 8.993655 13.82518
# [4,] 8.467312 13.71929
# [5,] 9.011808 13.28719
# [6,] 9.745000 13.47444
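The trick here (my gloss, not part of the original answer): `init(dat, v="x")` fills a raster with each cell's own x-coordinate, and `zonal()` defaults to the mean per zone, so each ratio is mean(x·w)/mean(w) = sum(x·w)/sum(w), i.e. the weighted mean coordinate within each rasterized polygon. A quick check against the first answer's result (assumes `centr` is still in scope):

# both methods compute the same weighted means, so this should be TRUE
all.equal(unname(res), unname(centr))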
Josh O'Brien
  • 2
    Increasing familiarity with the `raster` package functions can only be a good thing. I looked at the list of functions in `raster` and didn't see anything obvious - we need a better cheatsheet! – Spacedman May 13 '14 at 07:56
  • @Spacedman -- Agreed. I only discovered `init` when I came across it in Etienne Racine's [Visual Raster Cheat Sheet](http://rpubs.com/etiennebr/visualraster). It (the cheat sheet) is nifty, but it still misses a lot of the good stuff that's in that package --- especially its suite of functions for dealing with spatial vector data types. I'm guessing very few folks know about its `aggregate()`, `cover()`, `bind()`, `symdif()`, `union()`, `intersect()`, and `erase()` functions. (Those last three can alternatively be called via overloaded `"+"`, `"*"`, and `"-"` operators, which is kind of fun.) – Josh O'Brien May 13 '14 at 14:46

The answers by Spacedman and Josh are really great, but I'd like to share two other alternatives which are relatively fast and simple.

library(data.table)
library(spatialEco)
library(raster)
library(rgdal)

Using a `data.table` approach:

# get centroids of raster data
  data_points <- rasterToPoints(dat, spatial=TRUE)

# intersect with polygons
  grid_centroids <- point.in.poly(data_points, polys)

# calculate weighted centroids: group the points by POLYID and take the
# weighted mean of the x/y columns, using the raster values ('layer') as weights
  grid_df <- as.data.frame(grid_centroids)  # keep grid_centroids spatial for the next approach
  w.centroids <- setDT(grid_df)[, lapply(.SD, weighted.mean, w=layer), by=POLYID, .SDcols=c('x','y')]

Using wt.centroid{spatialEco}:

  # get a list of the ids from each polygon
    poly_ids <- unique(grid_centroids@data$POLYID)

  # use lapply to calculate the weighted centroid of each individual polygon
    w.centroids.list <- lapply(poly_ids, function(i){
      wt.centroid(subset(grid_centroids, grid_centroids@data$POLYID == i),
                  'layer', sp = TRUE)
    })
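If a single object is more convenient, the per-polygon results can be combined afterwards (my addition; assumes `wt.centroid(..., sp = TRUE)` returns sp point objects, so sp's rbind method applies):

  # combine the list of per-polygon centroids into one spatial object
    w.centroids.sp <- do.call(rbind, w.centroids.list)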
rafa.pereira

My own less elegant solution is below. It gives exactly the same results as Spacedman's and Josh's.

library(dplyr)  # for %>%, mutate, filter, select, bind_rows

# raster to pixels
p = rasterToPoints(dat) %>% as.data.frame()
coordinates(p) = ~ x + y
crs(p) = crs(polys)

# overlay pixels on polygons
ol = over(p, polys) %>% mutate(pop = p$layer) %>% cbind(coordinates(p)) %>% 
  filter(COLUMBUS_ %in% polys$COLUMBUS_) %>%     # i.e. a unique identifier
  dplyr::select(x, y, pop, COLUMBUS_) %>% as_data_frame()

# weighted means of x/y values, by pop
pwcs = split(ol, ol$COLUMBUS_) %>% lapply(function(g){
  data.frame(x = weighted.mean(g$x, g$pop), y = weighted.mean(g$y, g$pop))
}) %>% bind_rows() %>% as_data_frame()
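A visual check along the lines suggested in the comments above (my addition):

# plot weighted centroids over the raster and polygons
plot(dat, asp = TRUE)
plot(polys, add = TRUE)
points(pwcs$x, pwcs$y, pch = 16, col = 'red')      # population-weighted centroids
points(coordinates(polys), pch = 3, col = 'blue')  # geometric centroids, for contrast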
geotheory