0

I'm sure this question has been answered elsewhere, but I have not been able to come up with it by searching.

I have points representing cities within a country along with population for each city. I also have a polygon file of counties. I want to find the location of the largest city within each county.

How can this be done?

Here is some data

structure(list(Country = c("us", "us", "us", "us", "us", "us", "us", "us", "us", "us", "us",
"us", "us", "us", "us", "us", "us", "us", "us", "us", "us", "us", "us", "us", "us"), City = c("cabarrus", "cox store", "cal-vel", "briarwood townhouses", "barker heights", "davie
crossroads", "crab point village", "azalea", "chesterfield", "charlesmont", "connor", "clover garden", "corriher heights", "callisons", "crestview acres", "clegg", "canaan park", "chantilly", "belgrade", "brices crossroads", "bluff", "butner", "bottom", "bandy", "bostian heights"), AccentCity = c("Cabarrus", "Cox Store", "Cal-Vel", "Briarwood Townhouses", "Barker Heights", "Davie Crossroads", "Crab Point Village", "Azalea", "Chesterfield", "Charlesmont", "Connor", "Clover Garden", "Corriher Heights", "Callisons", "Crestview Acres", "Clegg", "Canaan Park", "Chantilly", "Belgrade", "Brices Crossroads", "Bluff", "Butner", "Bottom", "Bandy", "Bostian Heights"), Region = c("NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC"), Population = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, A_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), Latitude = (35.2369444, 35.275, 36.4291667, 35.295, 35.3111111, 35.8319444, 34.7602778, 35.58, 35.81, 5.9341667, 35.7419444, 36.1883333, 35.5605556, 35.0841667, 35.0213889, 35.8655556, 36.2761111, 36.3016667, 34.88, 34.8186111, 35.8377778, 36.1319444, 36.4747222, 35.6419444, 35.7544444), Longitude = c(-80.5419444, -82.0352778, -78.9694444, -81.5238889, -82.4441667, -80.535, -76.7305556, -82.4713889, -81.6611111, -81.5127778, -78.1486111, -79.4630556, -80.635, -76.7255556, -80.5427778, -78.8497222, -79.7852778, -76.1711111, -77.2352778, -78.1016667, -82.8580556, -78.7569444, -80.7741667, -81.09, -80.9294444)), .Names = c("Country", "City", "AccentCity", "Region", "Population", "Latitude", "Longitude"), row.names = c(544L, 889L, 551L, 434L, 190L, 975L, 894L, 147L, 717L, 700L, 831L, 773L, 862L, 559L, 915L, 753L, 584L, 695L, 262L, 437L, 372L, 537L, 406L, 178L, 02L), class = "data.frame")

and some code to read in north carolina

xx <- readShapePoly(system.file("shapes/sids.shp", package="maptools")[1],
                IDvar="FIPSNO", proj4string=CRS("+proj=longlat +ellps=clrk66"))

plot(xx)

I want to find the city with the maximum population within each county. i'm sorry I don't have a reproducible example. If I did, I would have the answer!

Pete
  • 321
  • 4
  • 16
  • It could be best answered if you provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) showing what you're attempting to accomplish – Rich Scriven Nov 22 '14 at 19:37
  • Post the output of `dput(data)` where `data` is your "some data". – jlhoward Nov 22 '14 at 20:06
  • Sorry, thanks for the info. I had no idea how to put data in the question. – Pete Nov 22 '14 at 20:09
  • Yes but you haven't done it. All I did was put the data you posted into a code block. That's not enough. Type `dput(data)` and copy/paste the result into the question. – jlhoward Nov 22 '14 at 20:10
  • I don't understand: in the data you provided, the `Population` is all `NA`?? BTW: You nave to address the comment to me (using @jlhoward), or I will not get a notification. – jlhoward Nov 24 '14 at 00:01
  • I agree with @jlhoward, that it is nonsense that all the Population entries are `NA`. Also, please change the one occurrence of `A_integer_` to `NA_integer_` in case somebody tries to read in the data. – Ege Rubak Nov 24 '14 at 07:30
  • @jlhoward Argh. I made a subset of the data and did not realize that there were missing populations. Again, my apologies. – Pete Nov 24 '14 at 17:44

1 Answers1

0

The short answer is that you should use gContains(...) in package rgeos.

Here is the long answer.

In the code below, we grab a high resolution shapefile of North Carolina counties from the GADM database, and a geocoded dataset of North Carolina cities from from the US Geological Survey database. The latter already has county information but we ignore that. Then we map cities to their appropriate county using gContains(...), add that information to the cities data frame, and identify the largest city in each county using the data.table package. Most of the work is in 4 lines of code near the end.

library(raster)   # for getData(...);   you may not need this
library(foreign)  # for read.dbf(...);  you may not need this
library(rgeos)    # for gContains(...); loads package sp as well

setwd("< directory for downloaded data >")
# get North Carolina Counties shapefile from GADM database
USA         <- getData("GADM",country="USA",level=2)   # level 2 is counties
NC.counties <- USA[USA$NAME_1=="North Carolina",]      # North Carolina Counties
# get North Carolina Cities data from USGS database
url <- "http://dds.cr.usgs.gov/pub/data/nationalatlas/citiesx010g_shp_nt00962.tar.gz"
download.file(url,"cities.tar.gz")
untar("cities.tar.gz")
data      <- read.dbf("citiesx010g.dbf",as.is=TRUE)
NC.data   <- data[data$STATE=="NC",c("NAME","COUNTY","LATITUDE","LONGITUDE","POP_2010")]
## --- evverything up to here is just to set up the example

# convert cities data.frame to SpatialPointsDataFrame
NC.cities <- SpatialPointsDataFrame(NC.data[,c("LONGITUDE","LATITUDE")],
                                    data=NC.data,
                                    proj4string=CRS(proj4string(NC.counties)))
# map cities to counties
city.cnty   <- gContains(NC.counties,NC.cities,byid=TRUE)
# add county information to cities data
NC.data$county <- apply(city.cnty,1,function(cnty)ifelse(any(cnty),NC.counties@data[cnty,]$NAME_2,NA))
# identify largest city in each county
library(data.table)
result <- setDT(NC.data)[,.SD[which.max(POP_2010)],by="county"]
head(result)
#      county             NAME   COUNTY LATITUDE LONGITUDE POP_2010
# 1:  Jackson        Cullowhee  Jackson 35.31371 -83.17653     6228
# 2:   Graham     Robbinsville   Graham 35.32287 -83.80740      620
# 3:   Wilkes North Wilkesboro   Wilkes 36.15847 -81.14758     4245
# 4:    Rowan        Salisbury    Rowan 35.67097 -80.47423    33662
# 5: Buncombe        Asheville Buncombe 35.60095 -82.55402    83393
# 6:    Wayne        Goldsboro    Wayne 35.38488 -77.99277    36437

The workhorse here is the line:

city.cnty   <- gContains(NC.counties,NC.cities,byid=TRUE)

This compares every point in the SpatialPointsDataFrame NC.Cities to every Polygon in the SpatialPolygonsDataFrame NC.counties and returns a logical matrix where tthe rows represent cities and the columns represent counties, and the [i,j] element is TRUE if city i is in county j, FALSE otherwise. We process the matrix row-wise in the next statement:

NC.data$county <- apply(city.cnty,1,function(cnty)ifelse(any(cnty),NC.counties@data[cnty,]$NAME_2,NA))

using each row in succession to index the attributes table for NC.counties to extract the county name.

The data you provided in your question has some problems which are nevertheless instructive. First, the NC shapefile in the maptools package is relatively low resolution. In particular this means that some of the coastal islands are completely missing, so any city on one of those islands will not map to a county. You might have the same problem with your real data so watch out for it.

Second, comparing the COUNTY column in the original USGS dataset with the county column which we added, there are 3 (out of 865) counties that do not agree. It turns out that, in those cases, the USGS database was wrong (or out of date). You might have the same problem so watch out for that too.

Third, an additional three cities did not map to any county. These were all coastal cities and probably reflect small inaccuracies in the North Carolina shapefile. You night have this problem as well.

jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • Yes, this gives me a result for finding the largest city in a data frame where the county is noted. My particular application however assumes that the county is NOT known and can only be determined from the polygon data set. – Pete Nov 24 '14 at 17:48
  • So, what I would have is a coordinate file of points and populations and a shapefile of polyons. I want the locations of the points with the maximum population in each polygon. I tried using over(pts, pol) from the sp package, but it only gives me the points which are located withing the entire file. It doesn't tell me what points are in what counties. Is there a way to do this? Thank you for taking the time. I appreciate it. (In my case, I have cities in districts within African countries, but I used the NC as an easy to understand example). – Pete Nov 24 '14 at 17:53
  • Upload your real data somewhere (both the coordinates file and the full shapefile, which has multiple files), and post a link. – jlhoward Nov 24 '14 at 17:57
  • OK. So I completely reworked the answer to demonstrate how you would map cities to counties using the city coordinates and a county shapefile. – jlhoward Nov 26 '14 at 20:26