1

So I have two shapefile layers of U.S. Census tracts with data I need to merge, but unfortunately one of the files uses a peculiar ID system that does not match up with the other file. I would like to merge these data into one layer, but I'm struggling to figure out how to do this in R (in QGIS, I can do this with the "Data Management Tools > Join Attributes by Location" function. Below is some code used to create two polygon layers, now I'd like to determine how to combine these two layers into one which retains all underlying data from both layers.

library(sp)

# Create FIRST polygon layer
square <- rbind(c(255842.4, 4111578, 255862.4, 4111578, 255862.4, 4111558, 
                  255842.4, 4111558, 255842.4, 4111578, 255842.4, 4111578),
                c(257397.0, 4111309, 257417.0, 4111309, 257417.0, 4111289, 
                  257397.0, 4111289, 257397.0, 4111309, 257397.0, 4111309))

ID <- c("C1", "C2")
people <- c(5000, 3000)

polys <- SpatialPolygons(list(
  Polygons(list(Polygon(matrix(square[1, ], ncol=2, byrow=TRUE))), ID[1]),
  Polygons(list(Polygon(matrix(square[2, ], ncol=2, byrow=TRUE))), ID[2])
))

# Create a dataframe and display default rownames
( p.df <- data.frame( ID=1:length(polys)) ) 
rownames(p.df)

# Try to coerce to SpatialPolygonsDataFrame (will throw error)
polys <- SpatialPolygonsDataFrame(polys, p.df) 

# Extract polygon ID's
( pid <- sapply(slot(polys, "polygons"), function(x) slot(x, "ID")) )

# Create dataframe with correct rownames
( p.df <- data.frame( ID=1:length(polys), row.names = pid) )    

# Try coersion again and check class
p <- SpatialPolygonsDataFrame(polys, p.df)
class(p) 

# Now we can add a column
p@data$people <- c(7500,3000) 


##################### NOW, creating the SECOND polygon layer
square2 <- rbind(c(255842.4, 4111578, 255862.4, 4111578, 255862.4, 4111558, 
                  255842.4, 4111558, 255842.4, 4111578, 255842.4, 4111578),
                c(257397.0, 4111309, 257417.0, 4111309, 257417.0, 4111289, 
                  257397.0, 4111289, 257397.0, 4111309, 257397.0, 4111309))

ID2 <- c("30067893", "30794385")

polys2 <- SpatialPolygons(list(
  Polygons(list(Polygon(matrix(square[1, ], ncol=2, byrow=TRUE))), ID2[1]),
  Polygons(list(Polygon(matrix(square[2, ], ncol=2, byrow=TRUE))), ID2[2])
))

# Create a dataframe and display default rownames
( p2.df <- data.frame( ID=1:length(polys2)) ) 
rownames(p2.df)

# Try to coerce to SpatialPolygonsDataFrame (will throw error)
polys2 <- SpatialPolygonsDataFrame(polys2, p2.df) 

# Extract polygon ID's
( pid2 <- sapply(slot(polys2, "polygons"), function(x) slot(x, "ID")) )

# Create dataframe with correct rownames
( p2.df <- data.frame( ID=1:length(polys2), row.names = pid2) )    

# Try coersion again and check class
p2 <- SpatialPolygonsDataFrame(polys2, p2.df)
class(p2) 

# Now we can add a column
p2@data$homes <- c(500,250) 

Now, I'd like to merge the data together to have one SpatialPolygonsDataFrame which has both "homes" and "people" as columns. This post is asking something similar, however, imagine that the ID column does not match up between the two layers. Anyone have a good notion for how I could go about doing that?

JohnnyJohnson
  • 51
  • 1
  • 7

1 Answers1

2

Let's first use a simpler way to create some example data. Here with "terra"

library(terra)
sq1a <- matrix(c(255842.4, 4111578, 255862.4, 4111578, 255862.4, 4111558, 
              255842.4, 4111558, 255842.4, 4111578), ncol=2, byrow=TRUE)
sq1b <- matrix(c(257397.0, 4111309, 257417.0, 4111309, 257417.0, 4111289, 
              257397.0, 4111289, 257397.0, 4111309), ncol=2, byrow=TRUE)
square <- rbind(cbind(1,1,sq1a,0), cbind(2,1,sq1b,0))
p1 <- vect(square, type="polygons", atts=data.frame(ID=1:2, people=c(3000,7500)))

p2 <- p1
values(p2) <- data.frame(ID=1:2, homes=c(250,500))

Since the IDs match, you can do

m <- merge(p1, values(p2), by="ID")
m
# class       : SpatVector 
# geometry    : polygons 
# dimensions  : 2, 3  (geometries, attributes)
# extent      : 255842.4, 257417, 4111289, 4111578  (xmin, xmax, ymin, ymax)
# coord. ref. :  
# names       :    ID people homes
# type        : <int>  <num> <num>
# values      :     1   3000   250
#                   2   7500   500

Now, we make the IDs non-matching

 values(p2) <- data.frame(ID=c("01","02"), homes=c(250,500))

The obvious solution should be to fix the IDs, but if that is not possible you could try

i <- intersect(p1, p2)
i
# class       : SpatVector 
# geometry    : polygons 
# dimensions  : 2, 4  (geometries, attributes)
# extent      : 255842.4, 257417, 4111289, 4111578  (xmin, xmax, ymin, ymax)
# coord. ref. :  
# names       :    ID people    ID homes
# type        : <int>  <num> <chr> <num>
# values      :     1   3000    01   250
#                   2   7500    02   500

That works well for the example data, but the problem with this approach is that if the polygons are not exactly the same you can end up with a lot of cleaning work to get rid of small polygons that you do not want.

Below is a computationally much cheaper solution that does not have this problem.

e <- extract(p2, centroids(p1, inside=TRUE))
p <- cbind(p1, e[,-1])
p
# class       : SpatVector 
# geometry    : polygons 
# dimensions  : 2, 4  (geometries, attributes)
# extent      : 255842.4, 257417, 4111289, 4111578  (xmin, xmax, ymin, ymax)
# coord. ref. :  
# names       :    ID people  ID.1 homes
# type        : <int>  <num> <chr> <num>
# values      :     1   3000    01   250
#                   2   7500    02   500

If there is no a one-to-one relationship like in this example, then you would need to use merge instead of cbind

x <- p1
x$id.y <- e$id.y
x <- merge(x, e, by="id.y")
x$id.y <- NULL

m is a SpatVector. You can coerce it to an sf object with sf::st_as_sf(m) or to a Spatial object with library(raster); as(m, "Spatial")

Robert Hijmans
  • 40,301
  • 4
  • 55
  • 63
  • Currently trying your method with my data, two large files though so it's taking a while. I apologize for this, but I realized this afternoon that I suggested the incorrect QGIS function I was looking to replicate, I actually meant "Join Attributes by Location" (I edited original post). Does that change your suggestion of the "intersect()" function? – JohnnyJohnson Jul 04 '22 at 02:35
  • I have added a simpler approach – Robert Hijmans Jul 04 '22 at 16:06
  • It looks like I can this example to work using our toy data we created, but when I try and extend this analysis to the Census tract data I'm working with, I'm left with a data frame that has one row, the columns of the 'p2' data set, and all data is 'NA'. Any thoughts why this could be? I've loaded the data into QGIS and I'm confident the polygons of these layers overlap for the most part. Both vector layers are being stored as SpatVectors prior to running the extract command. – JohnnyJohnson Jul 05 '22 at 20:36
  • Perhaps they likely have a different coordinate reference system? E.g. one has longlat, and the other has ... – Robert Hijmans Jul 05 '22 at 23:32
  • That was it! Corrected the CRS and works perfectly now, thanks for your help. – JohnnyJohnson Jul 06 '22 at 05:20