I have used ddply and gDistance (package: rgeos) to link two sets of polygons. The process is, however, very slow for my large dataset, as explained in a number of blogs (e.g. Why is plyr so slow?).
These blogs suggest that using data.table would be faster, though I cannot work out how to make it work for my case. In particular, how do I apply gDistance for only a subset of my habitat clearance data, within data.table...?
Below I explain in more detail what I am trying to do and attach a small subset of my data for testing code (here: https://www.dropbox.com/sh/9onaeltb81qrd7h/AAAdED0KS6n2EzP74F3o81Lxa?dl=0).
I have two shapefiles containing polygons (one of patches of habitat clearance and the other properties punished for clearing habitat), where I want to identify the polygon in the first layer (habitat clearance) which is nearest to each polygon in the second layer (punished properties), since the two do not always overlap exactly. Plus, there is one additional constraint; the matched habitat clearance polygon cannot be more than 5 years older than the environmental crime.
I have successfully used the following ddply code - any advice on how to do this with data.table, which I think will be faster...?
Thanks!
# # # # # # # # # # # # # # # # # # # Load [R] GIS packages. # # # # # # # # # # # # # # # # # #
library(rgeos)
library(raster)
library(rgdal)
library(plyr)
# # # # # # # # # # # # # # # # # # # ENTER shapefile information HERE # # # # # # # # # # # # # # # # # # # # # #
# What is the name of the punishments shapefile?
punishments <- "punishments_stack_overflow"
# What is the name of the PUNISHMENTs directory?
myDrctry <- "E:/Esri_arcGIS_datasets/SM_data/IBAMA_embargo/final_embargo_list/near_chopped/stack_overflow" # !CHANGE ME - where the shapefiles are stored
# What is the name of the hab_cl shapefile?
hab_cl_shp <- "RO_SimU_deforestation_Amazonia_SIRGAS_near"
# What is the name of the hab_cl data directory?
my_hab_cl_Drctry <- "E:/Esri_arcGIS_datasets/SM_data/PRODES/Deforestation_per_SimU/near_analysis" #! CHANGE ME
# # # # # # # # # # # # # # # # # # # # Load the shapefiles # # # # # # # # # # # # # # # # # # # # # #
# Read in the embargo shapefile
punishments_need_near <- readOGR(dsn=myDrctry, layer=punishments)
# Identify the attributes to keep
myattributes <- c("numero_tad", "data_tad", "CD_BIOMA")
# Subset the full dataset extracting only the desired attributes
punishments_need_near@data <- punishments_need_near@data[,myattributes]
# Load the deforestation data
hab_cl_patches_near <- readOGR(dsn=my_hab_cl_Drctry, layer=hab_cl_shp)
proj4string(hab_cl_patches_near) # check the projection (which is SIRGAS 2000 UTM)
#hab_cl_patches_near@data <- hab_cl_patches_near@data[,c("year","LAPIG_ID")] # manipulate the columns to match oter dataframes
#names(hab_cl_patches_near@data) <- c("ano", "PRODES_ID")
head(hab_cl_patches_near) # check that it worked
# # # # # # # # # # # # # # # # # # # # # # # Run the loop # # # # # # # # # # # # # # # # # # # # # # #
# Use ddply to calculate nearest distance for each embargo ("numero_tad")
tmp <- ddply(punishments_need_near@data, .(numero_tad), function(x) { # numero_tad is a unique identifier per punishment
ID <- x$numero_tad[1]
tmp.punishments <- punishments_need_near[punishments_need_near@data$numero_tad == ID,]
tmp.patches <- hab_cl_patches_near[(hab_cl_patches_near@data$ano +5) >= tmp.punishments@data[,"ano_new"] & # match the punishments with habitat clearance in last 5 years (ano_new = yr of punishment, ano = yr of habitat clearance)
hab_cl_patches_near@data$ano <= tmp.punishments@data[,"ano_new"],] # and not after the punishment itself
obj <- gDistance(tmp.punishments, tmp.patches, byid=TRUE) # calculate the distance between each punishment and patch of habitat clearance
df <- data.frame(numero_tad = ID, PRODES_ID = tmp.patches$PRODES_ID[which.min(obj)], dist = min(obj)) # link punishment with the nearest suitable patch of habitat clearance
}, .progress='text') # progress bar
head(tmp)