0

I have used ddply and gDistance (package: rgeos) to link two sets of polygons. The process is, however, very slow for my large dataset, as explained in a number of blogs (e.g. Why is plyr so slow?).

These blogs suggest that using data.table would be faster, though I cannot work out how to make it work for my case. In particular, how do I apply gDistance for only a subset of my habitat clearance data, within data.table...?

Below I explain in more detail what I am trying to do and attach a small subset of my data for testing code (here: https://www.dropbox.com/sh/9onaeltb81qrd7h/AAAdED0KS6n2EzP74F3o81Lxa?dl=0).

I have two shapefiles containing polygons (one of patches of habitat clearance and the other properties punished for clearing habitat), where I want to identify the polygon in the first layer (habitat clearance) which is nearest to each polygon in the second layer (punished properties), since the two do not always overlap exactly. Plus, there is one additional constraint; the matched habitat clearance polygon cannot be more than 5 years older than the environmental crime.

I have successfully used the following ddply code - any advice on how to do this with data.table, which I think will be faster...?

Thanks!

# # # # # # # # # # # # # # # # # # # Load [R] GIS packages. # # # # # # # # # # # # # # # # # # 
library(rgeos)
library(raster)
library(rgdal)
library(plyr)

# # # # # # # # # # # # # # # # # # # ENTER shapefile information HERE # # # # # # # # # # # # # # # # # # # # # #


# What is the name of the punishments shapefile?
punishments <- "punishments_stack_overflow"  


# What is the name of the PUNISHMENTs directory?
myDrctry <- "E:/Esri_arcGIS_datasets/SM_data/IBAMA_embargo/final_embargo_list/near_chopped/stack_overflow" # !CHANGE ME - where the shapefiles are stored


# What is the name of the hab_cl shapefile?
hab_cl_shp <- "RO_SimU_deforestation_Amazonia_SIRGAS_near" 


# What is the name of the hab_cl data directory?
my_hab_cl_Drctry <- "E:/Esri_arcGIS_datasets/SM_data/PRODES/Deforestation_per_SimU/near_analysis" #! CHANGE ME




# # # # # # # # # # # # # # # # # # # # Load the shapefiles  # # # # # # # # # # # # # # # # # # # # # # 


# Read in the embargo shapefile
punishments_need_near <- readOGR(dsn=myDrctry, layer=punishments)


# Identify the attributes to keep
myattributes <- c("numero_tad", "data_tad", "CD_BIOMA")


# Subset the full dataset extracting only the desired attributes
punishments_need_near@data <- punishments_need_near@data[,myattributes]


# Load the deforestation data 
hab_cl_patches_near <- readOGR(dsn=my_hab_cl_Drctry, layer=hab_cl_shp) 
proj4string(hab_cl_patches_near)               # check the projection (which is SIRGAS 2000 UTM)
#hab_cl_patches_near@data <- hab_cl_patches_near@data[,c("year","LAPIG_ID")] # manipulate the columns to match oter dataframes
#names(hab_cl_patches_near@data) <- c("ano", "PRODES_ID")
head(hab_cl_patches_near)                                                   # check that it worked 


# # # # # # # # # # # # # # # # # # # # # # # Run the loop # # # # # # # # # # # # # # # # # # # # # # # 


# Use ddply to calculate nearest distance for each embargo ("numero_tad")
tmp <- ddply(punishments_need_near@data, .(numero_tad), function(x) {    # numero_tad is a unique identifier per punishment
  ID <- x$numero_tad[1]
  tmp.punishments <- punishments_need_near[punishments_need_near@data$numero_tad == ID,]                    
  tmp.patches <- hab_cl_patches_near[(hab_cl_patches_near@data$ano +5) >= tmp.punishments@data[,"ano_new"] &     # match the punishments with habitat clearance in last 5 years (ano_new = yr of punishment, ano = yr of habitat clearance)
                                     hab_cl_patches_near@data$ano <= tmp.punishments@data[,"ano_new"],]          # and not after the punishment itself
  obj <- gDistance(tmp.punishments, tmp.patches, byid=TRUE)                                                      # calculate the distance between each punishment and patch of habitat clearance
  df  <- data.frame(numero_tad = ID, PRODES_ID = tmp.patches$PRODES_ID[which.min(obj)], dist = min(obj))    # link punishment with the nearest suitable patch of habitat clearance
}, .progress='text') # progress bar

head(tmp)
Community
  • 1
  • 1
  • Sounds like an opportunity to use `.BY`. See https://github.com/Rdatatable/data.table/issues/1363 for some examples of such – MichaelChirico Mar 02 '16 at 15:29

1 Answers1

0

Your polygons are stored in Spatial Polygons Data Frame class objects. data.table only works with data.table type objects. You need convert the data subset created by your data.table to a Spatial Polygons Data Frame.

gDistance requires Spatial objects but it is receiving data table objects.

To work around this, convert the column(s) corresponding to your polygons back to polygons within the aggregation function.

Your Dropbox link is broken, but here's an example using Spatial Points (instead of Polygons). I assume the same approach applies.

library(sp)
library(rgeos)

punishments <- data.frame(numero_tad = sample(1:3, 10, replace=TRUE), ano_new = sample(1:10, 10, replace=TRUE), x = sample(1:100, 10), y = sample(1:100, 10) )
patches <- data.frame(PRODES_ID = 1:10, ano = sample(1:10, 10, replace=TRUE), x = sample(1:100, 10), y = sample(1:100, 10) )
head(punishments)
head(patches)

coordinates(punishments) <- ~x+y # convert to Spatial object
coordinates(patches) <- ~x+y # convert to Spatial object

class(punishments) # "SpatialPointsDataFrame"
head(punishments)

link <- function(SD) {
  coordinates(SD) <- ~x+y # convert from data.table to Spatial object
  # you'll need to do something like to above to convert back to a polygon
  tmp.patches <- patches[(patches$ano + 5) >= SD$ano_new & patches$ano < SD$ano_new, ]
  obj <- gDistance(SD, tmp.patches, byid=TRUE)
  df <- list(PRODES_ID = tmp.patches$PRODES_ID[which(obj == min(obj), arr.ind=TRUE)[1]], dist = min(obj) )
  return(df)
}

dt <- as.data.table(punishments)
class(dt)
head(dt)
setkey(dt, numero_tad)

tmp <- dt[,link(.SD), by=numero_tad]

Hope that helps!

Mark Egge
  • 638
  • 8
  • 9