1

I have a directory with a bunch of shapefiles for 50 cities (and will accumulate more). They are divided into three groups: cities' political boundaries (CityA_CD.shp, CityB_CD.shp, etc.), neighborhoods (CityA_Neighborhoods.shp, CityB_Neighborhoods.shp, etc.), and Census blocks (CityA_blocks.shp, CityB_blocks.shp, etc.). They use common file-naming syntaxes, have the same set of attribute variables, and are all in the same CRS. (I transformed all of them as such using QGIS.) I need to write a list of each group of files (political boundaries, neighborhoods, blocks) to read as sf objects and then bind the rows to create one large sf object for each group. However I am running into consistent problems developing this workflow in R.

library(tidyverse)
library(sf)
library(mapedit)

# This first line succeeds in creating a character string of the files that match the regex pattern.
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)

# This second line creates a list object from the files.
shapefile_list <- lapply(filenames, st_read)

# This third line (adopted from https://github.com/r-spatial/sf/issues/798) fails as follows.
districts <- mapedit:::combine_list_of_sf(shapefile_list)
Error: Column `District_I` cant be converted from character to numeric

# This fourth line fails in an apparently different way (also adopted from https://github.com/r-spatial/sf/issues/798).
districts <- do.call(what = sf:::rbind.sf, args = shapefile_list)
Error in CPL_get_z_range(obj, 2) : z error - expecting three columns;

The first error appears to be indicating that one of my shapefiles has an incorrect variable class for the common variable District_I but R provides no information to clue me into which file is causing the error.

The second error seems to be looking for a z coordinate but is only finding x and y in the geometry attribute.

I have four questions on this front:

  1. How can I have R identify which list item it is attempting to read and bind is causing an error that halts the process?
  2. How can I force R to ignore the incompatibility issue and coerce the variable class to character so that I can deal with the variable inconsistency (if that's what it is) in R?
  3. How can I drop a variable entirely from the read sf objects that is causing an error (i.e. omit District_I for all read_sf calls in the process)?
  4. More generally, what is going on and how can I solve the second error?

Thanks all as always for your help.

P.S.: I know this post isn't "reproducible" in the desired way, but I'm not sure how to make it so besides copying the contents of all my shapefiles. If I'm mistaken on this point, I'd gladly accept any wisdom on this front.

UPDATE: I've run

filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)
shapefile_list <- lapply(filenames, st_read)
districts <- mapedit:::combine_list_of_sf(shapefile_list)

successfully on a subset of three of the shapefiles. So I've confirmed that there is some class conflict between the column District_I in one of the files causing the hold-up when running the code on the full batch. But again, I need the error to identify the file name causing the issue so I can fix it in the file OR need the code to coerce District_I to character in all files (which is the class I want that variable to be in anyway).

A note, particularly regarding Pablo's recommendation:

districts <- do.call(what = dplyr::rbind_all, shapefile_list)

results in an error Error in (function (x, id = NULL) : unused argument

followed by a long string of digits and coordinates. So,

mapedit:::combine_list_of_sf(shapefile_list)

is definitely the mechanism to read from the list and merge the files, but I still need a way to diagnose the source of the column incompatibility error across shapefiles.

Abe
  • 393
  • 2
  • 13
  • To make you example reproducible, you can try the dput() function. Create a list with two sf data frames and do dput(yourlist). Then you can paste the output in your question. – Pablo Herreros Cantis Jun 17 '20 at 00:30
  • Pablo, thanks for this tip. The problem is the output of dput() exceeds the maximum of 1000 lines allowed in the Console. I tried to amend this with ```rstudioapi::writeRStudioPreference("console_max_lines", 300)```, but get ```Error: Function writeRStudioPreference not found in RStudio```. (See: https://community.rstudio.com/t/more-than-1000-lines-output-in-r-studio-console/3288/9) – Abe Jun 17 '20 at 00:42
  • then you can take two of your data frames and subset them to their top 100 rows, for example. Something like ```df1<-df1[1:100,]``` – Pablo Herreros Cantis Jun 17 '20 at 00:45
  • posted a couple options, see if anything helps! – Pablo Herreros Cantis Jun 17 '20 at 00:54
  • Maybe I'm missing something, but I think such a reprex is unsuited to the issues I'm having. I will only know which ```df``` to post as a sample when I know which one is causing the process to stop...which requires going through loading each one manually to detect an issue. Part of my question is how I can automate this detection (Question 1 above). (I apologize if I didn't state that clearly enough.) Also, I tried using ```sink()``` to drop the entirety of the ```shapefile_list``` into the question and it crashed the page, as the list is almost 500,000 lines. – Abe Jun 17 '20 at 15:05
  • 1
    So the code above works between SOME of your shapefiles? a subset will help determine if the code actually works but there is some faulty data. – Pablo Herreros Cantis Jun 17 '20 at 15:09
  • 1
    See if this post is of help to change the class of the District I column. https://community.rstudio.com/t/simplest-way-to-modify-the-same-column-in-multiple-dataframes-in-a-list/13076 – Pablo Herreros Cantis Jun 17 '20 at 18:02
  • Eureka! That did it! The one problem that remains (which isn't exclusive to this solution; any ```st_read``` of an individual shapefile seems to be doing this) is that I have another column ```District_A``` (which stands for "District Area") which should have a precision of 12 decimal places and R seems to be rounding automatically to a whole number. I may need to post a new question for this though. – Abe Jun 19 '20 at 00:33
  • I was incorrect. R is maintaining the precision; I needed to add ```options(digits=20)``` to achieve the desired presentation. I seem to be unable to execute geometric operations on the joined geometries however. – Abe Jun 19 '20 at 02:47

1 Answers1

0

So after much fretting and some great guidance from Pablo (and his link to https://community.rstudio.com/t/simplest-way-to-modify-the-same-column-in-multiple-dataframes-in-a-list/13076), the following works:

library(tidyverse)
library(sf)

# Reads in all shapefiles from Directory that include the string "_CDs".
filenames <- list.files("Directory", pattern=".*_CDs.*shp", full.names=TRUE)

# Applies the function st_read from the sf package to each file saved as a character string to transform the file list to a list object.
shapefile_list <- lapply(filenames, st_read)

# Creates a function that transforms a problem variable to class character for all shapefile reads.
my_func <- function(data, my_col){
  my_col <- enexpr(my_col)

  output <- data %>% 
    mutate(!!my_col := as.character(!!my_col))
}

# Applies the new function to our list of shapefiles and specifies "District_I" as our problem variable.
districts <- map_dfr(shapefile_list, ~my_func(.x, District_I))
Abe
  • 393
  • 2
  • 13