
I am working with the R programming language.

I have the following large dataset (Canadian geospatial shapefile) that I downloaded directly from the internet:

# https://stackoverflow.com/questions/75913166/no-simple-features-geometry-column-present-in-shapefile
library(sf)
library(rgdal)
# Set the URL for the shapefile
url <- "https://www12.statcan.gc.ca/census-recensement/2011/geo/RNF-FRR/files-fichiers/lrnf000r22a_e.zip"

# Create a temporary folder to download and extract the shapefile
temp_dir <- tempdir()
temp_file <- file.path(temp_dir, "lrnf000r22a_e.zip")

# Download the shapefile to the temporary folder
download.file(url, temp_file)

# Extract the shapefile from the downloaded zip file
unzip(temp_file, exdir = temp_dir)

My computer does not have much RAM and cannot import this file. I am therefore trying to learn more about the properties of this file without importing it.

Without importing the file into R, I was able to list all the variables in this file:

> ogrInfo(dsn = temp_dir, layer = "lrnf000r22a_e")
Source: "C:\Users\me\AppData\Local\Temp\RtmpwXsVlD", layer: "lrnf000r22a_e"
Driver: ESRI Shapefile; number of rows: 2246324 
Feature type: wkbLineString with 2 dimensions
Extent: (3696309 665490.8) - (9015653 4438073)
CRS: +proj=lcc +lat_0=63.390675 +lon_0=-91.8666666666667 +lat_1=49 +lat_2=77 +x_0=6200000 +y_0=3000000 +datum=NAD83 +units=m +no_defs 
LDID: 87 
Number of fields: 21 
        name type length  typeName
1   OBJECTID   12     10 Integer64
2    NGD_UID    4      9    String
3       NAME    4     50    String
4       TYPE    4      6    String
5        DIR    4      2    String
6    AFL_VAL    4      9    String
7    ATL_VAL    4      9    String
8    AFR_VAL    4      9    String
9    ATR_VAL    4      9    String
10  CSDUID_L    4      7    String
11 CSDNAME_L    4    100    String
12 CSDTYPE_L    4      3    String
13  CSDUID_R    4      7    String
14 CSDNAME_R    4    100    String
15 CSDTYPE_R    4      3    String
16   PRUID_L    4      2    String
17  PRNAME_L    4    100    String
18   PRUID_R    4      2    String
19  PRNAME_R    4    100    String
20      RANK    4      4    String
21     CLASS    4      4    String

My Question: Without importing this file into R, is it also possible to determine the values that these variables can take? (For example, PRNAME is likely Province_Name, so PRNAME_L and PRNAME_R likely contain values such as "ONTARIO", "QUEBEC", etc.)

Thanks!

Note: Metadata https://www150.statcan.gc.ca/n1/pub/92-500-g/2021001/tbl/tbl_4.1-eng.htm

# file transfers using R
file.copy(from = file.path(temp_dir, "file_name.txt"),
          to = file.path(getwd(), "file_name.txt"))
stats_noob

1 Answer


The data appears to be stored in lrnf000r22a_e.dbf, which looks like a dBASE-format file. This type of file does not store metadata about the possible values of each column; the text values are simply stored in the body of the file, so you would need to scan the entire file to find all of them. There are existing parsers such as foreign::read.dbf, but they assume you want to load all the data into memory. If you just want a list of unique values, you would probably have to write your own custom parser with that purpose in mind.
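To illustrate what the file does store up front, here is a minimal sketch that reads only the header of a .dbf file with base R. It assumes the plain dBASE III layout (a 32-byte file header followed by 32-byte field descriptors and a 0x0D terminator), which I have not verified for this particular file. `dbf_header` is a hypothetical helper name, not part of any package.

```r
# Sketch: read only the header of a dBASE III style .dbf file.
# Assumes the classic layout; other dBASE variants may differ.
dbf_header <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))

  # Decode a little-endian unsigned integer from a raw vector
  le_uint <- function(b) sum(as.integer(b) * 256^(seq_along(b) - 1))

  hdr        <- readBin(con, "raw", n = 32L)
  n_records  <- le_uint(hdr[5:8])    # bytes 4-7 (0-based): record count
  header_len <- le_uint(hdr[9:10])   # bytes 8-9: total header length
  record_len <- le_uint(hdr[11:12])  # bytes 10-11: bytes per record

  # Field descriptors: 32 bytes each, followed by a 0x0D terminator
  desc     <- readBin(con, "raw", n = header_len - 32)
  n_fields <- (length(desc) - 1L) %/% 32L
  desc     <- matrix(desc[seq_len(n_fields * 32L)], nrow = 32L)

  fields <- data.frame(
    name = vapply(seq_len(n_fields), function(j) {
      nm <- desc[1:11, j]
      rawToChar(nm[cumsum(nm == as.raw(0)) == 0])  # keep bytes before first NUL
    }, character(1)),
    type   = vapply(seq_len(n_fields),
                    function(j) rawToChar(desc[12, j]), character(1)),
    length = as.integer(desc[17, ]),
    stringsAsFactors = FALSE
  )

  list(n_records = n_records, header_len = header_len,
       record_len = record_len, fields = fields)
}
```

Under that assumption, `dbf_header(file.path(temp_dir, "lrnf000r22a_e.dbf"))` would return the record count, record length, and field names and widths, but nothing about which values occur in each field.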

MrFlick
  • @MrFlick: Thank you so much for your answer! Did you mean something like this? foreign::read.dbf(file.path(temp_dir, "lrnf000r22a_e.dbf")) – stats_noob Apr 03 '23 at 18:26
  • Suppose I am unable to load everything into memory ... can you please explain what you mean by "write your own parser"? – stats_noob Apr 03 '23 at 18:26
  • You'd have to read the dbf file format spec and extract the data you want from the bytes of the file itself. Most often that would be done with something like Rcpp so you can read chunks of bytes at a time. In theory you could read chunks of bytes from a file in R with `readBin`, but that might be slow for a large file. It will be a lot of work either way. – MrFlick Apr 03 '23 at 18:36
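
A rough sketch of the pure-R `readBin` route described in the last comment, for anyone who wants to try it. It is not a tested parser for this specific file. It assumes the plain dBASE III layout (32-byte header, 32-byte field descriptors, fixed-width records that each start with a one-byte deletion flag) and that text is Latin-1 compatible. `dbf_unique_values`, `chunk_records`, and `encoding` are hypothetical names chosen for illustration.

```r
# Sketch: collect the unique values of one field by scanning a .dbf
# in fixed-size chunks, so only one chunk is in memory at a time.
# Assumes the dBASE III layout and a Latin-1 compatible encoding.
dbf_unique_values <- function(path, field, chunk_records = 50000L,
                              encoding = "latin1") {
  con <- file(path, "rb")
  on.exit(close(con))

  # Decode a little-endian unsigned integer from a raw vector
  le_uint <- function(b) sum(as.integer(b) * 256^(seq_along(b) - 1))

  hdr        <- readBin(con, "raw", n = 32L)
  n_records  <- le_uint(hdr[5:8])
  header_len <- le_uint(hdr[9:10])
  record_len <- le_uint(hdr[11:12])

  # Field descriptors give each field's name and width
  desc     <- readBin(con, "raw", n = header_len - 32)
  n_fields <- (length(desc) - 1L) %/% 32L
  desc     <- matrix(desc[seq_len(n_fields * 32L)], nrow = 32L)
  field_names <- vapply(seq_len(n_fields), function(j) {
    nm <- desc[1:11, j]
    rawToChar(nm[cumsum(nm == as.raw(0)) == 0])
  }, character(1))
  field_lens <- as.integer(desc[17, ])

  idx <- match(field, field_names)
  if (is.na(idx)) stop("field not found: ", field)

  # Byte range of the field within a record (byte 1 is the deletion flag)
  first <- 2L + sum(field_lens[seq_len(idx - 1L)])
  last  <- first + field_lens[idx] - 1L

  seen      <- character(0)
  remaining <- n_records
  while (remaining > 0) {
    bytes <- readBin(con, "raw",
                     n = min(chunk_records, remaining) * record_len)
    got <- length(bytes) %/% record_len
    if (got == 0) break

    # One column per record
    m <- matrix(bytes[seq_len(got * record_len)], nrow = record_len)
    m <- m[, m[1, ] != as.raw(0x2A), drop = FALSE]  # skip records flagged "*"
    sub <- m[first:last, , drop = FALSE]
    sub[sub == as.raw(0)] <- as.raw(0x20)           # treat NUL padding as spaces

    vals <- vapply(seq_len(ncol(sub)),
                   function(j) rawToChar(sub[, j]), character(1))
    vals <- trimws(iconv(vals, from = encoding, to = "UTF-8"))
    seen <- union(seen, vals)
    remaining <- remaining - got
  }
  seen
}
```

Usage would look something like `dbf_unique_values(file.path(temp_dir, "lrnf000r22a_e.dbf"), "PRNAME_L")`. Memory use is bounded by `chunk_records` times the record length, but converting each record with `rawToChar` is slow in R, so expect a scan of about 2.2 million rows to take a while.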