I'm working with a collection of big data sets (https://catalogue.data.gov.bc.ca/dataset/vegetation-composite-polygons) and am running into issues with `st_read` taking a very long time to load the data. I'm running RStudio 1.2.1335 and R 3.6.1 on a 2018 MacBook Pro with 32 GB of RAM.

The data in the above link only represents 1/8 of the total data I have (I have data for 2008-2018).

Currently, loading the unformatted data from the .gdb with `st_read` takes most of a day (5-8 hours). After I format and trim the data to include only what I want, loading it takes just 5-10 minutes.

I am looking for a way to avoid spending ~10 days loading each layer of the .gdb and formatting the data. Ideally, I'd be able to load all the layers at once and format all 62,795,746 rows (the count reported by `st_layers`) in one go.

I've tried loading the data as .gdb, .shp, and .gpkg.

.gpkg is fast once I get the data formatted.
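
Roughly, my workflow looks like this (the layer, column, and species-code names below are placeholders, not the real ones):

```r
library(sf)

# List the layers in the geodatabase -- this is where the 62,795,746-row count comes from
st_layers("VEG_COMP_POLY.gdb")

# Read one layer; this is the step that takes 5-8 hours
veg <- st_read("VEG_COMP_POLY.gdb", layer = "VEG_COMP_LYR_R1_POLY")

# Keep only the pine polygons and the ~15 columns I actually need
veg_small <- veg[veg$SPECIES_CD_1 %in% c("PL", "PLI"),
                 c("FEATURE_ID", "SPECIES_CD_1")]

# Save the trimmed data; re-reading this .gpkg only takes 5-10 minutes
st_write(veg_small, "veg_pine.gpkg")
```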

mike_howe
  • If you only need a small portion of the file (as it seems you do, given the differences in read times), have you tried using bash or something to subset the data before reading? – Raag Agrawal Aug 12 '19 at 15:40
  • I have not, mainly because I'm much more comfortable in R than I am in bash. My bash is limited to moving data around and submitting tasks to a cluster. – mike_howe Aug 12 '19 at 15:42
  • I don't want to answer your question by telling you to learn an entire new language - but bash is definitely quite useful for solving lots of big data handling problems. If you're doing relatively simple cutting, try this: https://www.datacamp.com/community/tutorials/shell-commands-data-scientist – Raag Agrawal Aug 12 '19 at 15:44
  • What's your "formatting and trim" process? Are you selecting and hence reducing the number of features? Dropping columns? – Spacedman Aug 12 '19 at 16:00
  • I'm not all that familiar with spatial data file types, but if they're a format where you can process them line by line, you may want to take a look at the answers to [this](https://stackoverflow.com/questions/12626637/reading-a-text-file-in-r-line-by-line) question. If you can't process the data line by line, then I don't think bash will help you either. In that case, you may want to think about running your code on a server in the cloud. – Brendan A. Aug 12 '19 at 16:00
  • @Spacedman, my formatting and trim process is selecting the polygons that contain certain species (in this case, any Pinus species) and selecting only ~15 columns instead of 200. I've tried to use the query in st_read, but haven't been able to get it to work on an openGDB file. – mike_howe Aug 12 '19 at 16:06
  • @mike_howe that sounds like a new question that could be worth asking. – Spacedman Aug 12 '19 at 16:31
  • Might be worth loading your data into a spatial database like postgis. You can run sql queries through the `st_read` function. – Chris Aug 12 '19 at 17:15
  • (from memory) I think you can also supply the `query` directly in `st_read()` to run directly on the `.gdb` file (a minimal sketch of this is below) – SymbolixAU Aug 13 '19 at 23:07
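
For reference, a minimal sketch of the `query` suggestion above. The SQL statement is evaluated by GDAL/OGR before anything reaches R; the layer, column, and species-code names here are placeholders:

```r
library(sf)

# The layer name would come from st_layers("VEG_COMP_POLY.gdb").
# Only the selected columns and matching rows are read into R.
veg_pine <- st_read(
  "VEG_COMP_POLY.gdb",
  query = "SELECT FEATURE_ID, SPECIES_CD_1
           FROM VEG_COMP_LYR_R1_POLY
           WHERE SPECIES_CD_1 IN ('PL', 'PLI')"
)
```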

1 Answer


If the query argument in sf::st_read() isn't working, you could potentially import the data using the R-ArcGIS bridge. This assumes you have a licensed copy of ArcGIS Pro. arc.select() has a fields argument where you can specify which fields you want, and a where_clause argument for an SQL where expression. If the data are available through AGOL, you can also add a spatial filter, where the querying is done on the server (see arcpullr).
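
A minimal sketch of what that could look like, assuming the arcgisbinding package is installed and licensed; the geodatabase path, layer name, fields, and where clause are placeholders:

```r
library(arcgisbinding)
arc.check_product()  # connects to the local ArcGIS Pro license

# Open a layer inside the file geodatabase
lyr <- arc.open("VEG_COMP_POLY.gdb/VEG_COMP_LYR_R1_POLY")

# Pull only the fields you need, filtered by an SQL where clause
d <- arc.select(
  lyr,
  fields       = c("FEATURE_ID", "SPECIES_CD_1"),
  where_clause = "SPECIES_CD_1 IN ('PL', 'PLI')"
)

# Convert the result to an sf object to keep working in sf
d_sf <- arc.data2sf(d)
```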

Andy Lyons