I am building a Shiny application in which a large ggplot2 fortified dataframe needs to be calculated over and over again from a large number of external source files. I am searching for the fastest and most efficient way to do this. In the following paragraphs I delve a little more into the subject and the code I have so far, and also describe the input data, to enable your kind assistance.

I am using the Helsinki Region Travel Time Matrix 2018, a dataset provided by the Digital Geography Lab, a research group at the University of Helsinki. This data uses a generalised map of the Helsinki capital region, in 250 x 250 meter cells (in my code grid_f), to calculate travel times between all cells on the map (grid ids are called YKR_ID, n = 13231) by public transport, private car, bicycle and on foot. The calculations are stored in delimited .txt files, one text file for all the travel times to a specific cell id. The data is available for download at this website, under "Download the data". NB: the unzipped data is 13.8 GB in size.

Here is a selection from a text file in the dataset:

from_id;to_id;walk_t;walk_d;bike_s_t;bike_f_t;bike_d;pt_r_tt;pt_r_t;pt_r_d;pt_m_tt;pt_m_t;pt_m_d;car_r_t;car_r_d;car_m_t;car_m_d;car_sl_t
5785640;5785640;0;0;-1;-1;-1;0;0;0;0;0;0;-1;0;-1;0;-1
5785641;5785640;48;3353;51;32;11590;48;48;3353;48;48;3353;22;985;21;985;16
5785642;5785640;50;3471;51;32;11590;50;50;3471;50;50;3471;22;12167;21;12167;16
5785643;5785640;54;3764;41;26;9333;54;54;3764;54;54;3764;22;10372;21;10370;16
5787544;5785640;38;2658;10;7;1758;38;38;2658;38;38;2658;7;2183;7;2183;6

My interest is to visualise (with ggplot2) this 250x250m Helsinki region map for one travel mode, the private car, using any of the possible 13231 cell ids, repeatedly if the user wants. Because of this it is important that the dataframe fetch is as fast and efficient as possible. For this question, let's concentrate on the fetching and processing of the data from the external files and use only one specific id value.

In a nutshell, after I have produced a ggplot2::fortify() version of the 250 x 250 meter grid spatial dataset grid_f,

  • I need to scan through all the 13231 Travel Time Matrix 2018 text files
  • Pick only the relevant columns (from_id, to_id, car_r_t, car_m_t, car_sl_t) in each file
  • Pick the relevant row using from_id (in this case, origin_id <- "5985086") in each file
  • Join the resulting row to the fortified spatial data grid_f

My code is as follows:

# Libraries
library(ggplot2)
library(dplyr)
library(rgdal)
library(data.table)
library(sf)
library(sp)

# File paths. ttm_path is the folder which contains the unchanged Travel
# Time Matrix 2018 data from the research group's home page
ttm_path <- "HelsinkiTravelTimeMatrix2018"
gridpath <- "MetropAccess_YKR_grid_EurefFIN.shp"


#### Import grid cells
# use this CRS information throughout the app
app_crs <- sp::CRS("+init=epsg:3067")

# Read grid shapefile and transform
grid_f <- rgdal::readOGR(gridpath, stringsAsFactors = TRUE) %>%
  sp::spTransform(., app_crs) %>%
  # preserve grid dataframe data in the fortify
  {dplyr::left_join(ggplot2::fortify(.),
                    as.data.frame(.) %>%
                      dplyr::mutate(id = as.character(dplyr::row_number() - 1)))} %>%
  dplyr::select(-c(x, y))

The code above this point is meant to run only once. The code below, more or less, would be run over and over with different origin_ids.

#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)

# column positions of columns from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c(1, 2, 14, 16, 18)

# grid_f as data.table version
dt_grid <- as.data.table(grid_f)

# Get filepaths of all of the TTM18 data. Remove metadata textfile filepath.
all_files <- list.files(path = ttm_path, 
                        pattern = ".txt$", 
                        recursive = TRUE, 
                        full.names = TRUE)
all_files <- all_files[-length(all_files)]

# lapply function
TTM18_fetch <- function(x, col_range, origin_id) {
  res <- fread(x, select = col_range)
  res <- subset(res, from_id == origin_id)
  return(res)
}

# The part of the code that needs to be fast and efficient
result <- 
  lapply(all_files, FUN = TTM18_fetch, col_range, origin_id_num) %>%
  data.table::rbindlist(., fill = TRUE) %>%
  data.table::merge.data.table(dt_grid, ., by.x = "YKR_ID", by.y = "to_id")

The dataframe result should have 66155 rows of 12 variables, five rows for each 250x250 meter grid cell. The columns are YKR_ID, long, lat, order, hole, piece, id, group, from_id, car_r_t, car_m_t, car_sl_t.

My current lapply() and data.table::fread() solution takes about 2-3 minutes to complete. I think this is already a good achievement, but I can't help but think there are better and faster ways to complete this. So far, I have tried these alternatives to what I now have:

  • A conventional for loop: that was obviously a slow solution
  • I tried to teach myself more about vectorised functions in R, but that did not lead anywhere (I used this link)
  • Tried to dabble with with(), unsuccessfully, following this SO question, inspired by this SO question
  • Looked into package parallel but ended up not utilising it because of the Windows environment I am using (a sketch of what I had in mind appears after this list)
  • Tried to find alternative ways to solve this with apply() and sapply(), but nothing noteworthy came out of that.
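
For reference, PSOCK clusters from the parallel package do run on Windows; below is a minimal, unbenchmarked sketch of what I had in mind, reusing TTM18_fetch(), all_files, col_range, origin_id_num and dt_grid defined above. Since the job is largely disk-bound, the gains may well be modest.

library(parallel)

cl <- parallel::makeCluster(parallel::detectCores() - 1)  # PSOCK cluster, works on Windows
parallel::clusterEvalQ(cl, library(data.table))           # make fread() available on each worker

result_par <- parallel::parLapply(cl, all_files, TTM18_fetch, col_range, origin_id_num) %>%
  data.table::rbindlist(fill = TRUE) %>%
  data.table::merge.data.table(dt_grid, ., by.x = "YKR_ID", by.y = "to_id")

parallel::stopCluster(cl)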

As to why I didn't do all this to the data before ggplot2::fortify, I simply found it troublesome to work with a SpatialPolygonsDataFrame.

Thank you for your time.

Vesanen
    it should be faster to read data from RData files than from CSV files ... – Ben Bolker May 31 '20 at 18:36
    arrow file formats like parquet and feather are also pretty damn fast, look into the vroom package – Bruno May 31 '20 at 18:43
  • I wrote a thing about subsetting from files via Drill+sergeant: https://alistaire.rbind.io/blog/querying-across-files-with-apache-drill/ – alistaire May 31 '20 at 18:45
  • and yes, if the data has a lot of columns but you only want to read in some of them, parquet is a good choice – alistaire May 31 '20 at 18:47
    Have you looked at the [`fst` package](https://github.com/fstpackage/fst)? It's insanely fast and would be perfect for what you are trying to do. – Dewey Brooke Jun 01 '20 at 11:54
  • Thank you all for your inputs. In this situation, I did not want to use solutions involving database connections, even locally hosted ones. I also tried out ``vroom``, which I found weirdly unwieldy inside the ``lapply()`` function shown in my original question. ``arrow`` and the parquet format came close to providing the solution, but in the end it was the flexible function parameters of ``fst::read_fst()`` that won me over. – Vesanen Jun 01 '20 at 19:43

1 Answer


Whenever I’m trying to figure out how to improve the performance of my R functions, I generally use the following approach. First, I look for any function calls that may be unnecessary, or identify places where multiple function calls can be simplified into one. Then, I look for the places in my code that incur the greatest time penalty by benchmarking each part separately. This can easily be done using the microbenchmark package.

For example, we can ask if we get better performance with or without piping (e.g. %>%).

# hint... piping is always slower
library(magrittr)
library(microbenchmark)
microbenchmark(
  pipe = iris %>% subset(Species=='setosa'),
  no_pipe = subset(iris, Species=='setosa'),
  times = 200)
Unit: microseconds
    expr     min      lq     mean   median       uq      max neval cld
    pipe 157.518 196.739 308.1328 229.6775 312.6565 2473.582   200   b
 no_pipe  84.894 116.386 145.4039 126.1950 139.4100  612.492   200  a 

Here, we find that subsetting the data.frame without piping takes nearly half the time to execute!

Next, I determine the net time penalty for each place I benchmarked by multiplying the execution time by the total number of times it needs to be executed. For the areas with the greatest net time penalty, I try to replace them with faster functions and/or try to reduce the total number of times they need to be executed.
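
For example, the per-file difference can be turned into a whole-run estimate by benchmarking one representative file and multiplying by the 13231 files involved. A sketch of that calculation, using placeholder file names and assuming one TTM18 file has already been converted to fst as described further below:

library(microbenchmark)
library(data.table)
library(fst)

bm <- microbenchmark(
  fread = fread("one_ttm18_file.txt", select = c(1, 2, 14, 16, 18)),
  fst   = read_fst("one_ttm18_file.fst",
                   columns = c("from_id", "to_id", "car_r_t", "car_m_t", "car_sl_t")),
  times = 20)

med <- summary(bm, unit = "s")$median   # median seconds per read
(med[1] - med[2]) * 13231               # estimated net saving over all files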

TLDR

In your case, you can speed things up by using the fst package, although you would need to convert the delimited text files to fst files first.

# before
TTM18_fetch <- function(x, col_range, origin_id) {
  res <- data.table::fread(x, select = col_range)
  res <- subset(res, from_id == origin_id)
  return(res)
}

# after (NB x needs to be a fst file)
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')

TTM18_fetch <- function(x, col_range, origin_id) {
  res <- fst::read_fst(path = x,
                       columns = col_range,
                       as.data.table = TRUE)[from_id==origin_id]
  return(res)
}

To convert your text files to fst:

library(data.table)
library(fst)
ttm_path <- 'REPLACE THIS'
new_ttm_path <- 'REPLACE THIS'

# Get filepaths of all of the TTM18 data. Remove metadata textfile filepath.
all_files <- list.files(path = ttm_path, 
                        pattern = ".txt$", 
                        recursive = TRUE, 
                        full.names = TRUE)

all_files <- all_files[!grepl('[Mm]eta', all_files)]

# creating new file paths and names for fst files
file_names <- list.files(path = ttm_path, 
                        pattern = ".txt$", 
                        recursive = TRUE)
file_names <- file_names[!grepl('[Mm]eta', file_names)]

file_names <- gsub(pattern = '\\.txt$',
                   replacement = '.fst', 
                   x = file_names)

file_names <- file.path(new_ttm_path, file_names)

# txt to fst conversion

require(progress) # this will help you keep track of the conversion progress
pb <- progress_bar$new(
  format = " :what [:bar] :percent eta: :eta",
  clear = FALSE, total = 3 * length(file_names), width = 60) # three ticks per file: reading, writing, indexing


# an index file to store from_id file locations
from_id_paths <- data.table(from_id = numeric(), 
                            file_path = character())

for(i in seq_along(file_names)){

  pb$tick(tokens = list(what = 'reading'))
  tmp <- data.table::fread(all_files[i], key = 'from_id')

  pb$tick(tokens = list(what = 'writing'))
  fst::write_fst(tmp,
                 compress = 50,  # less compressed files read faster
                 path = file_names[i] )  

  pb$tick(tokens = list(what = 'indexing'))
  from_id_paths <- rbind(from_id_paths,  
                         data.table(from_id = unique(tmp$from_id),
                                    file_path = file_names[i]))

}

setkey(from_id_paths, from_id)
write_fst(from_id_paths,
          path = file.path(new_ttm_path, 'from_id_index.fst'),
          compress = 0)
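
As a quick sanity check on one converted file (a sketch, assuming the paths created above), the fst copy should contain the same rows as the original text file:

chk_txt <- data.table::fread(all_files[1], key = 'from_id')
chk_fst <- fst::read_fst(file_names[1], as.data.table = TRUE)
stopifnot(nrow(chk_txt) == nrow(chk_fst),
          identical(chk_txt$from_id, chk_fst$from_id))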

This would be the replacement

library(fst)
library(data.table)
new_ttm_path <- 'REPLACE THIS'

#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)

# column positions of columns from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')

# grid_f as data.table version
dt_grid <- as.data.table(grid_f)


necessary_files <- read_fst(path = file.path(new_ttm_path,
                                             'from_id_index.fst'),
                            as.data.table = TRUE
                            )[from_id == origin_id_num, file_path]


TTM18_fetch <- function(x, col_range, origin_id) {
  res <- fst::read_fst(path = x,
                       columns = col_range,
                       as.data.table = TRUE)[from_id==origin_id]
  return(res)
}


result <- rbindlist(lapply(necessary_files, FUN = TTM18_fetch, col_range, origin_id_num),
                    fill = TRUE)
result <- data.table::merge.data.table(dt_grid, result, by.x = "YKR_ID", by.y = "to_id")
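
As a quick check, this replacement should reproduce the shape described in the question: five fortified rows per grid cell, 13231 cells, 66155 rows in total.

stopifnot(nrow(result) == 66155)
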
Dewey Brooke
  • Thank you @Dewey Brooke. Converting my data to fst and then reading the fst format dataset made any recalculation of the dataframe drop from 2-3 minutes runtime to ~5 seconds. I accepted this answer as the question solution. – Vesanen Jun 01 '20 at 19:40
  • I figured it would speed things up. – Dewey Brooke Jun 03 '20 at 00:28