How can I optimize this pattern matching function to handle large data sets in R?

Question

How do I optimize this pattern matching function?

I have a function that accepts a dataframe and a vector of strings called targets. The function returns a data frame that keeps only the rows which have at least one value that matches one of the targets. I'm running this function on dataframes with up to 28 million rows, so speed is critical.

match_targets <- function(df, target_codes){
   pattern = paste0("^", target_codes, collapse = "|")
   df <- df %>% 
         filter( rowSums( sapply( ., grepl, pattern) ) > 0)
}

I'm new to the dplyr library, so maybe there's some functions that I am not taking advantage of. Please let me know if there's any way to improve the performance of this function

Tried on large data, expected faster runtime

Could you post a snippet of your data? Paste the output of the function `dput(df)` — Ricardo Semião e Castro, Jun 01 '23 at 23:53

score 0 · Answer 1 · answered Jun 02 '23 at 05:04

In R, there are many factors which can affect performance. In order for others to provide 'good' answers, it would be helpful if you could edit your question to include your current runtime, the sort of speed-up you expect, and a minimal reproducible example (alternatively, describe any differences between your actual data and my example below, e.g. how long is your target_codes vector, what are the dimensions of your dataframes, etc). Also, how many dataframes are you applying this to, and have you looked at other alternatives, e.g. data.table, or other languages, e.g. awk?

Your code didn't run 'as is' on my example dataset, so I ran an altered version. I also tested PCRE regex (perl = TRUE in the grep) which can improve speed for reasons:

library(tidyverse)

match_targets_dplyr <- function(df, target_codes){
  pattern = paste0("^", target_codes, collapse = "|")
  df <- filter(df, if_any(everything(), ~grepl(pattern, .x)))
  return(df)
}

match_targets_dplyr_perl <- function(df, target_codes){
  pattern = paste0("^", target_codes, collapse = "|")
  df <- filter(df, if_any(everything(), ~grepl(pattern, .x, perl = TRUE)))
  return(df)
}

match_targets_base <- function(df, target_codes){
  pattern = paste0("^", target_codes, collapse = "|")
  df <- df[rowSums(apply(df, 2, \(x) grepl(pattern, x))) > 0, , drop = FALSE]
  return(df)
}

match_targets_base_perl <- function(df, target_codes){
  pattern = paste0("^", target_codes, collapse = "|")
  df <- df[rowSums(apply(df, 2, \(x) grepl(pattern, x, perl = TRUE))) > 0, , drop = FALSE]
  return(df)
}

# grab some data
df <- read_csv("https://raw.github.com/VladAluas/Text_Analysis/master/Datasets/Text_review.csv")
#> Rows: 433 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): Model, Segment, Text
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_large <- df

# the "Text" field has some non-standard characters; remove them
df_large$Text <- iconv(df_large$Text, from = "UTF-8", to = "ASCII", sub = "")

# select some target codes
target_codes = c("Perform", "Ca", "Net")

# check you get the same answer from all of the functions
df2 <- match_targets_dplyr(df_large, target_codes)
df3 <- match_targets_dplyr_perl(df_large, target_codes)
df4 <- match_targets_base(df_large, target_codes)
df5 <- match_targets_base_perl(df_large, target_codes)
all.equal(df2, df3, df4, df5, check.attributes = FALSE)
#> [1] TRUE

# increase the size of the data to ~28mil rows
repeat{df_large <- df_large %>% bind_rows(df_large); if(NROW(df_large) > 2.8e7) {break}}

# benchmark the functions on the 28mil rows
library(microbenchmark)
res <- microbenchmark(match_targets_dplyr(df_large, target_codes),
                      match_targets_dplyr_perl(df_large, target_codes),
                      match_targets_base(df_large, target_codes),
                      match_targets_base_perl(df_large, target_codes),
                      times = 2)
autoplot(res)

^{Created on 2023-06-02 with reprex v2.0.2}

It's hard to see what's going on in the reprex image; here it is with the points highlighted:

TL,DR: dplyr is a bit faster, and PCRE was much faster using this example dataset.

How can I optimize this pattern matching function to handle large data sets in R?

1 Answers1