
I have a table with 22 million rows, each containing one set of vital signs, a patient ID, and a time. I'm trying to get a summary table containing the ID and the number of non-null values for each vital sign (column).

My code below executes and gives reasonable answers but takes forever. I'm wondering if there is a more efficient approach to this problem. Any optimization strategies?

The code below renames the ID column from "pcrid" to "PCRID" to make the resulting table compatible with my legacy code. I also filter the table a bit. These steps run quickly on the entire dataset, so they don't seem to be the slow part.

Here's how it performed on different dataset sizes (subset with head()):

  • 1,000 rows = 0.2 seconds
  • 10,000 rows = 1.7 seconds
  • 100,000 rows = 15 seconds
  • 1,000,000 rows = 2.9 minutes
  • 22,000,000 rows = 42 minutes

    Start <- Sys.time()
    vitals_all <- vitals_all.df %>%
      select(PCRID = pcrid, everything()) %>%
      filter((pta == "no" | pta == "unk") & !is.na(pta)) %>%
      group_by(PCRID) %>%
      summarise(
        n_AVPU = sum(!is.na(avpu)),
        n_SBP = sum(!is.na(sbp)),
        n_DBP = sum(!is.na(dbp)),
        n_HR = sum(!is.na(pulserate)),
        n_RR = sum(!is.na(rr)),
        n_SpO2 = sum(!is.na(spo2)),
        n_EtCO2 = sum(!is.na(etco2)),
        n_CO = sum(!is.na(co)),
        n_BGL = sum(!is.na(glucose)),
        n_Temp = sum(!is.na(tempf)),
        n_Pain = sum(!is.na(painscale)),
        n_GCS = sum(!is.na(gcs))) 
    Sys.time() - Start
Jeff Jarvis
  • For problems with many groups, `data.table` will likely be much faster than `dplyr`. – Jon Spring Jun 17 '19 at 23:30
  • For example, see here for a similar question: https://stackoverflow.com/questions/29683752/efficiently-counting-non-na-elements-in-data-table – Jon Spring Jun 17 '19 at 23:39

3 Answers


I'm a beginner with data.table, but I know it can offer significant performance improvements over dplyr when there is a large number of groups to compute over.

I haven't figured out the data.table syntax to both group by PCRID and calculate counts of non-NAs across many columns. To get around that, I tried using dtplyr, a dplyr-based frontend for data.table, and got some substantial performance improvements.

Using some fake data (see bottom) of similar size to yours, the counting from your post took 197 seconds; when I loaded data.table and dtplyr and re-ran it, it took 77 seconds, shaving 61% off the time, with the same output. Your results may vary, but I wouldn't be surprised if further data.table efficiencies could reduce that time much more.

library(data.table); library(dtplyr)
vitals_fake_DT <- data.table(vitals_fake)

vitals_fake_DT %>%
  arrange(PCRID) %>% # to make output order the same between methods
  group_by(PCRID) %>%
  summarise(
    n_AVPU = sum(!is.na(avpu)),
    n_SBP = sum(!is.na(sbp))
    # etc. for the remaining vital-sign columns
  )
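
For reference, a pure data.table formulation of the same counts would presumably use lapply() over .SD, as sketched below; I haven't benchmarked this against the dtplyr version here:

# count non-NA values per column within each PCRID group
# (columns keep their original names, so rename afterwards if needed)
vitals_fake_DT[, lapply(.SD, function(x) sum(!is.na(x))), by = PCRID]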

Fake data with 20 million rows and 10 million groups:

rows = 20000000
grps = 10000000 # max, somewhat less in practice
set.seed(42)
vitals_fake <- data.frame(
  PCRID = sample(1:grps, size = rows, replace = T),
  avpu = sample(c(NA, 1:10), size = rows, replace = T),
  sbp = sample(c(NA, 1:10), size = rows, replace = T),
  dbp = sample(c(NA, 1:10), size = rows, replace = T),
  pulserate    = sample(c(NA, 1:10), size = rows, replace = T),
  rr    = sample(c(NA, 1:10), size = rows, replace = T),
  spo2  = sample(c(NA, 1:10), size = rows, replace = T),
  etco2 = sample(c(NA, 1:10), size = rows, replace = T),
  co    = sample(c(NA, 1:10), size = rows, replace = T),
  glucose   = sample(c(NA, 1:10), size = rows, replace = T),
  tempf  = sample(c(NA, 1:10), size = rows, replace = T),
  painscale  = sample(c(NA, 1:10), size = rows, replace = T),
  gcs   = sample(c(NA, 1:10), size = rows, replace = T)
)
Jon Spring

The answer is highly dependent on what the data looks like, especially how many rows per group there are.

For instance, with 100,000 groups and 42 rows per group (i.e., 4,200,000 total rows), I get 2 seconds for data.table versus 84 seconds for dplyr. For the same total rows with only 100 groups, I get 0.28 seconds for data.table and 0.37 seconds for dplyr.

I also ran @Jon Spring's example with 2 rows per group and 10,000,000 groups. My data.table solution took 339 seconds, and I stopped my dplyr version at 2,464 seconds. Maybe part of the solution is to get a better processor like @Jon's :).

EDIT: I think that if there are a lot of groups, melting/gathering the data first is faster. It takes about 60 seconds for @Jon's 10,000,000-group example. Note: casting it back to wide format adds another 100 seconds, which still ends up about twice as fast as the pure data.table approach.

# long format: non-NA count for each (ID, variable) pair
melt(dt, id.vars = 'ID')[!is.na(value), .N, by = .(ID, variable)]
# or, casting back to wide format:
dcast(melt(dt, id.vars = 'ID')[!is.na(value), .N, by = .(ID, variable)], ID ~ variable)

Here are the function calls I used. Note I used summarize_all() because I didn't have it in me to write out all of those columns.

# Assume using all columns except the ID column

# data.table
dt[, lapply(.SD, function(x) sum(!is.na(x))), by = ID]

# dplyr
tib %>%
  group_by(ID) %>%
  summarize_all(~sum(!is.na(.)))
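
One way to time the comparison (not necessarily how the numbers above were produced; just a sketch using the dt and tib objects from the Data section below):

# time each approach on the same data
system.time(dt[, lapply(.SD, function(x) sum(!is.na(x))), by = ID])
system.time(tib %>% group_by(ID) %>% summarize_all(~sum(!is.na(.))))
system.time(melt(dt, id.vars = 'ID')[!is.na(value), .N, by = .(ID, variable)])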

Data:

n_groups <- 10
n_rows <- 42
n_cols <- 12

NA_prob <- 0.3

library(data.table)
library(dplyr)

set.seed(0)
dt <- data.table(ID = rep(seq_len(n_groups), each = n_rows)
           , matrix(sample(x = c(NA_integer_, 0L)
                           , size = n_rows * n_cols * n_groups
                           , replace = T
                           , prob = c(NA_prob, 1 - NA_prob))
                    , ncol = 12)
           )

tib <- as_tibble(dt)
Cole

I made an attempt at this. I think you can use Hadley Wickham's multidplyr, which takes advantage of multiple cores. You use partition instead of group_by, and after summarise you collect the result.

I also made the code more dynamic by using rename_at to rename the columns and mutate_at to create values of 1 and 0 before summarising the data. dummy_ (from hablar) returns 1 if a value is not NA and 0 otherwise. This code seemed to run fast:

# devtools::install_github("hadley/multidplyr")
library(dplyr)
library(multidplyr)
library(hablar)

vitals_all <- vitals_all.df %>% 
  rename_at(vars(-PCRID), ~paste0("n_", toupper(.))) %>% 
  mutate_at(vars(-PCRID), ~dummy_(!is.na(.))) %>% 
  partition(PCRID) %>% 
  summarise_all(~sum(.)) %>% 
  collect()

Fake data borrowed from Jon Spring (thanks!):

rows = 20000000
grps = 10000000 # max, somewhat less in practice
set.seed(42)
vitals_all.df <- data.frame(
  PCRID = sample(1:grps, size = rows, replace = T),
  avpu = sample(c(NA, 1:10), size = rows, replace = T),
  sbp = sample(c(NA, 1:10), size = rows, replace = T),
  dbp = sample(c(NA, 1:10), size = rows, replace = T),
  pulserate    = sample(c(NA, 1:10), size = rows, replace = T),
  rr    = sample(c(NA, 1:10), size = rows, replace = T),
  spo2  = sample(c(NA, 1:10), size = rows, replace = T),
  etco2 = sample(c(NA, 1:10), size = rows, replace = T),
  co    = sample(c(NA, 1:10), size = rows, replace = T),
  glucose   = sample(c(NA, 1:10), size = rows, replace = T),
  tempf  = sample(c(NA, 1:10), size = rows, replace = T),
  painscale  = sample(c(NA, 1:10), size = rows, replace = T),
  gcs   = sample(c(NA, 1:10), size = rows, replace = T)
)

I have not taken into account your filtering and extra manipulation of the data frame; just add them if you want. Also, if you have more columns than the ones used above, you may want to drop them before applying my code, since it applies functions to "all" columns.
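
For example, here is a minimal sketch of how the question's rename and filter could slot in, assuming (as in the question) lowercase pcrid and pta columns, and that any other non-vital columns such as the timestamp get dropped before summarising:

vitals_all <- vitals_all.df %>%
  select(PCRID = pcrid, everything()) %>%   # rename the ID as in the question
  filter(pta %in% c("no", "unk")) %>%       # same filter; %in% drops NAs implicitly
  select(-pta) %>%                          # drop the filter column so it isn't counted
  # drop any other non-vital columns (e.g. the timestamp) here as well
  rename_at(vars(-PCRID), ~paste0("n_", toupper(.))) %>%
  mutate_at(vars(-PCRID), ~dummy_(!is.na(.))) %>%
  partition(PCRID) %>%
  summarise_all(~sum(.)) %>%
  collect()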

davsjob