I have a table with 22 million rows, each containing one set of vital signs, a patient ID, and a time. I'm trying to get a summary table with one row per patient ID and, for each vital-sign column, the count of non-missing (non-NA) values.
My code below executes and gives reasonable answers but takes forever. I'm wondering if there is a more efficient approach to this problem. Any optimization strategies?
The code below renames the ID column from "pcrid" to "PCRID" so the resulting table is compatible with my legacy code, and it also filters the table a bit. Those select/filter steps run quickly on the entire dataset, so they don't seem to be the slow part.
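For reference, here is a tiny made-up illustration of the relevant columns; the column names match my data, but the values are invented and only a few of the twelve vital-sign columns are shown:

library(tibble)

# Purely illustrative rows -- values are invented, not real patient data.
vitals_example <- tribble(
  ~pcrid,  ~pta,  ~avpu, ~sbp, ~pulserate,
  "A001",  "no",  "A",    120,  80,
  "A001",  "no",  NA,     118,  NA,
  "A002",  "unk", "V",     NA,  95
)

# Desired summary for this toy input: one row per PCRID with non-NA counts,
# e.g. PCRID "A001" -> n_AVPU = 1, n_SBP = 2, n_HR = 1.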
Here's how it performed on different dataset sizes, subsetting with head() (a rough sketch of the timing harness follows the list):
- 1,000 rows = 0.2 seconds
- 10,000 rows = 1.7 seconds
- 100,000 rows = 15 seconds
- 1,000,000 rows = 2.9 minutes
- 22,000,000 rows = 42 minutes
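Roughly how those timings were produced (the helper below is only a sketch of the idea, not my exact script; the full summarise pipeline is shown next):

# Time an arbitrary pipeline (passed in as a function) on the first n rows.
time_on_first_n <- function(df, n, pipeline) {
  subset_df <- head(df, n)   # benchmark on the first n rows of the full table
  start <- Sys.time()
  pipeline(subset_df)        # e.g. the group_by()/summarise() code below
  Sys.time() - start         # elapsed wall-clock time
}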
library(dplyr)

Start <- Sys.time()

vitals_all <- vitals_all.df %>%
  # Rename the ID column so the result matches my legacy code
  select(PCRID = pcrid, everything()) %>%
  # Keep only rows where pta is "no" or "unk"
  filter((pta == "no" | pta == "unk") & !is.na(pta)) %>%
  group_by(PCRID) %>%
  # Count the non-missing values of each vital sign per patient
  summarise(
    n_AVPU  = sum(!is.na(avpu)),
    n_SBP   = sum(!is.na(sbp)),
    n_DBP   = sum(!is.na(dbp)),
    n_HR    = sum(!is.na(pulserate)),
    n_RR    = sum(!is.na(rr)),
    n_SpO2  = sum(!is.na(spo2)),
    n_EtCO2 = sum(!is.na(etco2)),
    n_CO    = sum(!is.na(co)),
    n_BGL   = sum(!is.na(glucose)),
    n_Temp  = sum(!is.na(tempf)),
    n_Pain  = sum(!is.na(painscale)),
    n_GCS   = sum(!is.na(gcs))
  )

Sys.time() - Start
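In case it helps frame suggestions: the same counts can be written more compactly with across() (dplyr >= 1.0.0). This is only a restyling of the code above, not necessarily any faster, and the output columns come out as n_avpu, n_sbp, ... rather than the exact n_AVPU / n_HR names I use:

# Same counting logic written with across(); requires dplyr >= 1.0.0.
vitals_counts <- vitals_all.df %>%
  select(PCRID = pcrid, everything()) %>%
  filter((pta == "no" | pta == "unk") & !is.na(pta)) %>%
  group_by(PCRID) %>%
  summarise(across(c(avpu, sbp, dbp, pulserate, rr, spo2, etco2,
                     co, glucose, tempf, painscale, gcs),
                   ~ sum(!is.na(.x)),
                   .names = "n_{.col}"))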