Fully aware that this type of question has been asked hundreds of times.
Still, I could not find an answer to the specific problem I am describing, which is about:
- performance (i.e. I know how to do what I need, but it's too slow in some cases, so I am looking for a faster solution)
- good programming practice (i.e. I am questioning whether the approach I chose is 'clean' rather than roundabout or inefficient for other reasons)
I have a data.frame with numerical and character columns. I need to create a data.frame summary from it, grouping by one of the character columns (ID
), and reporting 1) some stats about some numerical columns within each group, and 2) some character concatenations (i.e. the report has mixed data types - that's what makes it tricky, at least for me, and that's why I am asking for advice).
Here is the R
script:
# Simulate original data.frame
set.seed(384092)
N <- 10000
d <- data.frame("ID" = paste0(sample(LETTERS, N, replace = T), sprintf("%03.0f", sample(1:floor(sqrt(N)), N, replace = T )) ), stringsAsFactors = F)
d["set"] <- sample(LETTERS, N, replace = T)
d["P"] <- runif(N, -20, 120)
d["K"] <- rnorm(N, 10, 0.5)
# Make summary
# For each unique ID, report: ID, number of rows of d, mean of P, sd of P, comma-separated list of unique set's
# Method 1: rbind data.frames from 'by'
time.1 <- system.time({
d_summary.1 <- do.call(rbind, by(d, d$ID, function(dd) {
data.frame("ID" = dd$ID[1], "N" = nrow(dd), "P_mean" = mean(dd$P), "P_sd" = sd(dd$P), "sets" = paste(unique(dd$set), collapse = ","))
})
)
})
cat("\ntime.1 =",time.1,"\n")
print(sapply(d_summary.1, class))
# Method 2: create a list of lists and combine them at the end
# https://stackoverflow.com/a/68162050/6376297
time.2 <- system.time({
time.2.1 <- system.time({d_summary.2 <- by(d, d$ID, function(dd) {
list("ID" = dd$ID[1], "N" = nrow(dd), "P_mean" = mean(dd$P), "P_sd" = sd(dd$P), "sets" = paste(unique(dd$set), collapse = ","))
})
})
d_summary.2 <- do.call(rbind, lapply(d_summary.2, data.frame))
})
cat("\ntime.2.1 =",time.2.1)
cat("\ntime.2 =",time.2,"\n")
print(sapply(d_summary.2, class))
which on my PC produces the following output:
time.1 = 1.72 0 1.72 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.2.1 = 0.3 0 0.29 NA NA
time.2 = 1.79 0 1.82 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
The linked post https://stackoverflow.com/a/68162050/6376297 specifically mentions that the kind of handling used in method 2 is necessary to avoid coercing all columns to a single data type.
And indeed, any solution I tried that relies on making an intermediate matrix, as fully expected, results in coercion to character.
This is really unfortunate, because as shown by time.2.1
, the initial formation of a list of lists containing the required information (and still retaining all the original data types) takes only 1/6 - 1/5 of the total time.
And you need to imagine that I am doing this on d
's at least 10-100 times larger than this example.
Would anyone be able to advise / suggest a faster method to do this?
Thanks!
EDIT : follow up from users' feedback
Trial of the dplyr
(4) and data.table
(5) method, plus a couple more base R
methods (using aggregate
, (6) and (7)) that are more involved but might be somewhat competitive with those two.
# Method 4: dplyr
require(dplyr)
time.4 <- system.time({
d %>%
group_by(ID) %>%
summarise(N = n(),
P_mean = mean(P),
P_sd = sd(P),
sets = paste(unique(set), collapse = ",")) -> d_summary.4
})
cat("\ntime.4 =",time.4,"\n")
print(sapply(d_summary.4, class))
# Method 5: data.table
require(data.table)
time.5 <- system.time({
setDT(d)
d_summary.5 <- d[, .(N = .N,
P_mean = mean(P),
P_sd = sd(P),
sets = toString(unique(set))), ID]
d_summary.5 <- as.data.frame(d_summary.5)
})
cat("\ntime.5 =",time.5,"\n")
print(sapply(d_summary.5, class))
# Method 6: aggregate each column separately and merge
time.6 <- system.time({
d_summary.6 <- setNames(as.data.frame(table(d$ID), stringsAsFactors = F),c("ID","N"))
d_summary.6 <- merge(d_summary.6, setNames(aggregate(P ~ ID, data = d, FUN = mean),c("ID","P_mean")), by = "ID")
d_summary.6 <- merge(d_summary.6, setNames(aggregate(P ~ ID, data = d, FUN = sd),c("ID","P_sd")), by = "ID")
d_summary.6 <- merge(d_summary.6, setNames(aggregate(set ~ ID, data = d, FUN = function(x) {paste(unique(x),collapse=",")}),c("ID","sets")), by = "ID")
})
cat("\ntime.6 =",time.6,"\n")
print(sapply(d_summary.6, class))
# Method 7: aggregate each column separately and cbind (this assumes that both table and aggregate will report all values of ID, sorted)
time.7 <- system.time({
d_summary.7 <- setNames(as.data.frame(table(d$ID), stringsAsFactors = F),c("ID","N"))
d_summary.7 <- cbind(d_summary.7, "P_mean" = aggregate(P ~ ID, data = d, FUN = mean)[,2])
d_summary.7 <- cbind(d_summary.7, "P_sd" = aggregate(P ~ ID, data = d, FUN = sd)[,2])
d_summary.7 <- cbind(d_summary.7, "sets" = aggregate(set ~ ID, data = d, FUN = function(x) {paste(unique(x),collapse=",")})[,2])
})
cat("\ntime.7 =",time.7,"\n")
print(sapply(d_summary.7, class))
Timing:
time.1 = 1.73 0.02 1.77 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.2.1 = 0.29 0 0.3 NA NA
time.2 = 1.83 0.01 1.84 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.4 = 0.13 0 0.13 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.5 = 0.08 0 0.08 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.6 = 0.25 0 0.25 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"
time.7 = 0.25 0 0.25 NA NA
ID N P_mean P_sd sets
"character" "integer" "numeric" "numeric" "character"