
I am fully aware that this type of question has been asked hundreds of times.
Still, I could not find an answer to the specific problem I am describing, which concerns:

  • performance (i.e. I know how to do what I need, but it's too slow in some cases, so I am looking for a faster solution)
  • good programming practice (i.e. I am questioning whether the approach I chose is 'clean' rather than roundabout or inefficient for other reasons)

I have a data.frame with numerical and character columns. I need to create a summary data.frame from it, grouped by one of the character columns (ID), reporting 1) some statistics of some numerical columns within each group, and 2) some character concatenations. In other words, the report has mixed data types; that is what makes it tricky, at least for me, and that is why I am asking for advice.
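
For concreteness, with the simulated data below (the seed makes it reproducible), one summary row looks like this:

#     ID N   P_mean     P_sd  sets
# 1 A001 4 27.43196 42.10786 N,Z,C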

Here is the R script:

# Simulate original data.frame

set.seed(384092)

N <- 10000

d <- data.frame("ID" = paste0(sample(LETTERS, N, replace = TRUE),
                              sprintf("%03.0f", sample(1:floor(sqrt(N)), N, replace = TRUE))),
                stringsAsFactors = FALSE)
d["set"] <- sample(LETTERS, N, replace = TRUE)
d["P"] <- runif(N, -20, 120)
d["K"] <- rnorm(N, 10, 0.5)

# Make summary
# For each unique ID, report: ID, number of rows of d, mean of P, sd of P, comma-separated list of the unique values of set

# Method 1: rbind data.frames from 'by'

time.1 <- system.time({
  d_summary.1 <- do.call(rbind, by(d, d$ID, function(dd) {
    data.frame("ID" = dd$ID[1], "N" = nrow(dd), "P_mean" = mean(dd$P),
               "P_sd" = sd(dd$P), "sets" = paste(unique(dd$set), collapse = ","))
  }))
})

cat("\ntime.1 =",time.1,"\n")
print(sapply(d_summary.1, class))

# Method 2: create a list of lists and combine them at the end
# https://stackoverflow.com/a/68162050/6376297

time.2 <- system.time({
  time.2.1 <- system.time({
    d_summary.2 <- by(d, d$ID, function(dd) {
      list("ID" = dd$ID[1], "N" = nrow(dd), "P_mean" = mean(dd$P),
           "P_sd" = sd(dd$P), "sets" = paste(unique(dd$set), collapse = ","))
    })
  })
  d_summary.2 <- do.call(rbind, lapply(d_summary.2, data.frame))
})

cat("\ntime.2.1 =",time.2.1)
cat("\ntime.2 =",time.2,"\n")
print(sapply(d_summary.2, class))

which on my PC produces the following output:

time.1 = 1.72 0 1.72 NA NA 
         ID           N      P_mean        P_sd        sets 
"character"   "integer"   "numeric"   "numeric" "character" 

time.2.1 = 0.3 0 0.29 NA NA
time.2 = 1.79 0 1.82 NA NA 
         ID           N      P_mean        P_sd        sets 
"character"   "integer"   "numeric"   "numeric" "character"

The linked post https://stackoverflow.com/a/68162050/6376297 specifically mentions that the kind of handling used in Method 2 is necessary to avoid coercing all columns to a single data type.
And indeed, any solution I tried that relies on building an intermediate matrix results, as fully expected, in coercion to character.
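For instance, a minimal sketch of the kind of matrix-based attempt I mean (illustrative only):

# Routing the result through rbind on plain vectors builds a matrix,
# and a matrix holds a single type, so everything becomes character
m <- do.call(rbind, by(d, d$ID, function(dd) {
  c("ID" = dd$ID[1], "N" = nrow(dd), "P_mean" = mean(dd$P))
}))
typeof(m)   # "character" -- the numeric columns are now strings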

This is really unfortunate, because as shown by time.2.1, the initial formation of a list of lists containing the required information (while still retaining all the original data types) takes only 1/6 - 1/5 of the total time; the rest is spent turning it into a data.frame.
And bear in mind that I am doing this on data.frames at least 10-100 times larger than this example.

Would anyone be able to advise / suggest a faster method to do this?

Thanks!


EDIT: follow-up based on users' feedback

Here is a trial of the dplyr (Method 4) and data.table (Method 5) approaches, plus two more base R methods using aggregate (Methods 6 and 7) that are more involved but might be somewhat competitive with the first two.

# Method 4: dplyr

library(dplyr)

time.4 <- system.time({
  d_summary.4 <- d %>% 
    group_by(ID) %>% 
    summarise(N = n(),
              P_mean = mean(P),
              P_sd = sd(P),
              sets = paste(unique(set), collapse = ","))
})

cat("\ntime.4 =",time.4,"\n")
print(sapply(d_summary.4, class))

# Method 5: data.table

library(data.table)

time.5 <- system.time({
  setDT(d)  # note: this converts d to a data.table in place
  
  d_summary.5 <- d[, .(N = .N, 
        P_mean = mean(P), 
        P_sd = sd(P), 
        sets = toString(unique(set))), ID]
  
  d_summary.5 <- as.data.frame(d_summary.5)
  
})

cat("\ntime.5 =",time.5,"\n")
print(sapply(d_summary.5, class))

# Method 6: aggregate each column separately and merge

time.6 <- system.time({
  
  d_summary.6 <- setNames(as.data.frame(table(d$ID), stringsAsFactors = FALSE), c("ID", "N"))
  d_summary.6 <- merge(d_summary.6, setNames(aggregate(P ~ ID, data = d, FUN = mean), c("ID", "P_mean")), by = "ID")
  d_summary.6 <- merge(d_summary.6, setNames(aggregate(P ~ ID, data = d, FUN = sd), c("ID", "P_sd")), by = "ID")
  d_summary.6 <- merge(d_summary.6, setNames(aggregate(set ~ ID, data = d, FUN = function(x) paste(unique(x), collapse = ",")), c("ID", "sets")), by = "ID")
  
})

cat("\ntime.6 =",time.6,"\n")
print(sapply(d_summary.6, class))

# Method 7: aggregate each column separately and cbind (this assumes that table and aggregate both report all values of ID in the same sorted order; see the sanity check after this block)

time.7 <- system.time({
  
  d_summary.7 <- setNames(as.data.frame(table(d$ID), stringsAsFactors = FALSE), c("ID", "N"))
  d_summary.7 <- cbind(d_summary.7, "P_mean" = aggregate(P ~ ID, data = d, FUN = mean)[, 2])
  d_summary.7 <- cbind(d_summary.7, "P_sd" = aggregate(P ~ ID, data = d, FUN = sd)[, 2])
  d_summary.7 <- cbind(d_summary.7, "sets" = aggregate(set ~ ID, data = d, FUN = function(x) paste(unique(x), collapse = ","))[, 2])
  
})

cat("\ntime.7 =",time.7,"\n")
print(sapply(d_summary.7, class))
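
As a sanity check on the ordering assumption behind Method 7 (a minimal sketch; it compares the ID column built from table with the one aggregate produces):

# both should emit the unique IDs in the same sorted order
stopifnot(identical(d_summary.7$ID,
                    aggregate(P ~ ID, data = d, FUN = mean)$ID))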

Timing:

time.1 = 1.73 0.02 1.77 NA NA 
         ID           N      P_mean        P_sd        sets 
"character"   "integer"   "numeric"   "numeric" "character" 

time.2.1 = 0.29 0 0.3 NA NA
time.2 = 1.83 0.01 1.84 NA NA 
         ID           N      P_mean        P_sd        sets 
"character"   "integer"   "numeric"   "numeric" "character" 

time.4 = 0.13 0 0.13 NA NA 
         ID           N      P_mean        P_sd        sets 
"character"   "integer"   "numeric"   "numeric" "character" 

time.5 = 0.08 0 0.08 NA NA 
         ID           N      P_mean        P_sd        sets 
"character"   "integer"   "numeric"   "numeric" "character" 

time.6 = 0.25 0 0.25 NA NA 
         ID           N      P_mean        P_sd        sets 
"character"   "integer"   "numeric"   "numeric" "character" 

time.7 = 0.25 0 0.25 NA NA 
         ID           N      P_mean        P_sd        sets 
"character"   "integer"   "numeric"   "numeric" "character" 
user6376297

3 Answers


You could use dplyr for this task:

library(dplyr)
d %>% 
  group_by(ID) %>% 
  summarise(N = n(),
            P_mean = mean(P),
            P_sd = sd(P),
            sets = paste(unique(set), collapse = ","))

returns

# A tibble: 2,553 x 5
   ID        N P_mean  P_sd sets     
   <chr> <int>  <dbl> <dbl> <chr>    
 1 A001      4   27.4  42.1 N,Z,C    
 2 A002      3   46.6  40.6 Z,R,L    
 3 A003      5   31.8  28.0 S,F,X,H,U
 4 A004      5   46.4  36.0 H,W,U,P,R
 5 A005      3   53.6  24.7 I,Y,B    
 6 A006      2   58.9  61.9 V,J      
 7 A007      5   68.2  53.8 Y,X,W,N,F
 8 A008      4   64.5  14.0 X,I,V,D  
 9 A009      1   61.4  NA   L        
10 A010      2   95.5  30.0 S,L      
# ... with 2,543 more rows
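
One note: summarise() returns a tibble. If a plain data.frame is needed, a final conversion step suffices (a small sketch of the same pipeline):

d %>% 
  group_by(ID) %>% 
  summarise(N = n(),
            P_mean = mean(P),
            P_sd = sd(P),
            sets = paste(unique(set), collapse = ",")) %>% 
  as.data.frame()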

Compared (on my machine) to your other methods:

time.1 = 1.02 0 1.02 NA NA 

time.2.1 = 0.17 0 0.17 NA NA

time.2 = 1.11 0 1.11 NA NA 

# dplyr-method
time.3 = 0.07 0 0.08 NA NA 
         ID           N      P_mean        P_sd        sets 
"character"   "integer"   "numeric"   "numeric" "character" 
Martin Gal
  • Thanks! Your method is 15-20 times faster than mine on the N = 10000 case. I usually refrain from using non-base R, but dplyr is a very popular package, so I might have to give in. I will post an update. – user6376297 Jul 24 '21 at 07:31

You can try a data.table approach:

library(data.table)

setDT(d)

d[, .(N = .N, 
      P_mean = mean(P), 
      P_sd = sd(P), 
      sets = toString(unique(set))), ID]

#        ID N P_mean P_sd             sets
#   1: M074 6  66.30 32.1 I, O, K, S, W, Y
#   2: E016 4  60.23 25.3       E, Y, I, L
#   3: W043 3  46.62 46.2          Q, U, L
#   4: Y059 5  93.59 26.8    G, T, L, O, S
#   5: R073 7  61.16 44.1    N, P, M, I, S
#  ---                                    
#2549: B012 2   6.68 27.7             Z, G
#2550: H088 1  -4.08   NA                X
#2551: T052 1  27.65   NA                E
#2552: C087 1  74.33   NA                M
#2553: Q021 1  30.29   NA                P
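
One caveat: setDT() converts d to a data.table in place. If you prefer to leave the original data.frame untouched, converting a copy works too (a small sketch):

dt <- as.data.table(d)   # copies d instead of modifying it by reference
dt[, .(N = .N, 
       P_mean = mean(P), 
       P_sd = sd(P), 
       sets = toString(unique(set))), ID]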
Ronak Shah
  • Thanks! Same as for the other answer, your method is 15-20 times faster than mine on the N = 10000 case. I thought I had read somewhere that data.table was now used for all standard data.frame R handling; apparently not. I will post an update with a trial of your method, too. – user6376297 Jul 24 '21 at 07:32

Consider using collapse (funique() is its faster analogue of base unique()):

library(collapse)
fpaste <- function(x) toString(funique(x))
out <- collap(d, ~ ID, custom = list(fnobs = "set",
      fmean = "P", fsd = "P", fpaste = "set"))

Output:

head(out)
    ID fnobs.set    fpaste.set  fmean.P    fsd.P
1 A001         4       N, Z, C 27.43196 42.10786
2 A002         3       Z, R, L 46.57773 40.55696
3 A003         5 S, F, X, H, U 31.84874 27.96048
4 A004         5 H, W, U, P, R 46.37885 36.03823
5 A005         3       I, Y, B 53.62615 24.67470
6 A006         2          V, J 58.91548 61.88600
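
If you want column names matching the question's summary, you can rename and reorder afterwards (a sketch based on the column order shown above):

names(out) <- c("ID", "N", "sets", "P_mean", "P_sd")
out <- out[c("ID", "N", "P_mean", "P_sd", "sets")]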

Benchmarks

# d regenerated at this size with the simulation code from the question
N <- 1000000

system.time({
  out <- collap(d, ~ ID, custom = list(fnobs = "set",
        fmean = "P", fsd = "P", fpaste = "set"))
})
#   user  system elapsed 
#  0.513   0.015   0.526 

system.time({
  setDT(d)
  d[, .(N = .N, 
        P_mean = mean(P), 
        P_sd = sd(P), 
        sets = toString(unique(set))), ID]
})
#   user  system elapsed 
#  0.646   0.015   0.659 
akrun