R - Consolidate values of identifier that are distributed in a list of data frames

Question

I have a list of data frames

df1 <- data.frame(Symbol = c("A", "A", "B", "C", "D", "D", "D", "E", "E", "A"),
                     Sample1 = sample(50, 10),
                     Sample2 = sample(50, 10),
                     Sample3 = sample(50, 10))
df2 <- data.frame(Symbol = c("B", "B", "B", "E", "A"),
                  Sample4 = sample(50, 5),
                  Sample5 = sample(50, 5),
                  Sample6 = sample(50, 5))
df3 <- data.frame(Symbol = c("A", "B", "C", "D", "E", "F"),
                  Sample7 = sample(50, 6),
                  Sample8 = sample(50, 6),
                  Sample9 = sample(50, 6))

As you can see, each data frame has a column called Symbol, which contains values that appear in multiple data frames in the list. I'd like to be able to consolidate them so I can select some basic stats (e.g. mean, 25th percentile, etc...). Initially, I simply did this

dfList <- list(df1, df2, df3)
df <- reduce(dfList, merge, "Symbol", all = T)

thinking that I'd just take the mean of each row. However, I didn't know that symbols were appearing multiple times in each data frame. So the above line actually consolidated each occurrence of say A in the first data frame, with each occurrence of A in the second and third data frames. This was not what I wanted.

The messy solution is to just get the unique Symbol identifiers from the entire list, and write an lapply over them in which I search for each identifier in each data frame, and extract the values. I'll then have a list of vectors, with the names being the Symbol identifiers. Something like this

syms <- unique(as.character(unlist(lapply(dfList, function(x) return(x$Symbol)))))
lst <- lapply(syms, function(x) {
  return(unlist(lapply(dfList, function(y) {
    return(unlist(y[grep(x, y$Symbol),-1]))
  })))
})
names(lst) <- syms

Then I can just take the mean, median, etc...

This is very slow on my actual data frame, and I just think there must be an easier way to do this.

Edit: So in the end, I'd like to have somethign like this:

Mean = t(as.data.frame(lapply(lst, function(x) {
  return(mean(x))
})))
Min = t(as.data.frame(lapply(lst, function(x) {
  return(min(x))
})))
Quantiles = t(as.data.frame(lapply(lst, function(x) {
  return(quantile(x, c(0.25, 0.5, 0.75)))
})))
Max = t(as.data.frame(lapply(lst, function(x) {
  return(max(x))
})))
df <- cbind(Mean, Min, Quantiles, Max)
colnames(df) <- c("Mean", "Min", "p25", "p50", "p75", "Max")

This will produce

> df
      Mean Min   p25  p50   p75 Max
A 27.53333  13 21.50 27.0 32.50  48
B 24.33333   2 11.00 21.0 37.50  48
C 26.50000   3 24.00 26.5 32.75  45
D 25.00000   2 12.25 23.5 38.75  49
E 23.25000   2 12.75 22.0 33.75  45
F 21.00000  11 11.50 12.0 26.00  40

I suggest that you back up a few steps and get the data into tall (tidy) format. — Bill O'Brien, Sep 09 '19 at 17:03
@Parfait: Sorry about that. I have edited my question to include what I wanted the result to be — Zuhaib Ahmed, Sep 09 '19 at 17:06
So you need to combine all samples into **one** sample for aggregations? — Parfait, Sep 09 '19 at 17:06
Thanks. It seems to work. Do you think you'd be able to provide an explanation for it? I have no idea what's happening in some of these arguments. — Zuhaib Ahmed, Sep 09 '19 at 17:16

score 1 · Accepted Answer · answered Sep 09 '19 at 17:18

Consider the chain merge by Symbol followed by reshape from wide to long format and aggregate across multiple functions:

# MERGE ALL BY Symbol
dfList <- list(df1, df2, df3)
mdf <- Reduce(function(...) merge(..., by="Symbol", all=TRUE), dfList)

# RESHAPE WIDE TO LONG
rdf <- reshape(mdf, varying=names(mdf)[-1], times=names(mdf)[-1],
               v.name = "Value", timevar="Sample",
               new.row.names = 1:1E4, direction = "long")

# AGGREGATE DATA
agg_raw <- aggregate(Value ~ Symbol, rdf, 
                     function(x) c(Mean=mean(x), Min=min(x), p25=unname(quantile(x)[2]), 
                                   p50=median(x), p75=unname(quantile(x)[4]), Max=max(x)))

agg_df <- do.call(data.frame, agg_raw)
colnames(agg_df) <- gsub("Value.", "", colnames(agg_df))
agg_df

#   Symbol     Mean Min   p25  p50   p75 Max
# 1      A 30.77778   3 26.00 33.0 38.00  47
# 2      B 25.29630   3 16.00 23.0 30.50  47
# 3      C 23.00000   4 14.25 24.0 33.75  38
# 4      D 30.77778   2 17.75 34.0 43.00  49
# 5      E 25.38889   3 15.00 22.5 39.75  50
# 6      F 15.66667  13 13.00 13.0 17.00  21

Rextester demo

score 1 · Answer 2 · answered Sep 09 '19 at 17:29

A slightly different possibility could be:

df_agg <- Reduce(function(...) merge(..., all = TRUE), list(df1, df2, df3))
df_agg <- data.frame(df_agg[1], stack(df_agg[-1]))

aggregate(values ~ Symbol, df_agg, summary)

  Symbol values.Min. values.1st Qu. values.Median values.Mean values.3rd Qu.
1      A     1.00000       11.50000      18.00000    22.25926       34.00000
2      B     5.00000       27.00000      36.00000    34.44444       40.50000
3      C     6.00000        9.75000      25.50000    27.33333       45.75000
4      D     8.00000       20.00000      28.50000    26.00000       36.00000
5      E     2.00000       10.25000      23.00000    22.88889       35.00000
6      F     9.00000       11.50000      14.00000    17.33333       21.50000
  values.Max.
1    42.00000
2    50.00000
3    50.00000
4    43.00000
5    48.00000
6    29.00000

score 1 · Answer 3 · answered Sep 09 '19 at 19:02

d = do.call(rbind, lapply(list(df1, df2, df3), reshape2::melt, id.vars = "Symbol"))
doBy::summaryBy(value ~ Symbol, d, FUN = function(x){
    c(Mean = mean(x),
      Min = min(x),
      quantile(x, c(0.25, 0.5, 0.75)),
      Max = max(x))
})
#  Symbol value.Mean value.Min value.25% value.50% value.75% value.Max
#1      A   28.93333         4     13.50      36.0      40.0        48
#2      B   24.86667         1     12.00      28.0      35.5        47
#3      C   20.00000         9     10.25      18.5      26.0        38
#4      D   32.25000         1     21.00      36.0      45.5        50
#5      E   32.50000        13     29.75      32.5      39.5        49
#6      F   28.66667        13     22.00      31.0      36.5        42

score 0 · Answer 4 · answered Sep 09 '19 at 17:14

Does this get you close to what you need?

library(dplyr)
library(tidyr)

df1 <- data.frame(Symbol = c("A", "A", "B", "C", "D", "D", "D", "E", "E", "A"),
                  Sample1 = sample(50, 10),
                  Sample2 = sample(50, 10),
                  Sample3 = sample(50, 10))
df2 <- data.frame(Symbol = c("B", "B", "B", "E", "A"),
                  Sample4 = sample(50, 5),
                  Sample5 = sample(50, 5),
                  Sample6 = sample(50, 5))
df3 <- data.frame(Symbol = c("A", "B", "C", "D", "E", "F"),
                  Sample7 = sample(50, 6),
                  Sample8 = sample(50, 6),
                  Sample9 = sample(50, 6))

df1.long <- gather(df1, key = "Sample", value = "Value", -Symbol)
df2.long <- gather(df2, key = "Sample", value = "Value", -Symbol)
df3.long <- gather(df3, key = "Sample", value = "Value", -Symbol)

df.long <- bind_rows(df1.long, df2.long, df3.long)
df.long$Symbol <- factor(df.long$Symbol)
df.long$Sample <- factor(df.long$Sample)

df.long %>%
  group_by(Symbol, Sample) %>%
  summarize(mean = mean(Value),
            sd = sd(Value)) %>%
  print(n = nrow(.))

# # A tibble: 42 x 4
# # Groups:   Symbol [6]
# Symbol Sample   mean    sd
# <fct>  <fct>   <dbl> <dbl>
#   1 A      Sample1  19.3 12.2 
# 2 A      Sample2  30   15.6 
# 3 A      Sample3  18.3 14.5 
# 4 A      Sample4  14   NA   
# 5 A      Sample5  21   NA   
# 6 A      Sample6   2   NA   
# 7 A      Sample7   1   NA   
# 8 A      Sample8  42   NA   
# 9 A      Sample9  50   NA   
# 10 B      Sample1  12   NA   
# 11 B      Sample2  42   NA   
# 12 B      Sample3   3   NA   
# 13 B      Sample4  17.7 13.8 
# 14 B      Sample5  27.3 20.8 
# 15 B      Sample6  20   13.1 
# 16 B      Sample7  38   NA   
# 17 B      Sample8  30   NA   
# 18 B      Sample9  20   NA   
# 19 C      Sample1  32   NA   
# 20 C      Sample2  45   NA   
# 21 C      Sample3  35   NA   
# 22 C      Sample7   4   NA   
# 23 C      Sample8  48   NA   
# 24 C      Sample9  38   NA   
# 25 D      Sample1  28.3 22.0 
# 26 D      Sample2  24.7 22.5 
# 27 D      Sample3  20   22.5 
# 28 D      Sample7  14   NA   
# 29 D      Sample8  36   NA   
# 30 D      Sample9  36   NA   
# 31 E      Sample1  33   18.4 
# 32 E      Sample2  22    4.24
# 33 E      Sample3  33    8.49
# 34 E      Sample4  39   NA   
# 35 E      Sample5  43   NA   
# 36 E      Sample6  28   NA   
# 37 E      Sample7  33   NA   
# 38 E      Sample8  18   NA   
# 39 E      Sample9  14   NA   
# 40 F      Sample7  20   NA   
# 41 F      Sample8   2   NA   
# 42 F      Sample9  23   NA

I'm sorry for the ambiguity. I've edited my question to include what I want the final output to look like. — Zuhaib Ahmed, Sep 09 '19 at 17:17

R - Consolidate values of identifier that are distributed in a list of data frames

4 Answers4