I have a list of data frames
df1 <- data.frame(Symbol = c("A", "A", "B", "C", "D", "D", "D", "E", "E", "A"),
Sample1 = sample(50, 10),
Sample2 = sample(50, 10),
Sample3 = sample(50, 10))
df2 <- data.frame(Symbol = c("B", "B", "B", "E", "A"),
Sample4 = sample(50, 5),
Sample5 = sample(50, 5),
Sample6 = sample(50, 5))
df3 <- data.frame(Symbol = c("A", "B", "C", "D", "E", "F"),
Sample7 = sample(50, 6),
Sample8 = sample(50, 6),
Sample9 = sample(50, 6))
As you can see, each data frame has a column called Symbol
, which contains values that appear in multiple data frames in the list. I'd like to be able to consolidate them so I can select some basic stats (e.g. mean, 25th percentile, etc...). Initially, I simply did this
dfList <- list(df1, df2, df3)
df <- reduce(dfList, merge, "Symbol", all = T)
thinking that I'd just take the mean of each row. However, I didn't know that symbols were appearing multiple times in each data frame. So the above line actually consolidated each occurrence of say A
in the first data frame, with each occurrence of A
in the second and third data frames. This was not what I wanted.
The messy solution is to just get the unique Symbol
identifiers from the entire list, and write an lapply
over them in which I search for each identifier in each data frame, and extract the values. I'll then have a list of vectors, with the names being the Symbol identifiers. Something like this
syms <- unique(as.character(unlist(lapply(dfList, function(x) return(x$Symbol)))))
lst <- lapply(syms, function(x) {
return(unlist(lapply(dfList, function(y) {
return(unlist(y[grep(x, y$Symbol),-1]))
})))
})
names(lst) <- syms
Then I can just take the mean, median, etc...
This is very slow on my actual data frame, and I just think there must be an easier way to do this.
Edit: So in the end, I'd like to have somethign like this:
Mean = t(as.data.frame(lapply(lst, function(x) {
return(mean(x))
})))
Min = t(as.data.frame(lapply(lst, function(x) {
return(min(x))
})))
Quantiles = t(as.data.frame(lapply(lst, function(x) {
return(quantile(x, c(0.25, 0.5, 0.75)))
})))
Max = t(as.data.frame(lapply(lst, function(x) {
return(max(x))
})))
df <- cbind(Mean, Min, Quantiles, Max)
colnames(df) <- c("Mean", "Min", "p25", "p50", "p75", "Max")
This will produce
> df
Mean Min p25 p50 p75 Max
A 27.53333 13 21.50 27.0 32.50 48
B 24.33333 2 11.00 21.0 37.50 48
C 26.50000 3 24.00 26.5 32.75 45
D 25.00000 2 12.25 23.5 38.75 49
E 23.25000 2 12.75 22.0 33.75 45
F 21.00000 11 11.50 12.0 26.00 40