I'm having some trouble aggregating a data frame while keeping the groups in their original order (order based on first appearance in data frame). I've managed to get it right, but was hoping there is an easier way to go about it.
Here is a sample data set to work on:
set.seed(7)
sel.1 <- sample(1:5, 20, replace = TRUE) # selection vector 1
sel.2 <- sample(1:5, 20, replace = TRUE)
add.1 <- sample(81:100) # additional vector 1
add.2 <- sample(81:100)
orig.df <- data.frame(sel.1, sel.2, add.1, add.2)
Some points to note: there are two selection columns to determine how the data is grouped together. They will be the same, and their names are known. I have only put two additional columns in this data, but there may be more. I have given the columns names starting with 'sel' and 'add' to make it easier to follow, but the actual data has different names (so while grep
tricks are cool, they won't be useful here).
What I'm trying to do is aggregate the data frame into groups based on the 'sel' columns, and to sum together all the 'add' columns. This is simple enough using aggregate
as follows:
# Get the names of all the additional columns
all.add <- names(orig.df)[!(names(orig.df)) %in% c("sel.1", "sel.2")]
aggr.df <- aggregate(orig.df[,all.add],
by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), sum)
The problem is that the result is ordered by the 'sel' columns; I want it ordered based on each group's first appearance in the original data.
Here are my best attempts at making this work:
## Attempt 1
# create indices for each row (x) and find the minimum index for each range
index.df <- aggregate(x = 1:nrow(orig.df),
by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), min)
# Make sure the x vector (indices) are in the right range for aggr.df
index.order <- (1:nrow(index.df))[order(index.df$x)]
aggr.df[index.order,]
## Attempt 2
# get the unique groups. These are in the right order.
unique.sel <- unique(orig.df[,c("sel.1", "sel.2")])
# use sapply to effectively loop over data and sum additional columns.
sums <- t(sapply(1:nrow(unique.sel), function (x) {
sapply(all.add, function (y) {
sum(aggr.df[which(aggr.df$sel.1 == unique.sel$sel.1[x] &
aggr.df$sel.2 == unique.sel$sel.2[x]), y])
})
}))
data.frame(unique.sel, sums)
While these give me the right result, I was hoping that somebody could point out a simpler solution. It would be preferable if the solution works with the packages that come with the standard R installation.
I've looked at the the documentation for aggregate
and match
, but I couldn't find an answer (I guess I was hoping for something like a "keep.original.order" parameter for aggregate
).
Any help would be much appreciated!
Update: (in case anybody stumbles across this)
Here is the cleanest way that I could find after trying for a few more days:
unique(data.frame(sapply(names(orig.df), function(x){
if(x %in% c("sel.1", "sel.2")) orig.df[,x] else
ave(orig.df[,x], orig.df$sel.1, orig.df$sel.2, FUN=sum)},
simplify=FALSE)))