What I would like to do
I have a data frame with several grouping factors and some other data. I would like to group the rows according to those factors and flag or extract all rows which belong to groups with more than one member.
The problem/question
I was able to come up with a solution (see example below) but the solution is not practical due to an inefficiency of interaction()
. Even though drop = TRUE
the running time of interaction()
increases dramatically when the number of levels increases. Ultimately I would like to process 10 - 20 factors with up to 50'000 levels on a data.frame with a few hundred thousand rows.
Questions: 1) What is the most efficient approach to this problem? ("Efficient" measured in this order by execution time, memory requirement and readability of code)
Question 2) What is wrong with interaction()
?
The example
# number of rows
nobs <- 100000
# number of levels
nlvl <- 5000
#create two factors with a decent number of levels
fc1 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
fc2 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
#package in a data.frame together with some arbitrary data
wdf <- data.frame(fc1, fc2, vals = sample.int(2, size = nobs, replace = TRUE))
#just for information: number of unique combinations of factors, i.e. groups
ngroups <- nrow(unique(wdf[,1:2]))
print(ngroups)
#granular grouping, tt has nobs elements and ngroups levels
tt <- interaction(wdf[,1:2], drop = TRUE)
#grpidx contains for each row the corresponding group (i.e. level of tt)
#observe that length(grpidx) == nobs and max(grpidx) == ngroups
grpidx <- match(tt, levels(tt))
#split into list of groups (containing row indices)
grplst <- split(seq_along(grpidx), grpidx)
#flag groups with more than one member
flg_dup <- vapply(grplst, FUN = function(x)length(x)>1, FUN.VALUE = TRUE)
#collect all row indices of groups with more than one member
dupidx <- unlist(grplst[flg_dup])
#select the corresponding rows
nonunqdf <- cbind(grpidx[dupidx], wdf[dupidx,])
Timing of the line tt <- interaction(wdf[,1:2], drop = TRUE)
- nlvl == 500: 82 milliseconds
- nlvl == 5000: 28 seconds
- nlvl == 10000: 233 seconds