
What I would like to do

I have a data frame with several grouping factors and some other data. I would like to group the rows according to those factors and flag or extract all rows which belong to groups with more than one member.

The problem/question

I was able to come up with a solution (see the example below), but it is not practical due to an inefficiency in interaction(). Even with drop = TRUE, the running time of interaction() increases dramatically as the number of levels grows. Ultimately I would like to process 10-20 factors with up to 50,000 levels on a data.frame with a few hundred thousand rows.

Question 1) What is the most efficient approach to this problem? ("Efficient" measured, in this order, by execution time, memory requirement and readability of code.)

Question 2) What is wrong with interaction()?

The example

# number of rows
nobs <- 100000
# number of levels
nlvl <- 5000

#create two factors with a decent number of levels
fc1 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
fc2 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
#package in a data.frame together with some arbitrary data
wdf <- data.frame(fc1, fc2, vals = sample.int(2, size = nobs, replace = TRUE))
#just for information: number of unique combinations of factors, i.e. groups
ngroups <- nrow(unique(wdf[,1:2]))
print(ngroups)

#granular grouping, tt has nobs elements and ngroups levels
tt <- interaction(wdf[,1:2], drop = TRUE)

#grpidx contains for each row the corresponding group (i.e. level of tt)
#observe that length(grpidx) == nobs and max(grpidx) == ngroups
grpidx <- match(tt, levels(tt))
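# (equivalently: grpidx <- as.integer(tt), since a factor's integer
#  codes are exactly the indices into its levels)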
#split into list of groups (containing row indices)
grplst <- split(seq_along(grpidx), grpidx)
#flag groups with more than one member
flg_dup <- vapply(grplst, FUN = function(x) length(x) > 1, FUN.VALUE = logical(1))
#collect all row indices of groups with more than one member
dupidx <- unlist(grplst[flg_dup])
#select the corresponding rows
nonunqdf <- cbind(grpidx[dupidx], wdf[dupidx,])

Timing of the line `tt <- interaction(wdf[,1:2], drop = TRUE)`:

  • nlvl == 500: 82 milliseconds
  • nlvl == 5000: 28 seconds
  • nlvl == 10000: 233 seconds
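
These timings can be reproduced with a small harness along the following lines (my own sketch; the helper name is just illustrative):

# illustrative helper: time a single interaction() call for nlvl levels
time_interaction <- function(nlvl, nobs = 100000) {
  f1 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
  f2 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
  system.time(interaction(f1, f2, drop = TRUE))[["elapsed"]]
}

time_interaction(500)    # ~0.08 seconds
time_interaction(5000)   # ~28 seconds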
  • `library(data.table); setDT(wdf)[, if (.N > 1) c(g = .GRP, .SD), by=.(fc1, fc2)]` is fairly fast. This is a pretty common question... going to search... – Frank Feb 28 '17 at 16:48
  • Here's one: http://stackoverflow.com/q/32259620/ I don't think it's a duplicate, though, since you're asking about performance here. – Frank Feb 28 '17 at 16:54
  • Even more pertinent: This (stackoverflow.com/q/32259620) is about a *single* factor; my problem is with multiple, joint factors. – g g Feb 28 '17 at 16:58
  • Actually, the single vs multiple factor thing doesn't matter as much as you might think in terms of good approaches to this problem. `duplicated(x, by=c("col1","col2"))`, `ave(x, col1, col2, FUN = length)`, `group_by(col1, col2)` ... – Frank Feb 28 '17 at 17:15
  • I see where you are coming from, but it was killing my approach. And BTW, I tried the magic one-liner above. My conclusion: lightning fast and works as advertised! The only thing left, it seems, is to understand what the line actually means. – g g Feb 28 '17 at 17:19
  • Ok cool, I'll post it and try to explain. – Frank Feb 28 '17 at 17:33

1 Answer


Using data.table (with example size nobs = 1e5; nlvl = 5e3 as in the OP)...

library(data.table)
setDT(wdf) # convert to data.table in place

system.time(
  res <- wdf[, if (.N > 1) c(g = .GRP, .SD), by=.(fc1, fc2)]
)
# 0.04 seconds

DT[i, j, by] means "filter by i, group by by, then do j".

So in this case we are

  1. grouping by fc1, fc2

  2. counting rows in each group, .N

  3. if there is more than one row (.N > 1), returning the group counter .GRP along with that group's subset of the data, .SD (see the toy example below)
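
As a toy illustration (my own small example, separate from the question's data):

library(data.table)
# group "a" has two rows, group "b" has one
dt <- data.table(grp = c("a", "a", "b"), val = 1:3)

dt[, .(n = .N, g = .GRP), by = grp]
#    grp n g
# 1:   a 2 1
# 2:   b 1 2

# .SD holds the current group's columns (here just val), so the
# if (.N > 1) drops single-row groups entirely
dt[, if (.N > 1) c(g = .GRP, .SD), by = grp]
#    grp g val
# 1:   a 1   1
# 2:   a 1   2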

See ?data.table for general coverage of the notation and ?.N regarding the special symbols.

I recommend visiting the website and going through the vignettes to get started with the package.


Alternatives. This way retains the original row ordering:

system.time(res2 <- wdf[, `:=`(g = .GRP, n = .N), by=.(fc1, fc2)][n > 1L])
# 0.06 seconds 
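
Note that `:=` adds g and n to wdf itself by reference; if the helper count column is not wanted in the result, it can be dropped afterwards, again by reference:

res2[, n := NULL]  # remove the helper column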

And this base R way fails:

system.time(res3 <- wdf[ave(vals, fc1, fc2, FUN = length) > 1])
# causes R to freeze while eating all my RAM... 
# probably because of too many factor combos
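
A likely culprit: ave() constructs its grouping with interaction() internally, so it hits exactly the bottleneck described in the question.

For completeness, here is a sketch of the duplicated() route mentioned in the comments (my sketch; not timed at the full problem size):

# flag every row whose (fc1, fc2) combination occurs more than once;
# scanning from both ends marks all members of each duplicated group
dupflag <- duplicated(wdf, by = c("fc1", "fc2")) |
  duplicated(wdf, by = c("fc1", "fc2"), fromLast = TRUE)
res4 <- wdf[dupflag]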