
What I would like to do

I have a data frame with several grouping factors and some other data. I would like to group the rows according to those factors and flag or extract all rows which belong to groups with more than one member.

The problem/question

I was able to come up with a solution (see the example below), but it is not practical due to an inefficiency in interaction(). Even with drop = TRUE, the running time of interaction() increases dramatically as the number of levels grows. Ultimately I would like to process 10-20 factors with up to 50,000 levels on a data.frame with a few hundred thousand rows.

Question 1) What is the most efficient approach to this problem? ("Efficient" measured, in this order, by execution time, memory requirement and readability of code.)

Question 2) What is wrong with interaction()?

The example

# number of rows
nobs <- 100000
# number of levels
nlvl <- 5000

#create two factors with a decent number of levels
fc1 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
fc2 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
#package in a data.frame together with some arbitrary data
wdf <- data.frame(fc1, fc2, vals = sample.int(2, size = nobs, replace = TRUE))
#just for information: number of unique combinations of factors, i.e. groups
ngroups <- nrow(unique(wdf[,1:2]))
print(ngroups)

#granular grouping, tt has nobs elements and ngroups levels
tt <- interaction(wdf[,1:2], drop = TRUE)

#grpidx contains for each row the corresponding group (i.e. level of tt)
#observe that length(grpidx) == nobs and max(grpidx) == ngroups
grpidx <- match(tt, levels(tt))
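# (equivalently: grpidx <- as.integer(tt), since a factor's integer
#  codes are exactly the indices into its levels)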
#split into list of groups (containing row indices)
grplst <- split(seq_along(grpidx), grpidx)
#flag groups with more than one member
flg_dup <- vapply(grplst, FUN = function(x) length(x) > 1, FUN.VALUE = logical(1))
#collect all row indices of groups with more than one member
dupidx <- unlist(grplst[flg_dup])
#select the corresponding rows
nonunqdf <- cbind(grpidx[dupidx], wdf[dupidx,])

Timing of the line `tt <- interaction(wdf[,1:2], drop = TRUE)`:

  • nlvl == 500: 82 milliseconds
  • nlvl == 5000: 28 seconds
  • nlvl == 10000: 233 seconds
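
These timings can be reproduced with a small harness along the following lines (my own sketch; the helper name is just illustrative):

# illustrative helper: time a single interaction() call for nlvl levels
time_interaction <- function(nlvl, nobs = 100000) {
  f1 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
  f2 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
  system.time(interaction(f1, f2, drop = TRUE))[["elapsed"]]
}

time_interaction(500)    # ~0.08 seconds
time_interaction(5000)   # ~28 seconds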
  • `library(data.table); setDT(wdf)[, if (.N > 1) c(g = .GRP, .SD), by=.(fc1, fc2)]` is fairly fast. This is a pretty common question... going to search... – Frank Feb 28 '17 at 16:48
  • Here's one: http://stackoverflow.com/q/32259620/ I don't think it's a duplicate, though, since you're asking about performance here. – Frank Feb 28 '17 at 16:54
  • Even more pertinent: This (stackoverflow.com/q/32259620) is about a *single* factor; my problem is with multiple, joint factors. – g g Feb 28 '17 at 16:58
  • Actually, the single vs multiple factor thing doesn't matter as much as you might think in terms of good approaches to this problem. `duplicated(x, by=c("col1","col2"))`, `ave(x, col1, col2, FUN = length)`, `group_by(col1, col2)` ... – Frank Feb 28 '17 at 17:15
  • I see where you are coming from, but it was killing my approach. And BTW, I tried the magic one-liner above. My conclusion: lightning fast and works as advertised! The only thing left, it seems, is to understand what the line actually means. – g g Feb 28 '17 at 17:19
  • Ok cool, I'll post it and try to explain. – Frank Feb 28 '17 at 17:33

1 Answer


Using data.table (with example size nobs = 1e5; nlvl = 5e3 as in the OP)...

library(data.table)
setDT(wdf) # convert to data.table in place

system.time(
  res <- wdf[, if (.N > 1) c(g = .GRP, .SD), by=.(fc1, fc2)]
)
# 0.04 seconds

DT[i, j, by] means "filter by i, group by by, then do j".

So in this case we are

  1. grouping by fc1, fc2

  2. counting rows in each group, .N

  3. if there is more than one row (.N > 1), returning the group counter .GRP along with that group's subset of the data, .SD (see the toy example below)
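
As a toy illustration (my own small example, separate from the question's data):

library(data.table)
# group "a" has two rows, group "b" has one
dt <- data.table(grp = c("a", "a", "b"), val = 1:3)

dt[, .(n = .N, g = .GRP), by = grp]
#    grp n g
# 1:   a 2 1
# 2:   b 1 2

# .SD holds the current group's columns (here just val), so the
# if (.N > 1) drops single-row groups entirely
dt[, if (.N > 1) c(g = .GRP, .SD), by = grp]
#    grp g val
# 1:   a 1   1
# 2:   a 1   2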

See ?data.table for general coverage of the notation and ?.N regarding the special symbols.

I recommend visiting the website and going through the vignettes to get started with the package.


Alternatives. This way retains the original row ordering:

system.time(res2 <- wdf[, `:=`(g = .GRP, n = .N), by=.(fc1, fc2)][n > 1L])
# 0.06 seconds 
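
Note that `:=` adds g and n to wdf itself by reference; if the helper count column is not wanted in the result, it can be dropped afterwards, again by reference:

res2[, n := NULL]  # remove the helper column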

And this base R way fails:

system.time(res3 <- wdf[ave(vals, fc1, fc2, FUN = length) > 1])
# causes R to freeze while eating all my RAM... 
# probably because of too many factor combos
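
A likely culprit: ave() constructs its grouping with interaction() internally, so it hits exactly the bottleneck described in the question.

For completeness, here is a sketch of the duplicated() route mentioned in the comments (my sketch; not timed at the full problem size):

# flag every row whose (fc1, fc2) combination occurs more than once;
# scanning from both ends marks all members of each duplicated group
dupflag <- duplicated(wdf, by = c("fc1", "fc2")) |
  duplicated(wdf, by = c("fc1", "fc2"), fromLast = TRUE)
res4 <- wdf[dupflag]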