1

What is the most efficient way to calculate a group index (group identifier) across multiple columns in a data frame or data.table in R?

For example, in the data frame below, there are six unique combinations of columns A and B.

DF <- data.frame(a = rep(1:2,6), b = sort(letters[1:3]))

> DF 
   a b
1  1 a
2  2 b
3  1 c
4  2 a
5  1 b
6  2 c
7  1 a
8  2 b
9  1 c
10 2 a
11 1 b
12 2 c

I'd like to add column 'index' with a group identifier, like the one produced by this (obviously inefficient method for large data frames):

DF$index <- with(DF, as.numeric(factor(paste0(a, b))))

> DF
   a b index
1  1 a     1
2  2 b     5
3  1 c     3
4  2 a     4
5  1 b     2
6  2 c     6
7  1 a     1
8  2 b     5
9  1 c     3
10 2 a     4
11 1 b     2
12 2 c     6

What's the fastest way to do this with very large data frames?

Chris Holbrook
  • 2,531
  • 1
  • 17
  • 30
  • For very large data use `data.table` https://stackoverflow.com/tags/data.table/info – jogo Jun 08 '17 at 17:42

2 Answers2

2

How about this using data.table,

library(data.table)
setDT(df)[,group :=.GRP,by = .(a,b)]

Output:

> df
    a b group
 1: 1 a     1
 2: 2 b     2
 3: 1 c     3
 4: 2 a     4
 5: 1 b     5
 6: 2 c     6
 7: 1 a     1
 8: 2 b     2
 9: 1 c     3
10: 2 a     4
11: 1 b     5
12: 2 c     6
PKumar
  • 10,971
  • 6
  • 37
  • 52
1

In base R, you can use interaction, which is quite a bit faster than the paste method, although the data.table method is faster still.

DF$index <- as.integer(interaction(DF))

This returns the desired result

DF
   a b index
1  1 a     1
2  2 b     4
3  1 c     5
4  2 a     2
5  1 b     3
6  2 c     6
7  1 a     1
8  2 b     4
9  1 c     5
10 2 a     2
11 1 b     3
12 2 c     6

timings

Here are some timings on a larger dataset:

### set up
# 60K observations
DF <- data.frame(a = rep(1:2,60000), b = letters[1:20])
# make a data table copy
library(data.table)
DT <- data.table(DF)

library(microbenchmark)
microbenchmark(paste=with(DF, as.numeric(factor(paste0(a, b)))),
               interaction=as.integer(interaction(DF)),
               grp=DT[,group :=.GRP,by = .(a,b)])
Unit: milliseconds
        expr       min        lq      mean    median        uq      max neval cld
       paste 38.664541 41.100587 46.315671 42.030301 42.903709 91.32412   100   c
 interaction  4.203244  5.788548  9.927459  6.141646  6.943635 55.15564   100  b 
         grp  1.771617  1.897632  2.772984  2.138828  2.218371 49.41399   100 a 
lmo
  • 37,904
  • 9
  • 56
  • 69