
I want to "loop through" the rows of a data.table and calculate an average for each row. The average should be calculated based on the following mechanism:

  1. Look up the identifier ID in row i (ID(i))
  2. Look up the value of T2 in row i (T2(i))
  3. Calculate the average over the Data1 values in all rows j, which meet these two criteria: ID(j) = ID(i) and T1(j) = T2(i)
  4. Enter the calculated average in the column Data2 of row i

     DF = data.frame(ID=rep(c("a","b"),each=6), 
                 T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
     DT = data.table(DF)
     DT[ , Data2:=NA_real_]
         ID T1 T2  Data1 Data2
    [1,]  a  1  1     1    NA
    [2,]  a  1  2     2    NA
    [3,]  a  1  3     3    NA
    [4,]  a  2  1     4    NA
    [5,]  a  2  2     5    NA
    [6,]  a  2  3     6    NA
    [7,]  b  1  1     7    NA
    [8,]  b  1  2     8    NA
    [9,]  b  1  3     9    NA
    [10,] b  2  1    10    NA
    [11,] b  2  2    11    NA
    [12,] b  2  3    12    NA
    

For this simple example the result should look like this:

      ID T1 T2  Data1 Data2
[1,]  a  1  1     1    2
[2,]  a  1  2     2    5
[3,]  a  1  3     3    NA
[4,]  a  2  1     4    2
[5,]  a  2  2     5    5
[6,]  a  2  3     6    NA
[7,]  b  1  1     7    8
[8,]  b  1  2     8    11
[9,]  b  1  3     9    NA
[10,] b  2  1    10    8
[11,] b  2  2    11    11
[12,] b  2  3    12    NA

I think one way of doing this would be to loop through the rows, but I suspect that would be inefficient. I've had a look at the apply() function, but I'm not sure whether it would solve my problem. I could also use data.frame instead of data.table if that would make it much more efficient or much easier. The real dataset contains approximately 1 million rows.
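For concreteness, the literal row loop I have in mind looks like this (a sketch using the example DF from above; it gives the right answer but scans the whole table once per row, so it is presumably far too slow for 1 million rows):

```r
# Example data, as above
DF <- data.frame(ID = rep(c("a","b"), each = 6),
                 T1 = rep(1:2, each = 3), T2 = c(1,2,3), Data1 = 1:12)

DF$Data2 <- NA_real_
for (i in seq_len(nrow(DF))) {
  # Steps 1-3: rows j with ID(j) == ID(i) and T1(j) == T2(i)
  j <- DF$ID == DF$ID[i] & DF$T1 == DF$T2[i]
  # Step 4: store the mean; leave NA when no row matches
  if (any(j)) DF$Data2[i] <- mean(DF$Data1[j])
}
```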

Cake
    Your written specification appears difficult to operationalize, but your example suggests that within each ID group you want the means of some group of values for which T2 is in the range of values of T1. But even that interpretation falls apart when trying to figure out why Data2 in the second row should be 5. – IRTFM Mar 23 '12 at 17:48
  • @DWin that's because the averaging is done on the `Data1` column. `Data2[2]` equals 5 because 5 is the average for `(4, 5, 6)`. – ulidtko Mar 23 '12 at 17:57

3 Answers


The rule of thumb is to aggregate first, and then join to that.

agg = DT[,mean(Data1),by=list(ID,T1)]
setkey(agg,ID,T1)
DT[,Data2:={JT=J(ID,T2);agg[JT,V1][[3]]}]
      ID T1 T2 Data1 Data2
 [1,]  a  1  1     1     2
 [2,]  a  1  2     2     5
 [3,]  a  1  3     3    NA
 [4,]  a  2  1     4     2
 [5,]  a  2  2     5     5
 [6,]  a  2  3     6    NA
 [7,]  b  1  1     7     8
 [8,]  b  1  2     8    11
 [9,]  b  1  3     9    NA
[10,]  b  2  1    10     8
[11,]  b  2  2    11    11
[12,]  b  2  3    12    NA

As you can see it's a bit ugly in this case (but will be fast). It's planned to add `drop` which will avoid the `[[3]]` bit, and maybe we could provide a way to tell `[.data.table` to evaluate `i` in the calling scope (i.e. no self join), which would avoid the `JT=` bit that is needed here because `ID` is in both `agg` and `DT`.

`keyby` has been added in v1.8.0 on R-Forge, which avoids the need for the `setkey`, too.
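In data.table versions released well after this answer (the `on=` argument, 1.9.6+), the same aggregate-then-join idea can be written as an update join with no `setkey`, no `[[3]]`, and no `JT=`; a sketch assuming a current data.table:

```r
library(data.table)

DT <- data.table(ID = rep(c("a","b"), each = 6),
                 T1 = rep(1:2, each = 3), T2 = c(1,2,3), Data1 = 1:12)

# Aggregate first: one mean per (ID, T1) group, with a custom column name
agg <- DT[, .(avg = mean(Data1)), keyby = .(ID, T1)]

# Then update-join: match DT's (ID, T2) against agg's (ID, T1);
# unmatched rows of DT are left as NA in the new Data2 column
DT[agg, on = .(ID, T2 == T1), Data2 := i.avg]
```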

Matt Dowle
  • Thank you Matthew. This is incredibly fast. Is there a possibility to give the `V1` column in `agg` a custom name right when creating it to avoid any confusion about column names? – Cake Mar 23 '12 at 20:24
  • 1
    Try `DT[,list(myname=mean(Data1)),by=list(ID,T1)]`. Also see [data.table wiki](http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table) point 3, for a further speedup in this case. – Matt Dowle Mar 23 '12 at 21:01
  • I replaced your third line with `DT[,Data2:={agg[J(ID, T2)][[3]]}]`, and get the same results. i.e, I **did** avoid the `JT=` bit (as well at the `,V1`). Are either those bad practice on my part? – Josh O'Brien Mar 23 '12 at 22:49
  • @Josh Hi. I tried that line but it doesn't seem to be the same result. The `ID` inside the `J()` is from `agg`, recycled to match the length of `T2` (from `DT` since `T2` isn't in `agg`) so it mixes up the `a`'s and `b`'s. But it's ok to avoid the `V1` in this case, maybe even more efficient. – Matt Dowle Mar 23 '12 at 23:40

A somewhat faster alternative to an explicit loop over the rows is a solution that employs vectorization.

R> d <- data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12)) 
R> d
   ID T1 T2 Data1
1   a  1  1     1
2   a  1  2     2
3   a  1  3     3
4   a  2  1     4
5   a  2  2     5
6   a  2  3     6
7   b  1  1     7
8   b  1  2     8
9   b  1  3     9
10  b  2  1    10
11  b  2  2    11
12  b  2  3    12

R> rowfunction <- function(i) with(d, mean(Data1[which(T1==T2[i] & ID==ID[i])]))
R> d$Data2 <- sapply(1:nrow(d), rowfunction)
R> d
   ID T1 T2 Data1 Data2
1   a  1  1     1     2
2   a  1  2     2     5
3   a  1  3     3   NaN
4   a  2  1     4     2
5   a  2  2     5     5
6   a  2  3     6   NaN
7   b  1  1     7     8
8   b  1  2     8    11
9   b  1  3     9   NaN
10  b  2  1    10     8
11  b  2  2    11    11
12  b  2  3    12   NaN
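Note the NaN entries here where the data.table answer shows NA: `mean()` of a zero-length vector returns NaN, while an unmatched join row yields NA. Both print as missing and both pass `is.na()`, but they are distinct values:

```r
m <- mean(numeric(0))  # what happens when no row satisfies T1 == T2[i] & ID == ID[i]
is.nan(m)              # TRUE: mean of an empty vector is NaN
is.na(m)               # TRUE: NaN also counts as missing
is.nan(NA_real_)       # FALSE: a plain NA is not NaN
```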

Also, I'd prefer to preprocess the data before getting it into R. That is, if you are retrieving the data from an SQL server, it might be a better choice to let the server calculate the averages, as it will very likely do a better job of this.

R is actually not very good at number crunching, for several reasons. But it's excellent when doing statistics on the already-preprocessed data.

ulidtko

Using tapply and part of another recent post:

DF = data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))

EDIT: Actually, most of the original function is redundant and was intended for something else. Here, simplified:

ansMat <- tapply(DF$Data1, DF[, c("ID", "T1")], mean)

i <- cbind(match(DF$ID, rownames(ansMat)), match(DF$T2, colnames(ansMat)))

DF<-cbind(DF,Data2 = ansMat[i])


# ansMat<-tapply(seq_len(nrow(DF)), DF[, c("ID", "T1")], function(x) {
#   curSub <- DF[x, ]
#   myIndex <- which(DF$T2 == curSub$T1 & DF$ID == curSub$ID)
#   meanData1 <- mean(curSub$Data1)
#   return(meanData1 = meanData1)
# })

The trick was doing `tapply` over `ID` and `T1` instead of `ID` and `T2`. Anything speedier?
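The `cbind(match(...), match(...))` step relies on R's matrix indexing: subscripting a matrix with a two-column integer matrix extracts one element per (row, column) pair, and an NA in either column yields NA. A minimal illustration (toy matrix, not the example data):

```r
# 2x3 matrix with dimnames, filled column-wise: m["a","3"] is 5, m["b","1"] is 2
m <- matrix(1:6, nrow = 2, dimnames = list(c("a","b"), c("1","2","3")))

idx <- cbind(c(1, 2), c(3, 1))  # pairs (row 1, col 3) and (row 2, col 1)
m[idx]                          # picks m[1,3] and m[2,1]: 5 2
```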

BenBarnes