
Assume I have a list of length D containing data.table objects. Each data.table has the same columns (X, Y) and same number of rows N. I'd like to construct another table with N rows, with the individual rows taken from the tables specified by an index vector also of length N. Restated, each row in the final table is taken from one and only one of the tables in the array, with the index of the source table specified by an existing vector.

library(data.table)

N = 100  # rows in each table (actual ~1000000 rows)
D = 4    # number of tables in array (actual ~100 tables)
tableArray = vector("list", D)
for (d in 1:D) {
  tableArray[[d]] = data.table(X=rnorm(N), Y=d)  # actual ~100 columns
}
tableIndexVector = sample.int(D, N, replace=TRUE) # length N of random 1:D
finalTable = copy(tableArray[[1]]) # just for length and column names
for (n in 1:N) {
  finalTable[n] = tableArray[[tableIndexVector[n]]][n]
} 

This seems to work the way I want, but the array-within-array notation is hard to understand, and I presume the performance of the for loop isn't going to be very good. It seems like there should be some elegant way of doing this, but I haven't stumbled across it yet. Is there another way that is efficient and less arcane?

(In case you are wondering, each table in the array represents simulated counterfactual observations for a subject under a particular regime of treatment, and I want to sample from these with different probabilities to test the behavior of different regression approaches with different ratios of regimes observed.)

Nathan Kurz

1 Answer


for loops work just fine with data.table, but we can improve the performance of your specific loop significantly (I believe) using the following approaches.

Approach # 1

  1. Use set instead, as it avoids the [.data.table overhead.
  2. Don't loop over 1:N: you can simplify your loop to run only over the unique values of tableIndexVector and assign all the corresponding rows at once. This should decrease the run time by a factor of at least 10K (N is ~1MM while D is only ~100, so the loop runs at most length(unique(tableIndexVector)) <= D times instead of N times).

So you could basically convert your loop to the following:

for (i in unique(tableIndexVector)) {
  indx <- which(tableIndexVector == i)  # all the final rows sourced from table i
  set(finalTable, i = indx, j = 1:2, value = tableArray[[i]][indx])
}

Approach # 2

Another approach is to use rbindlist to combine all the tables into one big data.table, adding the new idcol parameter in order to identify the different tables within the big table (you will need the devel version of data.table for that). This avoids the loop as requested, but the result will be ordered by the order in which the tables appear:

temp <- rbindlist(tableArray, idcol = "indx")  # stack all tables; indx identifies the source table
indx <- temp[, .I[which(tableIndexVector == indx)], by = indx]$V1  # positions in temp to keep
finalTable <- temp[indx]

Here's a benchmark on a bigger data set:

N = 100000  
D = 10    
tableArray = vector("list", D)
set.seed(123)
for (d in 1:D) {
  tableArray[[d]] = data.table(X=rnorm(N), Y=d)  
}

set.seed(123)
tableIndexVector = sample.int(D, N, replace=TRUE) 
finalTable = copy(tableArray[[1]]) 
finalTable2 = copy(tableArray[[1]])

## Your approach
system.time(for (n in 1:N) {
  finalTable[n] = tableArray[[tableIndexVector[n]]][n]
})
#   user  system elapsed 
# 154.79   33.14  191.57     

## My approach # 1
system.time(for (i in unique(tableIndexVector)) {
  indx <- which(tableIndexVector == i)
  set(finalTable2, i = indx, j = 1:2, value = tableArray[[i]][indx])
})    
# user  system elapsed 
# 0.01    0.00    0.02

## My approach # 2
system.time({
  temp <- rbindlist(tableArray, idcol = "indx")
  indx <- temp[, .I[which(tableIndexVector == indx)], by = indx]$V1
  finalTable3 <- temp[indx]
})    
# user  system elapsed 
# 0.11    0.00    0.11 

identical(finalTable, finalTable2)
## [1] TRUE
identical(setorder(finalTable, X), setorder(finalTable3[, indx := NULL], X))
## [1] TRUE

So, to conclude:

  • My first approach is by far the fastest: it runs about 15K times faster than your original one and returns an identical result.
  • My second approach is still about 1.5K times faster than your original one and avoids the loop (which you don't like for some reason). The result, though, is ordered by table appearance, so the row order isn't identical to yours (see the sketch after this list).
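
As an aside, here's a loop-free sketch that also preserves the original row order (it assumes every table has exactly N rows, as in the examples here): because rbindlist stacks the tables in order, row n of table d lands at position (d - 1) * N + n of temp, so a single vectorized subset does the whole job.

## Loop-free sketch: direct positional indexing into the stacked table
## (assumes each of the D tables has exactly N rows)
finalTable4 <- temp[(tableIndexVector - 1L) * N + seq_len(N)][, indx := NULL]
identical(finalTable2, finalTable4)  # should be TRUE; finalTable2 kept the original row order
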
David Arenburg
  • That's great, thanks! It's certainly much faster than my naive approach, although still a little opaque. It inspired me to try `for (i in 1:N) { set(finalTable, i=i, j=1:C, tableArray[[tableIndexVector[[i]]]][i]) }`, which I think is a touch clearer, but still slow and not actually that clear. I'm still hoping there might be a non-looping version akin to the non-working `finalTable[TRUE, 1:ncol(finalTable) := tableArray[[.I]][.I]]`, although I haven't been able to find it. I'll accept your answer in a couple days if no one stops by with an even better version. – Nathan Kurz Aug 05 '15 at 19:36
  • `for (i in 1:N) { set(finalTable, i=i, j=1:C, tableArray[[tableIndexVector[[i]]]][i]) }` should be very slow too because it will run 1MM iterations, and you shouldn't do this. I offered you a solution that runs 12K times faster than yours, so I can't understand what it is you're struggling with. The only option to avoid a loop is to combine all the data sets together, but that probably won't be faster. There is no problem with running `set` in a loop; it is even documented, btw. – David Arenburg Aug 05 '15 at 19:38
  • See my edit. I've added a non-loop solution, but you will need the development version for that, and the result will also be ordered. – David Arenburg Aug 05 '15 at 20:29
  • Do you know the joke with the punchline "He was wearing a hat"? 15K faster, and you think I'm satisfied? Actually, I ask about avoiding the loop because I'm new to data.table and trying to learn the idioms. Your answers are great, and I've definitely learned from them. My feeling that there might be another way is because if I were writing the loop in C, I could do the assignment in a branchless single pass without memory allocation, and it would probably be at least 2*D faster. Given the other ways data.table is optimized, it seemed possible that I was missing a way to match this. – Nathan Kurz Aug 05 '15 at 23:46
  • `set` within a `for` loop is the idiomatic `data.table` way. See comments from both @Arun and @MattDowle (the developers of the `data.table` package), who both recommend using `set` within a `for` loop under [this answer](http://stackoverflow.com/a/16846530/3001626). Other than that, I'm not going to spend any more time trying to convince you; you can do or think whatever you want, really. – David Arenburg Aug 05 '15 at 23:56
  • Btw, I also suggested a solution without a loop, but whatever. – David Arenburg Aug 06 '15 at 00:03
  • I apologize if I've somehow managed to offend you. I think you might not be reading my responses in the way I intend. I have no philosophical problem with calling set() in a loop. I'm in fact excited to learn about it, since my wrong impression was that set() had been superseded by :=, rather than being a recent optimization. Interestingly, I find your non-loop solution to be about twice as fast as the original for D=10 and N < 20000, slowing down to 5x slower at N=100000000. – Nathan Kurz Aug 06 '15 at 01:02