
This may look like a deceptively simple problem, but it takes a very long time to execute. Any ideas for speeding it up, vectorizing it, etc. would be greatly appreciated.

I have an R data frame with 5 million rows and 50 columns: OriginalDataFrame

A list of indices from that frame: IndexList (55,000 [numIndex] unique indices)

It's a time series, so there are ~5 million rows for the 55K unique indices.

OriginalDataFrame has been ordered by dataIndex. Not all of the indices in IndexList are present in OriginalDataFrame. The task is to find the indices that are present and construct a new data frame: FinalDataFrame

Currently I am running this code using library(foreach):

# Note: %dopar% needs a registered parallel backend (e.g. doParallel) to actually run in parallel
FinalDataFrame <- foreach(i = 1:numIndex, .combine = "rbind") %dopar% {
  OriginalDataFrame[OriginalDataFrame$dataIndex == IndexList[i], ]
}

I run this on a machine with 24 cores and 128 GB of RAM, and yet it takes around 6 hours to complete.

Am I doing something exceedingly silly or are there better ways in R to do this?

SGH
  • Are you looking for `OriginalDataFrame[OriginalDataFrame$dataIndex %in% unlist(IndexList), ]`? – Roland Aug 07 '13 at 10:50
  • See http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r or write the parts where you need performance in Rcpp – maximus Aug 07 '13 at 10:56
  • Hi Roland, Thanks for that answer. Your solution was my first attempt without parallelization. It took over 26 hours to complete this operation with that code. After that I used the multicore version. – SGH Aug 07 '13 at 11:19
  • @maximus: Thanks for that suggestion. I am trying to explore vectorized operations in R first without resorting to C++. Unfortunately our R runs on Windows and I cannot use the GPU packages available for the Linux versions of R. – SGH Aug 07 '13 at 11:21

3 Answers


Here's a little benchmark comparing data.table to data.frame. If you know the special data.table invocation for this case, it's about 7x faster, ignoring the cost of setting up the index (which is relatively small and would typically be amortised across multiple calls). If you don't know the special syntax, it's only a little faster. (Note that the problem size is a little smaller than the original, to make it easier to explore.)

library(data.table)
library(microbenchmark)
options(digits = 3)

# Regular data frame
df <- data.frame(id = 1:1e5, x = runif(1e5), y = runif(1e5))

# Data table, with index
dt <- data.table(df)
setkey(dt, "id")

ids <- sample(1e5, 1e4)

microbenchmark(
  df[df$id %in% ids , ], # won't preserve order
  df[match(ids, df$id), ],
  dt[id %in% ids, ],
  dt[match(ids, id), ],
  dt[.(ids)]
)
# Unit: milliseconds
#                     expr   min    lq median    uq   max neval
#     df[df$id %in% ids, ] 13.61 13.99  14.69 17.26 53.81   100
#  df[match(ids, df$id), ] 16.62 17.03  17.36 18.10 21.22   100
#        dt[id %in% ids, ]  7.72  7.99   8.35  9.23 12.18   100
#     dt[match(ids, id), ] 16.44 17.03  17.36 17.77 61.57   100
#               dt[.(ids)]  1.93  2.16   2.27  2.43  5.77   100

I had originally thought you might also be able to do this with rownames, which I thought built up a hash table and did the indexing efficiently. But that's obviously not the case:

df2 <- df
rownames(df2) <- as.character(df$id)
df2[as.character(ids), ]

microbenchmark(
  df[df$id %in% ids , ], # won't preserve order
  df2[as.character(ids), ],
  times = 1
)
# Unit: milliseconds
#                     expr    min     lq median     uq    max neval
#     df[df$id %in% ids, ]   15.3   15.3   15.3   15.3   15.3     1
# df2[as.character(ids), ] 3609.8 3609.8 3609.8 3609.8 3609.8     1
hadley

Check out the data.table package. It works much like a data.frame but is faster.

Like this (where df is your data frame):

table <- data.table(df)

and then use table in place of df. A sketch tailored to this question is below.

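For this question specifically, a minimal sketch might look like the following (it uses the object and column names from the question, and I haven't run it against the real data):

library(data.table)

# Convert once and set a key so lookups use a binary search
# rather than scanning all 5M rows for every index.
DT <- data.table(OriginalDataFrame)
setkey(DT, dataIndex)

# Keep only the rows whose dataIndex appears in IndexList
FinalDataFrame <- DT[.(unlist(IndexList)), nomatch = 0L]
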
bartektartanus
  • I agree. However, you should expand this (e.g., show how to do it with data.table) to make this a proper answer. – Roland Aug 07 '13 at 11:24

If you have 5M rows and you use == to identify rows to subset, then for each pass of your loop you are performing 5M comparisons. If you instead key your data (as it inherently is, being ordered by dataIndex), you can increase efficiency significantly:

library(data.table)
OriginalDT <- as.data.table(OriginalDataFrame)
setkey(OriginalDT, dataIndex)

# Now inside your foreach:
OriginalDT[ .( IndexList[[i]] ) ]

Note that the setkey function uses a very fast implementation of radix sort. However, if your data is already guaranteed to be sorted, @eddi or @arun posted a nice hack that simply sets the sorted attribute on the DT. (I can't find it right now, but perhaps someone can edit this answer and link to it.)

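If I remember the idea correctly, it amounts to something like the sketch below: marking the already-ordered column as the key via setattr so you skip the sort that setkey would perform. Treat the exact incantation as my assumption about the hack, and only use it if dataIndex really is sorted; otherwise keyed joins will return wrong results.

library(data.table)

OriginalDT <- as.data.table(OriginalDataFrame)

# The data is already ordered by dataIndex, so mark the key directly
# instead of paying for setkey()'s sort. Only safe on sorted data.
setattr(OriginalDT, "sorted", "dataIndex")

key(OriginalDT)  # should now report "dataIndex"
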
You might try collecting all the results into a list of data.tables, then using rbindlist, and comparing the speed against .combine=rbind (if you do, please feel free to post benchmark results). I've never tested .combine=rbindlist, but that might work as well and would be interesting to try.

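For what it's worth, a rough sketch of that variant is below. It is untested, and the chunking scheme and the doParallel registration are my assumptions rather than part of the question.

library(data.table)
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)   # hypothetical backend; adjust to your machine

# One chunk of indices per worker instead of one foreach iteration per index;
# each worker returns a data.table, and rbindlist() binds them all at once.
idx    <- unlist(IndexList)
chunks <- split(idx, cut(seq_along(idx), 4, labels = FALSE))

pieces <- foreach(ch = chunks, .packages = "data.table") %dopar% {
  OriginalDT[.(ch), nomatch = 0L]   # keyed lookup; drop indices with no match
}

FinalDataFrame <- rbindlist(pieces)
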
edit:

If the sole task is to index the data.frame, then simply use:

OriginalDT[ .( IndexList ) ]

No foreach necessary, and you still leverage the DT's key.

Ricardo Saporta
  • It doesn't seem like there is a need for `foreach`. It's a misguided attempt to use parallelization to speed up a loop instead of using the vectorized approach. – Roland Aug 07 '13 at 11:27
  • @Roland I presume (perhaps incorrectly??) that there would be some operation after indexing, for exactly the same reason. – Ricardo Saporta Aug 07 '13 at 11:30
  • Thanks for that suggestion Ricardo. Your analysis of the problem is spot on. I didn't go the data.table route yet. I was thinking of using `sqldf` to see if that would make any difference, but your suggestion sounds better. I will definitely post benchmark results when I have them; it might take up to the weekend to get the machine free for benchmarking. – SGH Aug 07 '13 at 11:31
  • @user2660094 no prob. Please see the two edits. Sqldf is a nice option as well – Ricardo Saporta Aug 07 '13 at 11:44