One of the problems I often face is needing to look up an arbitrary row from a data.table. I ran into a problem yesterday where I was trying to speed up a loop, and using profvis I found that the lookup from the data.table was the most costly part of the loop. I then decided to try to find the fastest way to do a single-item lookup in R.
The data often takes the form of a data.table with a key column of character type. The remaining columns are typically numeric values. I tried to create a random table with characteristics similar to what I often deal with, which means >100K rows. I compared the native list, the data.table package, and the hash package. The native list and data.table were comparable for individual-item lookup performance; hash appeared to be two orders of magnitude faster. The tests were made up of 10 sets of 10,000 keys randomly sampled to provide variance in access behavior. Each lookup method used the same sets of keys.
Ultimately, my preference would be either to make the row lookup from the data.table faster, instead of having to create a hash table of my data, or to establish that it cannot be done and just use the hash package when I have to do fast lookups. I don't know whether it would be possible, but could you create a hash table of references to the rows in the data.table to allow fast lookup through the hash package? I know that kind of thing is possible in C++, but to my knowledge R does not allow it because it lacks pointers.
To summarize: 1) Am I using data.table correctly for the lookups, and is this therefore the speed I should expect for a single-row lookup? 2) Would it be possible to create a hash of pointers to the data.table rows to allow fast lookup that way?
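To make question 2 concrete: since R has no user-visible pointers, the closest analogue I can picture is an environment mapping each product_id to its row index, then subsetting by integer. This is a hypothetical sketch (the names are mine, and it is not part of the benchmark below); it assumes test_lookup_dt from the code further down:
# Hypothetical sketch: emulate a "hash of row references" with an
# environment of row *indices*, since R has no user-visible pointers.
row_index_env <- list2env(as.list(setNames(
  seq_len(nrow(test_lookup_dt)), test_lookup_dt$product_id)))
# Lookup: hash access to the row number, then an integer subset,
# which skips the keyed-join machinery of dt[character_key].
key <- test_lookup_dt$product_id[12345]   # any key
row <- test_lookup_dt[row_index_env[[key]]]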
Test System:
Windows 10 Pro x64
R 3.2.2
data.table 1.9.6
hash 2.2.6
Intel Core i7-5600U with 16 GB RAM
Code:
library(microbenchmarkCore) # install.packages("microbenchmarkCore", repos="http://olafmersmann.github.io/drat")
library(data.table)
library(hash)
# Set seed to 42 to ensure repeatability
set.seed(42)
# Setting up test ------
# Generate product ids
product_ids <- as.vector(
  outer(LETTERS[seq(1, 26, 1)],
        outer(outer(LETTERS[seq(1, 26, 1)], LETTERS[seq(1, 26, 1)], paste, sep = ""),
              LETTERS[seq(1, 26, 1)], paste, sep = ""),
        paste, sep = "")
)
# Create test lookup data
test_lookup_list <- lapply(product_ids, function(id) {
  return_list <- list(
    product_id = id,
    val_1 = rnorm(1),
    val_2 = rnorm(1),
    val_3 = rnorm(1),
    val_4 = rnorm(1),
    val_5 = rnorm(1),
    val_6 = rnorm(1),
    val_7 = rnorm(1),
    val_8 = rnorm(1)
  )
  return(return_list)
})
# Set names of items in list
names(test_lookup_list) <- sapply(test_lookup_list, function(elem) elem[['product_id']])
# Create lookup hash
lookup_hash <- hash(names(test_lookup_list), test_lookup_list)
# Create data.table from list and set key of data.table to product_id field
test_lookup_dt <- rbindlist(test_lookup_list)
setkey(test_lookup_dt, product_id)
# Create lookup environment (per Neal Fultz's suggestion; see update below)
test_lookup_env <- list2env(test_lookup_list)
# Generate sample of keys to be used for speed testing
lookup_tests <- lapply(1:10, function(x) {
  lookups <- sample(test_lookup_dt$product_id, 10000)
  return(lookups)
})
# Native list timing
native_list_timings <- sapply(lookup_tests, function(lookups) {
  timing <- system.nanotime(
    for (lookup in lookups) {
      return_value <- test_lookup_list[[lookup]]
    }
  )
  return(timing[['elapsed']])
})
# Data.table timing
datatable_timings <- sapply(lookup_tests, function(lookups) {
  timing <- system.nanotime(
    for (lookup in lookups) {
      return_value <- test_lookup_dt[lookup]
    }
  )
  return(timing[['elapsed']])
})
# Hashtable timing
hashtable_timings <- sapply(lookup_tests, function(lookups) {
  timing <- system.nanotime(
    for (lookup in lookups) {
      return_value <- lookup_hash[[lookup]]
    }
  )
  return(timing[['elapsed']])
})
# Environment timing
environment_timings <- sapply(lookup_tests, function(lookups) {
  timing <- system.nanotime(
    for (lookup in lookups) {
      return_value <- test_lookup_env[[lookup]]
    }
  )
  return(timing[['elapsed']])
})
# Summary of timing results
summary(native_list_timings)
summary(datatable_timings)
summary(hashtable_timings)
summary(environment_timings)
These were the results:
> # Summary of timing results
> summary(native_list_timings)
Min. 1st Qu. Median Mean 3rd Qu. Max.
35.12 36.20 37.28 37.05 37.71 39.24
> summary(datatable_timings)
Min. 1st Qu. Median Mean 3rd Qu. Max.
49.13 51.51 52.64 52.76 54.39 55.13
> summary(hashtable_timings)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1588 0.1857 0.2107 0.2213 0.2409 0.3258
> summary(environment_timings)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09322 0.09524 0.10680 0.11850 0.13760 0.17140
It appears that the hash lookup is approximately two orders of magnitude faster than either the native list or the data.table in this particular scenario.
Update: 2015-12-11 3:00 PM PST
I received feedback from Neal Fultz suggesting the use of R's native environment object. Here is the code and the result I got:
test_lookup_env <- list2env(test_lookup_list)
# Environment timing
environment_timings <- sapply(lookup_tests, function(lookups) {
  timing <- system.nanotime(
    for (lookup in lookups) {
      return_value <- test_lookup_env[[lookup]]
    }
  )
  return(timing[['elapsed']])
})
summary(environment_timings)
> summary(environment_timings)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09322 0.09524 0.10680 0.11850 0.13760 0.17140
Indeed, it appears that an environment is faster for individual item access in this scenario. Thank you, Neal Fultz, for pointing this method out; I appreciate having a more thorough understanding of the object types available in R. My questions still stand: am I using data.table correctly (I expect so, but I am open to critique), and is there a way to provide access to individual rows of a data.table using some kind of pointer magic, which would give faster single-row access?
Clarification: 2015-12-11 3:52 PM PST
There have been some mentions that my access pattern in the inner-most loop of my test is inefficient. I agree. What I am trying to do is emulate as closely as possible the situation I am dealing with. The loop this actually occurs in does not allow for vectorization, which is why I am not using it; I realize this is not strictly the 'R' way of doing things. The data.table in my code provides reference information, and I do not necessarily know which row I need until I am inside the loop, which is why I am trying to figure out how to access an individual item as quickly as possible, preferably with the data still stored in a data.table. A rough sketch of the loop shape follows. This is also in part a curiosity question: can it be done?
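For illustration, the loop has roughly this shape (a hypothetical sketch; the key-selection logic is a trivial stand-in for my actual code):
# Hypothetical sketch of the real access pattern: each iteration's key
# depends on the previous iteration's result, so the lookups cannot be
# batched into a single vectorized join such as dt[.(keys)].
compute_next_key <- function(row) {
  # stand-in: in reality the next key is derived from the current row
  sample(test_lookup_dt$product_id, 1)
}
row <- test_lookup_dt[compute_next_key(NULL)]
for (step in 1:100) {
  key <- compute_next_key(row)    # key only known inside the loop
  row <- test_lookup_dt[key]      # the costly single-row lookup
}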
Update 2: 2015-12-11 4:12 PM PST
I received feedback from @jangorecki that using Sys.time() is an ineffective means of measuring the performance of a function. I have since revised the code to use system.nanotime() per the suggestion; the original code and timing results above have been updated accordingly.
The question still stands: is this the fastest way to do a row lookup of a data.table, and if so, is it possible to create a hash of pointers to the rows for quick lookup? At this point I am mostly curious about how far R can be pushed. As someone who came from C++, this is a fun challenge.
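In that spirit, one more experiment I may try (again a hypothetical sketch, not benchmarked here) is combining the row-index environment from the sketch near the top with base R's .subset2(), which extracts a column from a list-like object without S3 dispatch, so reading a single value reduces to an environment lookup plus plain vector indexing:
# Hypothetical sketch: bypass `[.data.table` dispatch entirely.
# Assumes row_index_env (product_id -> row number) from the earlier sketch.
fast_val_1 <- function(key) {
  idx <- row_index_env[[key]]               # hash lookup of the row number
  .subset2(test_lookup_dt, "val_1")[idx]    # raw column access by index
}
fast_val_1(test_lookup_dt$product_id[42])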
Conclusion
I accepted the answer provided by Neal Fultz because it discussed what I actually wanted to know. That said, this is not the way data.table was intended to be used, so no one should interpret this to mean that data.table is slow; it is actually incredibly fast. This was a very particular use case that I was curious about. My data comes in as a data.table, and I wanted to know whether I could get quick row access while leaving it as a data.table. I also wanted to compare data.table access speed with a hash table, which is what is often used for fast, non-vectorized item lookup.