I have 2 questions:
- Is hash faster than data.table for Big Data?
- How can I deal with multiple values per key, if I want to use a hash-based approach?
I looked at the vignettes of the relevant packages and Googled some potential solutions, but I'm still not sure about the answers to the questions above.
Considering the following post,
R fast single item lookup from list vs data.table vs hash
it seems that a single item lookup in a data.table object is actually quite slow, even slower than in a base R list. However, a lookup in a hash object from the hash package is very fast according to that benchmark. Is that accurate?
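To check that claim on my own machine, I sketched a minimal benchmark along these lines (assuming the hash, data.table, and microbenchmark packages are installed; the key names and table size are made up):

```r
library(hash)
library(data.table)
library(microbenchmark)

n    <- 1e5
keys <- sprintf("key%06d", seq_len(n))   # made-up key names
vals <- seq_len(n)

lst <- as.list(vals)                     # base R named list
names(lst) <- keys
h  <- hash(keys, vals)                   # hash object
dt <- data.table(k = keys, v = vals, key = "k")  # keyed data.table

microbenchmark(
  base_list  = lst[["key050000"]],
  hash_pkg   = h[["key050000"]],
  data_table = dt["key050000", v],
  times      = 100
)
```

The exact timings obviously depend on the machine and on n, so this is only a rough way to reproduce the comparison from the linked post.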
However, it looks like a hash object only handles unique keys: in the following, only 2 (key, value) pairs are created, and the second value for "A" overwrites the first.
> library(hash)
> h <- hash(c("A","B","A"),c(1,2,3))
> h
<hash> containing 2 key-value pair(s).
A : 3
B : 2
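One workaround I'm considering, though I'm not sure it's idiomatic for hash, is to group the values by key first with split(), so that each hash entry holds a vector of all the values for that key:

```r
library(hash)

k <- c("A", "B", "A")
v <- c(1, 2, 3)

# Group the values by key, so each key maps to a vector of its values
grouped <- split(v, k)        # list(A = c(1, 3), B = 2)

h <- hash()
for (key in names(grouped)) {
  h[[key]] <- grouped[[key]]  # one entry per key, holding all its values
}

h[["A"]]  # 1 3
h[["B"]]  # 2
```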
So, if I have a table of (key, value) pairs where a key can have several values, and I want a (quick) lookup of all the values for a given key, what is the best object/data structure in R for that? Can we still use a hash object, or is data.table the most appropriate in this case?
Let's assume we are dealing with very large tables; otherwise this discussion is irrelevant.
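For comparison, data.table seems to handle duplicate keys natively: after setkey(), a keyed lookup returns all matching rows (a sketch with made-up column names):

```r
library(data.table)

dt <- data.table(k = c("A", "B", "A"), v = c(1, 2, 3))
setkey(dt, k)   # sort and index by 'k' so lookups use binary search

dt["A", v]      # all values for key "A": 1 and 3
dt["A"]         # the full matching rows
```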
Related link: http://www.r-bloggers.com/hash-table-performance-in-r-part-i/