40

Before using R, I used quite a bit of Perl. In Perl, I often used hashes, and hash lookups are generally regarded as fast.

For example, the following code populates a hash with up to 10000 key/value pairs, where the keys are random three-letter strings and the values are random integers. It then does 10000 random lookups in that hash.

#!/usr/bin/perl -w
use strict;

my @letters = ('a'..'z');

print @letters . "\n";   # scalar context: prints the count, 26

my %testHash;

# populate the hash with up to 10000 random three-letter keys
# (duplicate keys overwrite, so the final count may be lower)
for(my $i = 0; $i < 10000; $i++) {
    my $r1 = int(rand(26));
    my $r2 = int(rand(26));
    my $r3 = int(rand(26));
    my $key = $letters[$r1] . $letters[$r2] . $letters[$r3];
    my $value = int(rand(1000));
    $testHash{$key} = $value;
}

my @keyArray = keys(%testHash);
my $keyLen = scalar @keyArray;

# 10000 random lookups against the populated hash
for(my $j = 0; $j < 10000; $j++) {
    my $key = $keyArray[int(rand($keyLen))];
    my $lookupValue = $testHash{$key};
    print "key " .  $key . " Lookup $lookupValue \n";
}

Increasingly, I find myself wanting a hash-like data structure in R. The following is the equivalent R code:

testHash <- list()

for(i in 1:10000) {
  # floor(26*runif(3)) yields 0..25, and letters[0] silently drops a
  # letter, producing short keys; add 1 so indices fall in 1:26
  key <- paste(letters[floor(26*runif(3)) + 1], collapse="")
  value <- floor(1000*runif(1))
  testHash[[key]] <- value
}

keyArray <- names(testHash)
keyLen <- length(keyArray)

for(j in 1:10000) {
  # same +1 guard for the random index
  key <- keyArray[floor(keyLen*runif(1)) + 1]
  lookupValue <- testHash[[key]]
  print(paste("key", key, "Lookup", lookupValue))
}

The two programs seem to be doing equivalent things. However, the Perl one is much faster:

>time ./perlHashTest.pl
real    0m4.346s
user    **0m0.110s**
sys 0m0.100s

Comparing to R:

time R CMD BATCH RHashTest.R

real    0m8.210s
user    **0m7.630s**
sys 0m0.200s

What explains the discrepancy? Are lookups in R lists just not good?

Increasing to a 100,000-element list and 100,000 lookups only exaggerates the discrepancy. Is there a better alternative for a hash data structure in R than the native list()?

stevejb
  • I'm not sure you sound married to the hash structure per se. Perhaps you're interested in a more R-like way of doing things. If so, please add to your question a relatively detailed example of the purpose of your hash table. – John Aug 12 '10 at 18:32
  • As an aside, it's not clear what's slowing things down compared to perl. It could be your string generation (known to be fast in perl). Also, you're doing things un-R like... see my answer. – John Aug 12 '10 at 18:37
  • I think it is the hashing, according to Vince's answer below. My actual code involves iterating over a list, for each item in the list selecting from a database with RMySQL, doing some calculations, then updating the database. I can post a better example later. Thanks. – stevejb Aug 12 '10 at 19:02
  • Also, even if my R code is not written R-like manner (the for loops, especially) a simple for loop in R and Perl should run at about the same speed. I was hoping that my demo would isolate the discrepancy between R's list() and Perl's hash. What I was essentially trying to do was cache some interim results that I would need to use repeatedly. – stevejb Aug 12 '10 at 19:05

7 Answers

37

The underlying reason is that R lists with named elements are not hashed. Hash lookups are O(1) because during insertion the key is converted to an integer using a hash function, and the value is put in slot hash(key) % num_spots of an array num_spots long (this is a big simplification and ignores the complexity of dealing with collisions). Looking up a key just requires hashing it to find the value's position, which is O(1), versus an O(n) linear scan. R lists look elements up by name, which is O(n).
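You can see the difference empirically. Below is a rough sketch (not a careful benchmark; the sizes and key format are arbitrary) comparing repeated character lookups in a named list against a hashed environment:

n <- 100000
keys <- sprintf("k%06d", 1:n)

# named list: each [[ ]] lookup by name scans the names vector
lst <- as.list(1:n)
names(lst) <- keys

# hashed environment: each lookup hashes the key
env <- new.env(hash = TRUE, size = n)
for (k in keys) assign(k, 1L, envir = env)

probe <- sample(keys, 1000)
system.time(for (k in probe) lst[[k]])             # slow: O(n) per lookup
system.time(for (k in probe) get(k, envir = env))  # fast: O(1) per lookup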

As Dirk says, use the hash package. It mimics hash tables by using environments (which are hashed) and overriding the [ methods. But an environment cannot contain another environment, so you cannot have nested hashes with the hash package.

A while back I worked on implementing a pure hash table data structure in C/R that could be nested, but it went on my project back burner while I worked on other things. It would be nice to have though :-)

Vince
  • Hey Vince. Thanks for the pointer and explanation. I will give that hash package a try the next time I need it. The task at hand is done, but I am sure I will use it in the near future. Thanks a lot! – stevejb Aug 12 '10 at 19:07
  • Unfortunately that explanation is a bit too much of a simplification - if you are indexing a vector, R will switch to hashing above a certain size threshold. According to Simon Urbanek "subscript.c@493 has the exact formula". – hadley Aug 13 '10 at 18:20
  • Thanks Hadley! I remember Duncan TL telling me this ages ago, but I completely forgot about it until now. I'll check out the code and try to update my post. I think this is R > 2.9, no? – Vince Aug 14 '10 at 01:50
20

You could try environments and/or the hash package by Christopher Brown (which happens to use environments under the hood).
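For a quick taste of both (a minimal sketch; the hash package must be installed from CRAN, and only its most basic interface is shown):

# plain hashed environment, base R
e <- new.env(hash = TRUE)
assign("abc", 42, envir = e)
get("abc", envir = e)    # 42

# hash package: hash-table semantics on top of an environment
library(hash)
h <- hash()
h[["abc"]] <- 42
h[["abc"]]               # 42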

Dirk Eddelbuettel
  • What advantages does the hash package provide over just using environments? Are there any negative tradeoffs? I've traditionally used environments to give me hashing – geoffjentry Aug 13 '10 at 21:49
  • Christopher had a nice presentation about it at useR -- essentially you get all the usual operators thrown in. – Dirk Eddelbuettel Aug 13 '10 at 23:54
11

Your code is very un-R-like, and that is one of the reasons it's so slow. I haven't optimized the code below for maximum speed, only for R-ness.

n <- 10000

# build n random three-letter keys, one column per key
keys <- matrix( sample(letters, 3*n, replace = TRUE), nrow = 3 )
keys <- apply(keys, 2, paste0, collapse = '')
value <- floor(1000*runif(n))
testHash <- as.list(value)
names(testHash) <- keys

# n random lookups, done in one vectorized subset
keys <- sample(names(testHash), n, replace = TRUE)
lookupValue <- testHash[keys]
print(data.frame('key', keys, 'lookup', unlist(lookupValue)))

On my machine that runs almost instantaneously excluding the printing. Your code ran about the same speed you reported. Is it doing what you want? You could set n to 10 and just look at the output and testHash and see if that's it.

NOTE on syntax: The apply above is simply a loop, and loops are slow in R; the point of the apply family is expressiveness. Many of the commands that follow could have been put inside a loop, and with a for loop that would be the temptation. In R, take as much out of your loop as possible. The apply family makes this more natural because each command is designed to represent the application of one function to a list of some sort, as opposed to a generic loop (yes, I know apply can be used with more than one command). See the small sketch below.
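As a tiny illustration of taking work out of the loop (a sketch reusing n from above):

# n scalar calls: slow
vals1 <- numeric(n)
for (i in 1:n) vals1[i] <- floor(1000 * runif(1))

# one vectorized call each: same distribution, far fewer calls
vals2 <- floor(1000 * runif(n))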

John
10

I'm a bit of an R hack, but I'm an empiricist, so I'll share some things I have observed and let those with a greater theoretical understanding of R shed light on the whys.

  • R seems much slower using standard streams than Perl. Since stdin and stdout are much more commonly used in Perl, I assume it has optimizations around how it does these things. So in R I find it MUCH faster to read/write text using the built-in functions (e.g. write.table).

  • As others have said, vector operations in R are faster than loops... and w.r.t. speed, most apply() family syntax is simply a pretty wrapper on a loop.

  • Indexed things work faster than non-indexed. (Obvious, I know.) The data.table package supports indexing of data frame type objects.

  • I've never used hash environments like @Allan illustrated (and I've never inhaled hash... as far as you know)

  • Some of the syntax you used works, but could be tightened up. I don't think any of this really matters for speed, but the code's a little more readable. I don't write very tight code, but I edited a few things like changing floor(1000*runif(1)) to sample(1:1000, n, replace=T). I don't mean to be pedantic, I just wrote it the way I would do it from scratch.

So with that in mind, I decided to test the hash approach that @Allan used (because it's novel to me) against my "poor man's hash", which I've created using an indexed data.table as a lookup table. I'm not 100% sure that what @Allan and I are doing is exactly what you did in Perl, because my Perl is pretty rusty. But I think the two methods below do the same thing. We both sample the second set of keys from the keys in the 'hash', as this prevents hash misses. You'd want to test how these examples handle hash dupes, as I have not given that much thought.

require(data.table)

dtTest <- function(n) {

  # n random three-letter keys and random integer values
  makeDraw <- function(x) paste(sample(letters, 3, replace=T), collapse="")
  key <- sapply(1:n, makeDraw)
  value <- sample(1:1000, n, replace=T)

  # key='key' indexes the table for fast binary-search lookups
  myDataTable <- data.table(key, value,  key='key')

  newKeys <- sample(as.character(myDataTable$key), n, replace = TRUE)

  # keyed join: one vectorized lookup for all n keys
  lookupValues <- myDataTable[newKeys]

  strings <- paste("key", lookupValues$key, "Lookup", lookupValues$value )
  write.table(strings, file="tmpout", quote=F, row.names=F, col.names=F )
}

#

hashTest <- function(n) {

  # a hashed environment, sized up front
  testHash <- new.env(hash = TRUE, size = n)

  for(i in 1:n) {
    key <- paste(sample(letters, 3, replace = TRUE), collapse = "")
    assign(key, floor(1000*runif(1)), envir = testHash)
  }

  # sample lookup keys from the stored keys, then fetch them all at once
  keys <- sample(ls(envir = testHash), n, replace = TRUE)
  vals <- mget(keys, envir = testHash)

  strings <- paste("key", keys, "Lookup", vals )
  write.table(strings, file="tmpout", quote=F, row.names=F, col.names=F )

  }

If I run each method using 100,000 draws, I get something like this:

> system.time(  dtTest(1e5))
   user  system elapsed 
  2.750   0.030   2.881 
> system.time(hashTest(1e5))
   user  system elapsed 
  3.670   0.030   3.861 

Keep in mind that this is still considerably slower than the Perl code which, on my PC, seems to run 100K samples in well under a second.

I hope the above example helps. And if you have any questions as to why, maybe @Allan, @Vince, and @Dirk will be able to answer ;)

After I typed the above, I realized I had not tested what @John did. So, what the hell, let's do all 3. I changed the code from @John to use write.table() and here's his code:

johnsCode <- function(n){
  # John's approach, with output switched to write.table()
  keys <- sapply(character(n), function(x) paste(letters[ceiling(26*runif(3))],
    collapse=''))
  value <- floor(1000*runif(n))
  testHash <- as.list(value)
  names(testHash) <- keys

  keys <- names(testHash)[ceiling(n*runif(n))]
  lookupValue <- testHash[keys]

  strings <- paste("key", keys, "Lookup", lookupValue )
  write.table(strings, file="tmpout", quote=F, row.names=F, col.names=F )
}

and the run time:

> system.time(johnsCode(1e5))
   user  system elapsed 
  2.440   0.040   2.544 

And there you have it. @John writes tight/fast R code!

JD Long
  • Wow, that is really cool. Thank you for that analysis and input. I worked at a web hosting company during my teenage years, and even after years and years my perl habits stick around. Think in R! Think in R! :) – stevejb Aug 14 '10 at 00:27
  • I can get it under 2s on your machine, maybe even 1.5, if you want :) I only optimized for r-ness. For example, sprintf() could be used similar to paste() and it's quite a bit faster than paste(). Also, explicitly unlisting the keys instead of sapply() is faster. I could go on I think :). – John Aug 26 '10 at 00:37
  • You might want to check my revised code. It's 30%-50% faster than before. – John Dec 19 '15 at 21:04
6

But an environment cannot contain another environment (quoted from Vince's answer).

Maybe it was that way some time ago (I don't know), but this information seems not to be accurate anymore:

> d <- new.env()
> d$x <- new.env()
> d$x$y = 20
> d$x$y
[1] 20

So environments make a pretty capable map/dict now. If you miss the '[' operator, use the hash package in that case.

This note taken from the hash package documentation may also be of interest:

R is slowly moving toward a native implementation of hashes using environments (cf. Extract). Access to environments using $ and [[ has been available for some time, and recently objects can inherit from environments, etc. But many features that make hashes/dictionaries great are still lacking, such as the slice operation, [.
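The missing slice operation can be approximated in base R with mget() (a small sketch; the names are made up):

e <- new.env()
e$a <- 1; e$b <- 2; e$c <- 3
mget(c("a", "b"), envir = e)   # a named list, standing in for e[c("a", "b")]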

memeplex
4

First off, as Vince and Dirk have said, you are not using hashes in your example code. A literal translation of the Perl example would be

#!/usr/bin/Rscript
testHash <- new.env(hash = TRUE, size = 10000L)
for(i in 1:10000) {
  key <- paste(sample(letters, 3, replace = TRUE), collapse = "")
  assign(key, floor(1000*runif(1)), envir = testHash)
}

keyArray <- ls(envir = testHash)
keyLen <- length(keyArray)

for(j in 1:10000) {
  key <- keyArray[sample(keyLen, 1)]
  lookupValue <- get(key, envir = testHash)
  cat(paste("key", key, "Lookup", lookupValue, "\n"))
}

which runs plenty fast on my machine, the main time being the setup. (Try it and post the timings.)

But the real problem, as John said, is that you have to think in vectors in R (like map in Perl), and his solution is probably the best. If you do want to use hashes, consider

keys <- sample(ls(envir = testHash), 10000, replace = TRUE)
vals <- mget(keys, envir = testHash)

after the same setup as above, which is near-instantaneous on my machine. To print them all try

cat(paste(keys, vals), sep="\n")

Hope this helps a little.

Allan

Allan Engelhardt
0

If you are trying to hash 10,000,000+ things in R using the hash package, then building the hash takes a very, very long time. In my case it crashed R, even though the data took less than 1/3 of my memory.

I had much better performance with the data.table package using setkey. If you are not familiar with data.table and setkey, you might start here: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html
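A minimal sketch of the idea (the column names and size here are illustrative, not my actual data):

library(data.table)
dt <- data.table(id = sprintf("k%07d", 1:1e7), value = runif(1e7))
setkey(dt, id)     # builds the sorted index once, up front
dt["k0000042"]     # subsequent lookups are fast binary searches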

I realize the original question referred to 10,000 things, but Google directed me here a couple of days ago. I tried to use the hash package and had a really hard time. Then I found this blog post, which suggests that building the hash can take hours for 10M+ things, and that aligns with my experience:
https://appsilon.com/fast-data-lookups-in-r-dplyr-vs-data-table/?utm_campaign=News&utm_medium=Community&utm_source=DataCamp.com

karl