
I have tried desperately to find the answer via Google and failed. I am about to run the benchmark myself, but thought that maybe someone here knows the answer, or at least a reference where this is documented.

To expand on my question: suppose I have a list L in R of length N, where N is rather large (say, 10,000, 100,000, 1 million or more). Assume my list has names for every element.

I wonder how long it takes to retrieve a single named entry, i.e., to do

 L[[ "any_random_name" ]]  

Is this time O(N), i.e. proportional to the length of the list; O(1), i.e. constant, independent of the length of the list; or perhaps O(log N)?

Mateo
  • This strongly suggests that it isn't `O(1)`: https://www.r-bloggers.com/hash-table-performance-in-r-part-i/ Also, see this question: http://stackoverflow.com/q/3470447/4996248 (The first link shows that it is `O(n)`; they are timing the lookup of *all* keys, which is quadratic overall, although they mistakenly call it exponential.) – John Coleman Dec 27 '16 at 23:29
  • For a single value lookup (e.g. `"any_random_name"`), the relevant part of [`do_subset2_dflt`](https://github.com/wch/r-source/blob/d8f952565bb9c48bd524c368f3e4ac0d3de096b0/src/main/subset.c#L1018-L1030) should be the call to [`get1index`](https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/main/subscript.c#L224-L233), which appears to be linear. – nrussell Dec 27 '16 at 23:51
  • A [quick benchmark](https://gist.github.com/nathan-russell/d09e220899115d85b10c0959188a287b) seems to confirm this. – nrussell Dec 27 '16 at 23:57
  • @nrussell The link to the source is good, as well as the benchmarking. Perhaps you could post this as an answer. I'd upvote it for sure. – John Coleman Dec 28 '16 at 16:18

2 Answers


The worst case for name lookup is O(n). Take a look here: https://www.refsmmat.com/posts/2016-09-12-r-lists.html .
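To illustrate the linked post's point, here is a minimal sketch (not from the post itself; the element names and sizes are my own): a plain named list is scanned linearly by `[[`, whereas base R's hashed environments (`new.env(hash = TRUE)`) offer roughly constant-time lookup by name.

```r
# Sketch: linear name scan in a list vs. hashed lookup in an environment.
n <- 10^5
L <- as.list(runif(n))
names(L) <- paste0("Element", seq_len(n))

# Copy the list into a hashed environment.
E <- list2env(L, envir = new.env(hash = TRUE, size = n))

# Both return the same value; only the lookup mechanism differs.
stopifnot(identical(L[["Element100000"]], E[["Element100000"]]))

# The list lookup scans up to n names; the environment hashes the name.
system.time(for (i in 1:200) L[["Element100000"]])
system.time(for (i in 1:200) E[["Element100000"]])
```

No packages are needed for this; `new.env` and `list2env` are base R, which is why environments are the usual workaround when you need many lookups into a large named collection.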

jidicula

Interestingly, the answer turns out to be both O(1) and O(n): the timing depends not so much on the length of the list as on the position of the named element within it.

Let's define some lists of different lengths:

library(microbenchmark)

# Build a named list of n random values: Element1, Element2, ...
makelist <- function(n) {
   L <- as.list(runif(n))
   names(L) <- paste0("Element", 1:n)
   return(L)
}

L100 <- makelist(100)
L1000 <- makelist(1000)
LMillion <- makelist(10^6)
L10Million <- makelist(10^7)

If we try to extract the element named Element100 out of each of these - the 100th element - it takes essentially the same amount of time:

microbenchmark(
   L10Million[["Element100"]],
   LMillion[["Element100"]],
   L1000[["Element100"]],
   L100[["Element100"]]
)

Unit: nanoseconds
                       expr min  lq    mean median   uq   max neval cld
 L10Million[["Element100"]] 791 791  996.59    792 1186  2766   100   a
   LMillion[["Element100"]] 791 791 1000.56    989 1186  1976   100   a
      L1000[["Element100"]] 791 791 1474.64    792 1186 28050   100   a
       L100[["Element100"]] 791 791 1352.21    792 1186 17779   100   a

But if we try to get the last element, the time required is O(n):

microbenchmark(
   L10Million[["Element10000000"]],
   LMillion[["Element1000000"]],
   L1000[["Element1000"]],
   L100[["Element100"]]
)

Unit: nanoseconds
                            expr       min        lq         mean    median        uq       max neval cld
 L10Million[["Element10000000"]] 154965791 165074635 172399030.13 169602043 175170244 235525230   100   c
    LMillion[["Element1000000"]]  15362773  16540057  17379942.70  16967712  17734922  22361689   100  b 
          L1000[["Element1000"]]      9482     10668     17770.94     16594     20544     67557   100 a  
            L100[["Element100"]]       791      1186      3995.15      3556      6322     24100   100 a 


library(ggplot2)
library(dplyr)
res <- data.frame(x = c(100, 1000, 10^6, 10^7), 
           y = c(3556, 16594, 16967715, 169602041)) 

ggplot(res, aes(x = x, y = y)) +
   geom_smooth(method = "lm") +
   geom_point(size = 3) +
   scale_x_log10() +
   scale_y_log10()
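To make the claimed linearity explicit, one can also fit a line to the logged medians: on a log-log scale, a slope near 1 indicates O(n) growth. A small sketch using the median timings reported above:

```r
# Median lookup times (nanoseconds) from the benchmark above,
# against the position of the element looked up.
res <- data.frame(x = c(100, 1000, 10^6, 10^7),
                  y = c(3556, 16594, 16967715, 169602041))

# Slope of log(time) ~ log(position); a value near 1 means linear scaling.
fit <- lm(log10(y) ~ log10(x), data = res)
slope <- unname(coef(fit)["log10(x)"])
slope  # about 0.95, close to 1, consistent with O(n)
```

The slope is not exactly 1 because the two smallest lookups are dominated by fixed overhead rather than the scan itself.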

(log-log plot of lookup time against element position, with a linear fit)

Peter Ellis
  • Nice answer -- but saying that it is both `O(1)` and `O(n)` is an apples and oranges comparison. `O(n)` describes the *worst* case -- which is how complexity is typically measured. It isn't too surprising that the (almost) *best* case is `O(1)`. *Average* case (where names are chosen at random from the set of valid names) will also be `O(n)`. – John Coleman Dec 28 '16 at 21:06
  • WTF, it's not both. The answer is just O(n). – Navin Jan 29 '17 at 12:00
  • @Navin my point was that it depends on which named element you look for. Adding more elements to a list doesn't increase the search time when you look up one of the early elements, as I show. You could say this is an unlikely use case, but it doesn't deserve "wtf". John Coleman's comment is a better response. – Peter Ellis Feb 09 '17 at 07:08