9

When reading records from a database cursor it is often unknown in advance how many rows are coming up. This makes it impossible to pre-allocate a list of the right size to store these objects.

What is an efficient way of storing all records in a list, when we the total size is unknown? The basic list type is slow because it will copy the entire list every time an element is appended:

x <- list()
for(i in 1:1e5){
  x[[i]] <- list("foo" = rnorm(3), bar = TRUE)
}

An environment is more efficient, but it is a map rather than an ordered set. So we need to convert the index into a string and then later sort the keys to retrieve the values, which seems suboptimal:

env <- new.env()
for(i in 1:1e5){
  env[[sprintf("%09d", i)]] <- list("foo" = rnorm(3), bar = TRUE)
}
x <- lapply(sort(ls(env)), get, env, inherits = FALSE)

A pairlist is supposed to be a linked list in R, however R seems to coerse it into a regular list whenever an element is appended from within R.

Jeroen Ooms
  • 31,998
  • 35
  • 134
  • 207
  • Where does it say that `pairlist` is a linked list? I think that’s very unlikely, given its performance characteristics (indexed access is almost certainly constant, not linear, according to a quick `microbenchmark` test). [The documentation mentions “dotted pair list” but that may just refer to the origin of the `pairlist` design, rather than a concrete implementation.] – Konrad Rudolph Apr 05 '15 at 20:26
  • How about putting them in an environment? That's essentially a (hashed) dictionary. You can just number the records and call them `1`, `2`, etc. in the environment. This should be fast. And then convert to a list in the end. – Gabor Csardi Apr 05 '15 at 22:00
  • That is described in the post. But sorting keys and extracting values in the right order is a bit inefficient. – Jeroen Ooms Apr 05 '15 at 22:13
  • 3
    @KonradRudolph, a pairlist (`LISTSXP`) is indeed a linked list. See https://github.com/wch/r-source/blob/trunk/src/main/gram.y#L1055-L1104 for an example of how R uses them for the parser. – Kevin Ushey Apr 05 '15 at 23:12
  • @Jeroen Yes, sorry for not reading the question! :( You can actually convert environments to (sorted) lists faster, see my answer. – Gabor Csardi Apr 07 '15 at 01:23

6 Answers6

6

This is slow:

 > x <- list()
 > for(i in 1:1e5){x[[i]]=list(foo=rnorm(3),bar=TRUE)}

I gave up waiting. But this is fast, almost instantaneous:

 > x <- list()
 > length(x)=1e5
 > for(i in 1:1e5){x[[i]]=list(foo=rnorm(3),bar=TRUE)}

So I reckon the way to go is to extend the list length by 10000 every time, and prune it back when you get to the last element and know the final output.

 > length(x)=2e5  # extend by another 1e5
 > for(i in 1:1e5){x[[i+1e5]]=list(foo=rnorm(3),bar=TRUE)}
 > length(x)=3e5  # and again... but this time only 100 more elts:
 > for(i in 1:100){x[[i+2e5]]=list(foo=rnorm(3),bar=TRUE)}
 > length(x) = 2e5 + 100

Another approach is to double the list size each time you need more elements.

Spacedman
  • 92,590
  • 12
  • 140
  • 224
  • Well I don't know the value for `1e5` in advance. It might be 10 records, it might be 10 million. This still results in a lot of copying... – Jeroen Ooms Apr 05 '15 at 20:14
  • 1
    Which is why you might want to double the length of the list every time. Possible dupe anyway now I have a look: http://stackoverflow.com/questions/17046336/here-we-go-again-append-an-element-to-a-list-in-r?lq=1 – Spacedman Apr 05 '15 at 20:16
4

I think you need to dive into C / C++ to do this most efficiently -- R doesn't really provide any facilities at the R language level for modifying things in place (including pairlists, but excepting environments), so I would suggest:

  1. Using e.g. C++ STL containers, which can grow efficiently, then coercing that back to whatever output you need once you have it, or

  2. Just using plain old R pairlists, which you can interact with and extend fairly readily at the C level, and then coercing that at the end (if necessary).

You can, of course, use method (1) yourself with plain R by making something like a 'growable' vectors (e.g. track the capacity, double it whenever necessary, and then shrink to fit at the end) but generally when you need this much low level control it's worth dipping into C / C++.

Kevin Ushey
  • 20,530
  • 5
  • 56
  • 88
4

I understand that you probably don't need this answer, so this just for the record: environments are not that slow, you just need to convert them to list "properly".

This is your code, for reference. Yes, this is slow.

system.time({
  env <- new.env()
  for(i in 1:1e5){
    env[[sprintf("%09d", i)]] <- list("foo" = rnorm(3), bar = TRUE)
  }
})
#>    user  system elapsed 
#>   1.583   0.034   1.632 

system.time(
  x <- lapply(sort(ls(env)), get, env, inherits = FALSE)
)
#>    user  system elapsed 
#>   1.595   0.014   1.629 

A slightly faster way to put in the elements:

system.time({
  env <- new.env()
  for(i in 1:1e5){
    env[[as.character(i)]] <- list("foo" = rnorm(3), bar = TRUE)
  }
})
#>    user  system elapsed 
#>   1.039   0.023   1.072 

Not as fast as pre-allocated lists, but almost:

system.time({
  l <- list()
  length(l) <- 1e5
  for(i in 1:1e5){
    l[[i]] <- list("foo" = rnorm(3), bar = TRUE)
  }
})

#>    user  system elapsed 
#>   0.870   0.013   0.889 

A much faster way to convert the environment to a sorted list:

system.time({
  x <- as.list(env, sorted = FALSE)
  x <- x[order(as.numeric(names(x)))]
})

#>    user  system elapsed 
#>   0.073   0.000   0.074 

If this is fast enough for you, then it is a lot easier than C code and/or re-allocating storage.

Gabor Csardi
  • 10,705
  • 1
  • 36
  • 53
  • Clever way to convert the environment to a list! By the way, for populating the environment or list, most of the time is spent in the `list("foo" = ...)`. Using a constant value reduces the time 0.09 seconds on my computer – wch Apr 17 '15 at 03:25
3

Below a C implementation of c() for two pairlists.

#include <Rinternals.h>

SEXP C_join_pairlist(SEXP x, SEXP y) {
  if(!isPairList(x) || !isPairList(y))
    Rf_error("x and y must be pairlists");

  //special case
  if(x == R_NilValue)
    return y;

  //find the tail of x
  SEXP tail = x;
  while(CDR(tail) != R_NilValue)
    tail = CDR(tail);

  //append to tail
  SETCDR(tail, y);
  return x;
}

And a simple R wrapper:

join_pairlist <- function(x, values){
  .Call(C_join_pairlist, x, values)
}

You use it like this:

> x <- pairlist("foo", "bar")
> y <- pairlist("baz", "bla", "boe")
> x <- join_pairlist(x,y)
[1] TRUE

> print(x)
[[1]]
[1] "foo"

[[2]]
[1] "bar"

[[3]]
[1] "baz"

[[4]]
[1] "bla"

[[5]]
[1] "boe"

This is efficient, but also dangerous because it changes the value of x without duplicating it. It is quite easy to accidentally introduce circular references this way.

Jeroen Ooms
  • 31,998
  • 35
  • 134
  • 207
3

A little while back I did some experiments with implementing stacks and queues with pairlists vs. lists in R, and I put them in this package: https://github.com/wch/qstack. I've added some benchmarks to the README.

The short version: using a pairlist isn't really faster than using a list and doubling as it grows. Also:

  • It's dangerous to directly modify a pairlist that's used by any other R code, because R assumes that pairlists are copy-on-modification.
  • Lists are more memory efficient because items don't need pointers to the next item.
  • Random access in a list is much faster than in a pairlist.
  • Pairlists can be faster if you need to insert or remove items in the middle of them (using C).
wch
  • 4,069
  • 2
  • 28
  • 36
0

An example of a repeatedly doubling list implementation is in my answer at https://stackoverflow.com/a/32870310/264177. It is not implemented as linked list but as an expanding array. It is quite a bit faster than other alternatives for large data sets.

I actually built that for the same problem you are describing here, storing a large number of items retrieved from a database where you do not know the number of items before hand.

Community
  • 1
  • 1
JanKanis
  • 6,346
  • 5
  • 38
  • 42