2

I have a list of lists of vectors (not a typo; it really is a list of lists of vectors) that is 76 million items long. That is, there is a list of 76 million items, where each item is a list of two vectors.

All the vectors are of uniform length (6 items).

For example, the data itself looks as follows for list_of_list[1:50]:

dput output

list(list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0)), list(c(4, 
4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)), list(c(4, 4, 1, 0, 1, 0
), c(4, 5, 1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(5, 8, 0, 
0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(5, 5, 0, 2, 0, 0)), list(
    c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 0, 0)), list(c(4, 4, 
1, 0, 1, 0), c(4, 5, 1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), 
    c(4, 4, 1, 0, 1, 0)), list(c(4, 4, 1, 0, 1, 0), c(6, 10, 
1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)), 
    list(c(4, 4, 1, 0, 1, 0), c(5, 7, 2, 0, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(40, 10, 0, 15, 8, 0)), list(c(4, 4, 1, 
    0, 1, 0), c(24L, 7L, 6L, 20L, 8L, 1L)), list(c(4, 4, 1, 0, 
    1, 0), c(39L, 22L, 9L, 5L, 8L, 1L)), list(c(4, 4, 1, 0, 1, 
    0), c(34, 36, 17, 15, 0, 2)), list(c(4, 4, 1, 0, 1, 0), c(36L, 
    42L, 18L, 4L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(4, 5, 
    1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(4, 8, 3, 0, 0, 
    0)), list(c(4, 4, 1, 0, 1, 0), c(3, 1, 2, 2, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(6, 9, 0, 1, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(5, 5, 0, 2, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(6, 10, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(6, 
    10, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 15, 0, 0, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 0, 0)), 
    list(c(4, 4, 1, 0, 1, 0), c(4, 2, 1, 2, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(28, 24, 19, 14, 4, 0)), list(c(4, 4, 1, 
    0, 1, 0), c(40, 56, 19, 11, 0, 0)), list(c(4, 4, 1, 0, 1, 
    0), c(32L, 33L, 14L, 17L, 1L, 2L)), list(c(4, 4, 1, 0, 1, 
    0), c(24L, 55L, 11L, 16L, 6L, 1L)), list(c(4, 4, 1, 0, 1, 
    0), c(27, 10, 6, 19, 8, 0)), list(c(4, 4, 1, 0, 1, 0), c(31, 
    21, 11, 19, 4, 0)), list(c(4, 4, 1, 0, 1, 0), c(37L, 60L, 
    12L, 7L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(29L, 8L, 3L, 
    18L, 8L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(21L, 24L, 20L, 
    14L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(6, 10, 1, 0, 0, 
    0)), list(c(4, 4, 1, 0, 1, 0), c(5, 9, 2, 0, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(7, 13, 0, 0, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(6, 12, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(5, 8, 1, 1, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 
    7, 0, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 6, 1, 1, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(4, 3, 0, 3, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(3, 2, 3, 1, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(4, 4, 1, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 
    3, 2, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 7, 0, 2, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 1, 2, 2, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(6, 7, 0, 1, 0, 0)))

Just FYI, the list of lists was made with combn(), using this template: combn(focal_list, 2, simplify = FALSE)
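For context, here is a minimal sketch (with a hypothetical three-element focal_list) of how that call produces this shape:

```r
# Hypothetical three-element focal_list; combn() returns all choose(3, 2) = 3
# pairs, each pair itself a list of two length-6 vectors
focal_list <- list(c(4, 4, 1, 0, 1, 0),
                   c(3, 3, 2, 2, 0, 0),
                   c(4, 5, 1, 0, 0, 1))
list_of_list <- combn(focal_list, 2, simplify = FALSE)
length(list_of_list)       # 3 pairs
length(list_of_list[[1]])  # each pair holds 2 vectors
```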

Is there a computationally efficient way to turn this into a table of two columns where each row is one item from the list of lists? All the first vectors become the first column and all the second vectors become the second column?

I tried the following, and it just kept going after 10-12 minutes with no output, which is too expensive for my use case:

dt <- data.table(col1 = lapply(1:length(list_of_list), function(x) list_of_list[[x]][1]),
                 col2 = lapply(1:length(list_of_list), function(x) list_of_list[[x]][2]))

I could use a foreach loop to disentangle the deeply nested object, read the vectors in as character strings separated by a simple delimiter, and then use another foreach loop to create a data.table. But before I do that: is there a simpler way in R that I am missing?

Please note, for clarification, that I want to maintain the vector()-like nature of the lowest-level items, i.e. when you make a table out of the list of lists, each item should be a vector and the data.table should have two columns. R seems to like flattening vectors and lists when making tables.

Sudoh
  • Please use `dput(list_of_lists[1:5])` on your data and paste the output into your question. If it's too big, make it more minimal. Your data can't be easily reconstructed from your current example. – Ritchie Sacramento Jul 13 '23 at 12:12
  • Pasted the `dput` output of a sample of the first 50 items from the primary list. – Sudoh Jul 13 '23 at 12:53
  • 1
    With 76 million objects to manipulate, I doubt any solution is going to be instant... – Limey Jul 13 '23 at 12:55
  • `list_of_list` --> `list_of_lists`? or something else? Also it has too many close-parens. – r2evans Jul 13 '23 at 13:01
  • Do you want `data.table::rbindlist(list_of_lists)`? – Ritchie Sacramento Jul 13 '23 at 13:05
  • Sorry, that was a typo! Edited. – Sudoh Jul 13 '23 at 13:07
  • @RitchieSacramento, I tried that and it returned two unnested columns, very-long-format. The OP's `dt` is two list-columns, each containing numeric vectors length-6 (I think all were len-6, at least). – r2evans Jul 13 '23 at 13:09
  • @r2evans Here is a [link](https://wormhole.app/6XdE3#Bc20gJ5nwuxG3xKTGSvaFA) to the object itself on wormhole. If you feel like playing with it, it is 468 megs and is absolutely massive. – Sudoh Jul 13 '23 at 13:15
  • @RitchieSacramento I have tried `data.table::rbindlist(list_of_lists)` by itself and inside a `foreach` loop, it flattens the vectors or simplifies them. – Sudoh Jul 13 '23 at 13:16
  • @r2evans - I'm not going to download the file and test, but OP could try assigning an id to the groups and collapsing, `my_dt <- rbindlist(list_of_lists, idcol = "id"); my_dt[, list(dat = list(as.list(.SD))), by = id]`. – Ritchie Sacramento Jul 13 '23 at 13:21
  • what of `do.call(rbind, lapply(list_of_lists, do.call, what = cbind))`? or simply a `c++` for-loop? – Onyambu Jul 13 '23 at 13:27
  • @RitchieSacramento, nope, using the data in the question this does not preserve two list-columns. While we might be able to add code to your effort that would _restore_ the original two sets of lists from the not-list-columns, I suggest that this starts to be not-more-efficient than the OP's `lapply(...)`x2 solution which preserves the lists without jumping back and forth. I'm inferring that the OP was looking for an `data.table` in-place operation to avoid copies and such. – r2evans Jul 13 '23 at 13:32
  • I'd go with @RitchieSacramento's first suggestion, and then work out the indexing when I needed those 6 element vectors. That is, `my_dt <- rbindlist(list_of_lists); getvec <- function(i, j) my_dt[(i-1)*6 + 1:6, j]; getvec(3, 2)` to get the 3rd vector from column 2. – user2554330 Jul 13 '23 at 13:34
  • 1
    @r2evans Yep, breaking structure only to reconstitute it later would increase the computational price rather than decrease it. – Sudoh Jul 13 '23 at 13:35
  • @Sudoh, your `dt` (desired output) has very-nested lists; `as.data.table(purrr::list_transpose(list_of_list))` is slightly different, does it fit your needs? If so, is it any faster? – r2evans Jul 13 '23 at 13:43
  • Use Rcpp to be able to carry out the transformation. 76million is a huge list – Onyambu Jul 13 '23 at 14:10

3 Answers

1

I think you have several approaches available, for example

  • rbindlist + rapply
rbindlist(rapply(list_of_list, list, how = "replace"))
  • as.data.frame + rbind
as.data.frame(do.call(rbind, list_of_list))

However, the second option, i.e., the base R approach as.data.frame + rbind, seems much faster than the first one (see the benchmark below)

library(data.table)
library(microbenchmark)

microbenchmark(
    f1 = rbindlist(rapply(list_of_list, list, how = "replace")),
    f2 = as.data.frame(do.call(rbind, list_of_list)),
    check = "equivalent"
)

which gives

Unit: microseconds
 expr   min    lq    mean median     uq   max neval
   f1 138.7 168.7 177.896 174.10 185.00 392.6   100
   f2  31.7  38.5  45.127  43.55  50.25  88.8   100
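To see that the nested structure is kept, here is a quick check on a made-up two-pair sample shaped like the question's data:

```r
library(data.table)

# Made-up two-pair sample in the same shape as the question's data
list_of_list <- list(list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0)),
                     list(c(4, 4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)))

dt <- rbindlist(rapply(list_of_list, list, how = "replace"))
str(dt)  # 2 rows, two list-columns, each cell a length-6 numeric vector
```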
ThomasIsCoding
  • Hey, @ThomasIsCoding, thanks for coming back! I was able to use your first answer this morning to get the pipeline going. I tried and failed to use `foreach` with chunking to speed up the `rapply` approach, but yeah, `as.data.frame(do.call(rbind, list_of_list))` does seem much faster. Thank you so much! – Sudoh Jul 13 '23 at 20:35
0

I would suggest you use Rcpp, as in the code below. Since you have 76 million items, I recommend running the data in batches, e.g. 10 million each. On my computer it takes 8 seconds to convert 10 million items into a matrix, meaning that if you do this 8 times, it will take approximately 70-80 seconds. Store the resulting matrix batches, then combine them into one, probably by writing them into one file on the hard drive.

Rcpp::cppFunction(
'NumericVector combineList(std::vector< std::vector<std::vector<double>>> x){
    int n = x.size();       // number of pairs
    int m = x[0].size();    // vectors per pair (2)
    int p = x[0][0].size(); // vector length (6)
    std::vector<double> y(n*p*m);
    // rows i*p .. i*p+p-1 of column j hold vector j of pair i
    for(int i = 0; i < n; i++)
        for(int j = 0; j < m; j++)
            for(int k = 0; k < p; k++)
                y[p * (i + n * j) + k] = x[i][j][k];
    NumericVector z = wrap(y);
    z.attr("dim") = Dimension(n*p, m);
    return z;
}'
)

combineList(list_of_lists)
       [,1] [,2]
  [1,]    4    3
  [2,]    4    3
  [3,]    1    2
  [4,]    0    2
  [5,]    1    0
  [6,]    0    0
  [7,]    4    3
  [8,]    4    4
  [9,]    1    3
 [10,]    0    1
 [11,]    1    0
 [12,]    0    0
 [13,]    4    4
 [14,]    4    5
 [15,]    1    1
 [16,]    0    0
 [17,]    1    0
 [18,]    0    1
 [19,]    4    5
 [20,]    4    8
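The batching suggested above could be sketched like this (the chunk size and toy input are assumptions; convert() is a base-R stand-in for combineList() so the sketch runs without a compiler, and the per-chunk matrices are row-bound at the end rather than written to disk):

```r
# Base-R stand-in for combineList(): cbind each pair into a 6 x 2 matrix,
# then stack the matrices row-wise
convert <- function(x) do.call(rbind, lapply(x, function(p) do.call(cbind, p)))

# Toy input: 25 identical pairs, just to exercise the chunking
list_of_lists <- rep(list(list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0))), 25)

chunk_size <- 10  # would be ~10 million for the real data
idx <- split(seq_along(list_of_lists),
             ceiling(seq_along(list_of_lists) / chunk_size))
parts <- lapply(idx, function(i) convert(list_of_lists[i]))
result <- do.call(rbind, parts)
dim(result)  # 150 rows (25 pairs * 6) by 2 columns
```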
Onyambu
0

I was able to solve this fairly easily using this one line of code:

rbindlist(rapply(list_of_list, list, how = "replace"))

The fascinating part is that the above code processes all 76 million items in about two minutes, no Rcpp required (though I can't say whether the packages use Rcpp under the hood).
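For what it's worth, the trick works because rapply(..., list, how = "replace") wraps every leaf vector in a one-element list, which tells rbindlist() to store each vector as a single list-column cell instead of unnesting it. A minimal illustration:

```r
library(data.table)

x <- list(list(c(1, 2), c(3, 4)),
          list(c(5, 6), c(7, 8)))

dt <- rbindlist(rapply(x, list, how = "replace"))
dt$V1[[2]]  # the vector c(5, 6) survives intact as a list-column cell
```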

Sudoh