I'm looking for a speedy solution for randomly subsetting vectors nested in a list.
If we simulate the following data, we get a list l
that holds 3 million vectors inside, each one is of length 5. But I want the length of each vector to vary. So I thought I should apply a function that randomly subsets each vector. The problem is, this method is not as speedy as I wished.
simulate data: the list l
library(stringi)
set.seed(123)
vec_n <- 15e6
vec_vals <- 1:vec_n
vec_names <- stringi::stri_rand_strings(vec_n, 5)
my_named_vec <- setNames(vec_vals, vec_names)
split_func <- function(x, n) {
unname(split(x, rep_len(1:n, length(x))))
}
l <- split_func(my_named_vec, n = vec_n / 5)
head(l)
#> [[1]]
#> HmPsw Qk8NP Quo3T 8f0GH nZmjN
#> 1 3000001 6000001 9000001 12000001
#>
#> [[2]]
#> 2WtYS ZaHFl 6YjId jbGuA tAG65
#> 2 3000002 6000002 9000002 12000002
#>
#> [[3]]
#> xSgZ6 jM5Uw ujPOc CTV5F 5JRT5
#> 3 3000003 6000003 9000003 12000003
#>
#> [[4]]
#> tF2Kx r4ZCI Ooklo VOLHU M6z6H
#> 4 3000004 6000004 9000004 12000004
#>
#> [[5]]
#> tgdze w8d1B FYERK jlClo NQfsF
#> 5 3000005 6000005 9000005 12000005
#>
#> [[6]]
#> hXaH9 gsY1u CjBwC Oqqty dxJ4c
#> 6 3000006 6000006 9000006 12000006
Now that we have l
, I wish to subset each vector randomly: meaning that the number of elements being subsetted (per vector) will be random. So one option is to set the following utility function:
randomly_subset_vec <- function(x) {
my_range <- 1:length(x)
x[-sample(my_range, sample(my_range))]
}
lapply(head(l), randomly_subset_vec)
#> [[1]]
#> Quo3T
#> 6000001
#>
#> [[2]]
#> 6YjId jbGuA
#> 6000002 9000002
#>
#> [[3]]
#> xSgZ6 jM5Uw ujPOc CTV5F
#> 3 3000003 6000003 9000003
#>
#> [[4]]
#> Ooklo
#> 6000004
#>
#> [[5]]
#> named integer(0)
#>
#> [[6]]
#> CjBwC Oqqty dxJ4c
#> 6000006 9000006 12000006
But running this procedure over the entire l
takes forever. I've tried using rrapply
which is a fast package for dealing with lists, and it takes "only" 110 seconds on my machine.
library(rrapply)
library(tictoc)
tic()
l_subsetted <- rrapply(object = l, f = randomly_subset_vec)
toc()
#> 110.23 sec elapsed
I will be happy with either of the following:
- Is there a speedier alternative to:
rrapply(object = l, f = randomly_subset_vec)
- Or more generally, is there a speedier way to start with
my_named_vec
and arrive atl_subsetted
?