I have a large data set, which has a list-column that contains nested named vectors. I want to unnest those vectors into two new columns:
- one column for the vector elements
- another column for the vector's element names
Right now, the only straightforward way I've found is with tidyr::unnest_longer(). While it's great for small data objects, I find it too slow when dealing with very large data. Hence, I'm looking for a speedier alternative.
I've seen one answer that brings me close, but not exactly what I need: a base R alternative to unnest_wider(), but much faster. However, I'm looking for a fast solution that mimics what unnest_longer() does.
Be it a solution that's based on base R, data.table, rrapply, or collapse -- all are welcome, as long as it cuts down the processing time.
Reproducible Example
Data
library(stringi)
library(tidyr)
library(magrittr, warn.conflicts = FALSE)
# simulate data
set.seed(123)
vec_n <- 1e6
vec_vals <- 1:vec_n
vec_names <- stringi::stri_rand_strings(vec_n, 5)
my_named_vec <- setNames(vec_vals, vec_names)
split_func <- function(x, n) {
  unname(split(x, rep_len(1:n, length(x))))
}
my_tbl <-
  tibble(col_1 = sample(split_func(my_named_vec, n = vec_n / 5)))
So my_tbl is the given data object that I need to "unnest". A brief preview of the data structure reveals a nested named vector in each row of column col_1.
# preview my_tbl
my_tbl
#> # A tibble: 200,000 x 1
#> col_1
#> <list>
#> 1 <int [5]>
#> 2 <int [5]>
#> 3 <int [5]>
#> 4 <int [5]>
#> 5 <int [5]>
#> 6 <int [5]>
#> 7 <int [5]>
#> 8 <int [5]>
#> 9 <int [5]>
#> 10 <int [5]>
#> # ... with 199,990 more rows
head(my_tbl$col_1)
#> [[1]]
#> 9YAGC hTjlr vgxjQ y4qG2 R1fUE
#> 56356 256356 456356 656356 856356
#>
#> [[2]]
#> nz5rk ZvEe6 ustHv 2TdA8 Rreqn
#> 119257 319257 519257 719257 919257
#>
#> [[3]]
#> ubbWp aw6zR ab0Ax N747j GY1xU
#> 4663 204663 404663 604663 804663
#>
#> [[4]]
#> JHo4w otk4s BTZ3h zlAKU svSgH
#> 75297 275297 475297 675297 875297
#>
#> [[5]]
#> A1pxZ T7y0l 0ixE2 DRBxP IBqxe
#> 19495 219495 419495 619495 819495
#>
#> [[6]]
#> fDkau Z7tmy TIzgx nKANU Bqwo1
#> 184074 384074 584074 784074 984074
"Unnesting"
If we just had 10 rows:
my_tbl[1:10, ] %>%
tidyr::unnest_longer(col_1)
#> # A tibble: 50 x 2
#> col_1 col_1_id
#> <int> <chr>
#> 1 56356 9YAGC
#> 2 256356 hTjlr
#> 3 456356 vgxjQ
#> 4 656356 y4qG2
#> 5 856356 R1fUE
#> 6 119257 nz5rk
#> 7 319257 ZvEe6
#> 8 519257 ustHv
#> 9 719257 2TdA8
#> 10 919257 Rreqn
#> # ... with 40 more rows
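For reference, here's a base R sketch of my own (not taken from any answer) that reproduces those same two columns for the 10-row case by flattening the list in a single unlist() pass. It assumes every element of col_1 is a named atomic vector, which holds for this data:

```r
# flatten the 10-row subset once; element names survive unlist()
v <- unlist(my_tbl$col_1[1:10], use.names = TRUE)
data.frame(col_1 = unname(v), col_1_id = names(v))
```

I haven't checked whether this stays fast at scale, which is really the crux of my question.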
But in my_tbl there are 200,000 rows, so tidyr::unnest_longer() takes a lot of time:
my_benchmark <-
  bench::mark(
    tidyrunnestlonger = my_tbl %>% tidyr::unnest_longer(col_1)
  )
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
my_benchmark
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tidyrunnestlonger 2.47m 2.47m 0.00674 309MB 1.50
Created on 2021-08-23 by the reprex package (v2.0.0)
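For what it's worth, my naive hunch is that the cost comes from per-row work, so a single unlist() pass feeding data.table might avoid it -- something like the sketch below. This is just my own attempt (it again assumes all elements of col_1 are named atomic vectors of one type), and I haven't benchmarked it against unnest_longer() at the full scale:

```r
library(data.table)

# one pass over the whole list-column; names survive unlist()
v <- unlist(my_tbl$col_1, use.names = TRUE)
dt_attempt <- data.table(col_1 = unname(v), col_1_id = names(v))
```

If a variant of this (or something entirely different) is the idiomatic fast way, I'd love to see it spelled out with a benchmark.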
And whereas my_tbl has 200,000 rows, my real data has over 1 million rows in this kind of format. So I'm looking for the fastest solution possible.
Thanks!