4

I have a large data set, which has a list-column that contains nested named vectors. I want to unnest those vectors into two new columns:

  • one column for the vector elements
  • another column for the vector's element names

Right now, the only straightforward way I've found is with tidyr::unnest_longer(). While it's great for small data objects, I find it too slow when dealing with very large data. Hence, I'm looking for a speedier alternative.

I've seen one answer that brings me close, but not exactly what I need: a base R alternative to unnest_wider() but much faster. However, I'm looking for a fast solution that mimics what unnest_longer does.

Be it a solution that's based on base R, data.table, rrapply, or collapse -- all are welcome, as long as it can cut down the processing time.

Reproducible Example

data

library(stringi)
library(tidyr)
library(magrittr, warn.conflicts = FALSE)

# simulate data
set.seed(123)
vec_n <- 1e6
vec_vals  <- 1:vec_n
vec_names <- stringi::stri_rand_strings(vec_n, 5)

my_named_vec <- setNames(vec_vals, vec_names)

split_func <- function(x, n) {
  unname(split(x, rep_len(1:n, length(x))))
} 

my_tbl <-
  tibble(col_1 = sample(split_func(my_named_vec, n = vec_n / 5)))

So my_tbl is the given data object that I need to "unnest". A brief preview of the data structure reveals a nested named vector in each row of column col_1.

# preview my_tbl
my_tbl
#> # A tibble: 200,000 x 1
#>    col_1    
#>    <list>   
#>  1 <int [5]>
#>  2 <int [5]>
#>  3 <int [5]>
#>  4 <int [5]>
#>  5 <int [5]>
#>  6 <int [5]>
#>  7 <int [5]>
#>  8 <int [5]>
#>  9 <int [5]>
#> 10 <int [5]>
#> # ... with 199,990 more rows

head(my_tbl$col_1)
#> [[1]]
#>  9YAGC  hTjlr  vgxjQ  y4qG2  R1fUE 
#>  56356 256356 456356 656356 856356 
#> 
#> [[2]]
#>  nz5rk  ZvEe6  ustHv  2TdA8  Rreqn 
#> 119257 319257 519257 719257 919257 
#> 
#> [[3]]
#>  ubbWp  aw6zR  ab0Ax  N747j  GY1xU 
#>   4663 204663 404663 604663 804663 
#> 
#> [[4]]
#>  JHo4w  otk4s  BTZ3h  zlAKU  svSgH 
#>  75297 275297 475297 675297 875297 
#> 
#> [[5]]
#>  A1pxZ  T7y0l  0ixE2  DRBxP  IBqxe 
#>  19495 219495 419495 619495 819495 
#> 
#> [[6]]
#>  fDkau  Z7tmy  TIzgx  nKANU  Bqwo1 
#> 184074 384074 584074 784074 984074

"Unnesting"

if we just had 10 rows

my_tbl[1:10, ] %>%
  tidyr::unnest_longer(col_1)
#> # A tibble: 50 x 2
#>     col_1 col_1_id
#>     <int> <chr>   
#>  1  56356 9YAGC   
#>  2 256356 hTjlr   
#>  3 456356 vgxjQ   
#>  4 656356 y4qG2   
#>  5 856356 R1fUE   
#>  6 119257 nz5rk   
#>  7 319257 ZvEe6   
#>  8 519257 ustHv   
#>  9 719257 2TdA8   
#> 10 919257 Rreqn   
#> # ... with 40 more rows

But in my_tbl there are 200,000 rows, so this tidyr::unnest_longer() takes a lot of time:

my_benchmark <-
  bench::mark(tidyrunnestlonger = 
              my_tbl %>%
              tidyr::unnest_longer(col_1)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

my_benchmark
#> # A tibble: 1 x 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidyrunnestlonger    2.47m    2.47m   0.00674     309MB     1.50

Created on 2021-08-23 by the reprex package (v2.0.0)


And whereas my_tbl has 200,000 rows, my real data has over 1 million rows in such kind of format. So I'm looking for the fastest solution possible.

Thanks!

Emman
  • 3,695
  • 2
  • 20
  • 44
  • 2
    There are several links in the `data.table` feature request [Create an efficient unnest function](https://github.com/Rdatatable/data.table/issues/2146) which may be helpful. Perhaps also [proper nest/unnest functions](https://github.com/Rdatatable/data.table/issues/3672) – Henrik Aug 23 '21 at 19:37
  • So what's wrong with `unlist`? it runs instantly – David Arenburg Sep 05 '21 at 15:43

2 Answers2

6

You can just unlist your column:

x<-unlist(my_tbl[[1]])
res<-tibble(col_1=unname(x),col_1_id=names(x))
res
## A tibble: 1,000,000 x 2
#    col_1 col_1_id
#    <int> <chr>   
# 1  56356 9YAGC   
# 2 256356 hTjlr   
# 3 456356 vgxjQ   
# 4 656356 y4qG2   
# 5 856356 R1fUE   
# 6 119257 nz5rk   
# 7 319257 ZvEe6   
# 8 519257 ustHv   
# 9 719257 2TdA8   
#10 919257 Rreqn   
## … with 999,990 more rows
nicola
  • 24,005
  • 3
  • 35
  • 56
1

You could try tibble::enframe on each element, then unnest. I was poking around this answer looking for ways to convert all of the named vectors into dataframes before flattening: Convert Named Character Vector to data.frame.

my_tbl %>%
  mutate(col_1 = map(col_1, enframe)) %>%
  unnest(col_1)

A quick benchmark

library(tidyverse) # purrr, tibble, dplyr, tidyr

microbenchmark::microbenchmark(
  tidyrunnestlonger = 
    my_tbl %>%
    unnest_longer(col_1),
  enframe = 
    my_tbl %>%
    mutate(col_1 = map(col_1, enframe)) %>%
    unnest(col_1),
  times = 1
)

#-------------
Unit: seconds
              expr        min         lq       mean     median         uq        max neval
 tidyrunnestlonger 101.419181 101.419181 101.419181 101.419181 101.419181 101.419181     1
           enframe   6.140771   6.140771   6.140771   6.140771   6.140771   6.140771     1

nniloc
  • 4,128
  • 2
  • 11
  • 22