I am currently using the tidyr
package to unnest list columns. However, I am looking for a faster approach and turned to data.table
(where I am a noob). Consider the following example:
dt1 <- data.table::data.table(
a = c("a1", "a2"),
df1 = list(data.frame(
b = c("b1", "b2")
))
)
tidyr::unnest(dt1, df1)
#> # A tibble: 4 x 2
#> a b
#> <chr> <chr>
#> 1 a1 b1
#> 2 a1 b2
#> 3 a2 b1
#> 4 a2 b2
dt1[, data.table::rbindlist(df1), by = .(a)]
#> a b
#> 1: a1 b1
#> 2: a1 b2
#> 3: a2 b1
#> 4: a2 b2
Created on 2021-06-22 by the reprex package (v1.0.0)
I get the same result, but if I have a large data.table
and more columns in by
this approach gives worse performance with data.table
than with tidyr
. Can this be alleviated?
One follow-up question is how to unnest multiple columns with data.table
. Consider this example:
dt2 <- data.table::data.table(
a = c("a1", "a2"),
df1 = list(data.frame(
b = c("b1", "b2")
)),
df2 = list(data.frame(
c = c("c1", "c2")
))
)
tidyr::unnest(dt2, c(df1, df2))
#> # A tibble: 4 x 3
#> a b c
#> <chr> <chr> <chr>
#> 1 a1 b1 c1
#> 2 a1 b2 c2
#> 3 a2 b1 c1
#> 4 a2 b2 c2
Created on 2021-06-22 by the reprex package (v1.0.0)
Using multiple arguments in data.table::rbindlist
doesn't seem to work.
Update: After making a large(r) example to demonstrate my claim about execution time it turns out that tidyr
is quite sensitive to whether the list column contains data.frame
s or data.table
s:
n_inner <- 300
inner_df <- data.frame(
d1 = seq.POSIXt(as.POSIXct("2020-01-01"), as.POSIXct("2021-01-01"), length.out = n_inner),
d2 = seq.POSIXt(as.POSIXct("2020-01-01"), as.POSIXct("2021-01-01"), length.out = n_inner),
d3 = rnorm(n_inner)
)
n_outer <- 400
dt <- data.table::data.table(
a = sample(10, n_outer, replace = TRUE),
b = seq.POSIXt(as.POSIXct("2020-01-01"), as.POSIXct("2021-01-01"), length.out = n_outer),
c = seq.POSIXt(as.POSIXct("2019-01-01"), as.POSIXct("2020-01-01"), length.out = n_outer),
d = rep(list(inner_df), n_outer)
)
bench::mark(check = FALSE,
tidyr = tidyr::unnest(dt, d),
datatable = dt[, data.table::rbindlist(d), by = .(a, b, c)]
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tidyr 14ms 18.7ms 53.2 18MB 26.6
#> 2 datatable 56.2ms 56.2ms 17.8 25.5MB 178.
inner_dt <- data.table::as.data.table(inner_df)
dt$d <- rep(list(inner_dt), n_outer)
bench::mark(check = FALSE,
tidyr = tidyr::unnest(dt, d),
datatable = dt[, data.table::rbindlist(d), by = .(a, b, c)]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tidyr 202.2ms 209.3ms 4.40 28.4MB 19.1
#> 2 datatable 43.5ms 49.9ms 18.3 25.4MB 22.0
Created on 2021-06-22 by the reprex package (v1.0.0)
In my actual usecase I have nested data.frame
s since it comes from JSON parsed with RcppSimdJson
and here tidyr
is faster.