33

I often find questions where people have somehow ended up with an unnamed list of unnamed character vectors and they want to bind them row-wise into a data.frame. Here is an example:

library(magrittr)
data <- cbind(LETTERS[1:3],1:3,4:6,7:9,c(12,15,18)) %>%
  split(1:3) %>% unname
data
#[[1]]
#[1] "A"  "1"  "4"  "7"  "12"
#
#[[2]]
#[1] "B"  "2"  "5"  "8"  "15"
#
#[[3]]
#[1] "C"  "3"  "6"  "9"  "18"

One typical approach is with do.call from base R.

do.call(rbind, data) %>% as.data.frame
#  V1 V2 V3 V4 V5
#1  A  1  4  7 12
#2  B  2  5  8 15
#3  C  3  6  9 18

Perhaps a less efficient approach is with Reduce from base R.

Reduce(rbind,data, init = NULL) %>% as.data.frame
#  V1 V2 V3 V4 V5
#1  A  1  4  7 12
#2  B  2  5  8 15
#3  C  3  6  9 18

However, with more modern packages such as dplyr or data.table, some of the approaches that immediately come to mind don't work, either because the vectors are unnamed or because the list elements are not themselves lists.

library(dplyr)
bind_rows(data)
#Error: Argument 1 must have names
library(data.table)
rbindlist(data)
#Error in rbindlist(data) : 
#  Item 1 of input is not a data.frame, data.table or list

One approach might be to call set_names() on each of the vectors.

library(purrr)
map_df(data, ~set_names(.x, seq_along(.x)))
# A tibble: 3 x 5
#  `1`   `2`   `3`   `4`   `5`  
#  <chr> <chr> <chr> <chr> <chr>
#1 A     1     4     7     12   
#2 B     2     5     8     15   
#3 C     3     6     9     18  

However, this seems like more steps than should be necessary.

Therefore, my question is what is an efficient tidyverse or data.table approach to binding an unnamed list of unnamed character vectors into a data.frame row-wise?

user438383
Ian Campbell
  • 2
    As a side note, `Reduce(rbind, ` cannot be more efficient than `do.call(rbind, ` since the `do.call` construct allocates memory and copies data once, while the `Reduce` construct repeatedly allocates new memory and re-copies all previously "`rbind`ed" elements. – alexis_laz May 06 '20 at 08:36
  • You're quite correct. I didn't expect the performance hit to be as bad as it is: 6,000 times slower on 100,000 rows. I edited the question to call this a "less efficient approach". – Ian Campbell May 06 '20 at 13:14
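
The point made in the comments above can be checked with a small sketch. The test data here (`rows`, with `n` rows of five strings each) is made up for illustration and is not from the question; timings will vary by machine, so none are quoted.

```r
# Hypothetical test data: n rows of 5 character values each
n <- 2000
rows <- lapply(seq_len(n), function(i) as.character(i + 0:4))

# do.call() makes a single rbind() call with all n vectors as arguments
via_do_call <- do.call(rbind, rows)

# Reduce() calls rbind() n - 1 times, growing the matrix one row at a time
via_reduce <- Reduce(rbind, rows)

# Compare contents only; dimnames are dropped because the two
# constructs may label the rows differently
identical(unname(via_do_call), unname(via_reduce))

# Rough timing comparison (results depend on n and on the machine)
system.time(do.call(rbind, rows))
system.time(Reduce(rbind, rows))
```

Both constructs build the same character matrix; only the number of intermediate allocations differs.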

8 Answers

15

Not entirely sure about efficiency, but a compact option using purrr and tibble could be:

map_dfc(purrr::transpose(data), ~ unlist(tibble(.)))

  V1    V2    V3    V4    V5   
  <chr> <chr> <chr> <chr> <chr>
1 A     1     4     7     12   
2 B     2     5     8     15   
3 C     3     6     9     18  
tmfmnk
  • 1
    @Adam updated the post, thank you :) I cannot recall a `tidyverse` function that is faster or as fast as a `data.table` function for the same thing, though. – tmfmnk May 08 '20 at 20:44
11

Edit

Use @sindri_baldur's approach: https://stackoverflow.com/a/61660119/8583393


A way with data.table, similar to what @tmfmnk showed:

library(data.table)
as.data.table(transpose(data))
#   V1 V2 V3 V4 V5
#1:  A  1  4  7 12
#2:  B  2  5  8 15
#3:  C  3  6  9 18
markus
10
library(data.table)
setDF(transpose(data))

  V1 V2 V3 V4 V5
1  A  1  4  7 12
2  B  2  5  8 15
3  C  3  6  9 18
s_baldur
  • 4
    I just ran a benchmark with some other methods. This crushes everything else in terms of speed and is the first one to actually beat the `base::rbind()` solution. –  May 07 '20 at 14:21
  • 3
    @dww Yes, but `setDF()` is different from `as.data.table()` / `as.data.frame()`. – s_baldur May 08 '20 at 08:00
  • 1
    @Adam, Do you think you could update your benchmark with the newer solution? For those unaware of how `setDF()`/`setDT()` work then here is good post: https://stackoverflow.com/a/44938350/4552295 – s_baldur May 10 '20 at 10:00
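
As a rough illustration of the difference mentioned in the comments above, the sketch below shows `setDF()` changing an object in place, with no assignment. The two-row `rows` list is a made-up example in the same shape as the question's data.

```r
library(data.table)

# Hypothetical two-row example in the same shape as the question's data
rows <- list(c("A", "1", "4"), c("B", "2", "5"))

# data.table::transpose() turns the list of rows into a list of columns
cols <- data.table::transpose(rows)
class(cols)   # still a plain list at this point

# setDF() is called for its side effect: no assignment is needed
setDF(cols)
class(cols)   # the same object is now a data.frame
names(cols)   # default names are assigned because the list had none
```

Writing `data.table::transpose()` explicitly avoids picking up `purrr::transpose()` by accident when both packages are attached, since the two functions share a name.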
9

This seems rather compact. I believe this is what powers bind_rows() from dplyr, and therefore map_df() in purrr, so it should be fairly efficient.

library(vctrs)

vec_rbind(!!!data)

This gives a data.frame.

  ...1 ...2 ...3 ...4 ...5
1    A    1    4    7   12
2    B    2    5    8   15
3    C    3    6    9   18

Some Benchmarks

It seems like the .name_repair within the tidyverse methods is a severe bottleneck. I took a few fairly straightforward options that also seemed to run the quickest from the other posts (thanks H 1 and sindri_baldur).

library(microbenchmark)

microbenchmark(vctrs = vec_rbind(!!!data),
               dt = rbindlist(lapply(data, as.list)),
               map = map_df(data, as_tibble_row, .name_repair = "unique"),
               base = as.data.frame(do.call(rbind, data)))

[image: benchmark 1]

But if you first name the vectors (but not necessarily the list elements), you get a different story.

data2 <- modify(data, ~set_names(.x, seq_along(.x)))

microbenchmark(vctrs = vec_rbind(!!!data2),
               dt = rbindlist(lapply(data2, as.list)),
               map = map_df(data2, as_tibble_row),
               base = as.data.frame(do.call(rbind, data2)))

[image: benchmark 2]

In fact, you can include the time to name the vectors into the vec_rbind() solution and not the others, and still see fairly high performance.

microbenchmark(vctrs = vec_rbind(!!!modify(data, ~set_names(.x, seq(.x)))),
               dt = setDF(transpose(data)),
               map = map_df(data2, as_tibble_row),
               base = as.data.frame(do.call(rbind, data)))

[image: final benchmark]

For what it's worth.

  • 1
    You might further improve performance by setting the names to just an integer that doesn't require `paste`. – Ian Campbell May 07 '20 at 14:24
  • 1
    Maybe something like `vctrs::vec_rbind(!!!lapply(data,function(x){attr(x,"names") <- 1:5; x}))`. But for answering everyday questions that people can understand, this is less than ideal. – Ian Campbell May 07 '20 at 14:33
  • 1
    Yeah, that gets a bit quicker than what I just did. But I agree. I am tempted to open a feature request in `vctrs` to see if they can resolve the names ahead of time. I am out of play time for this. But this is an interesting problem. Feel free to edit this post with benchmarks, take them and move them into another post, or anything you like. But I think the setDF() option will be your winner. –  May 07 '20 at 14:40
6

My approach would be to just turn those list entries into the type rbindlist() expects, which is a list:

rbindlist(lapply(data, as.list))
#       V1     V2     V3     V4     V5
#   <char> <char> <char> <char> <char>
#1:      A      1      4      7     12
#2:      B      2      5      8     15
#3:      C      3      6      9     18

If you want the columns converted from character to more appropriate types, lapply can help here as well. The first lapply is called for every row, the second for every column.

rbindlist(lapply(data, as.list))[, lapply(.SD, type.convert)]
       V1    V2    V3    V4    V5
   <fctr> <int> <int> <int> <int>
1:      A     1     4     7    12
2:      B     2     5     8    15
3:      C     3     6     9    18
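
If the factor column shown above is not wanted, one possible variation (an addition here, not part of the original answer) is to pass `as.is = TRUE` through to `type.convert()`, which keeps non-numeric columns as character. The small `data` list below is a made-up stand-in for the question's data.

```r
library(data.table)

# Hypothetical stand-in for the question's list of character vectors
data <- list(c("A", "1", "4"), c("B", "2", "5"))

# Row-wise bind: each character vector becomes a list, i.e. one row
dt <- rbindlist(lapply(data, as.list))

# Column-wise conversion; as.is = TRUE leaves strings as character
dt <- dt[, lapply(.SD, type.convert, as.is = TRUE)]

# Inspect the resulting column classes
sapply(dt, class)
```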
jangorecki
5

An option with unnest_wider

library(tibble)
library(tidyr)
library(stringr)
tibble(col = data) %>%
    unnest_wider(c(col), names_repair = ~ str_c('value', seq_along(.)))
# A tibble: 3 x 5
#  value1 value2 value3 value4 value5
#  <chr>  <chr>  <chr>  <chr>  <chr> 
#1 A      1      4      7      12    
#2 B      2      5      8      15    
#3 C      3      6      9      18    
akrun
3

Here is a slight variation on tmfmnk's suggested approach, using as_tibble_row() to convert the vectors into single-row tibbles. It's also necessary to use the .name_repair argument:

library(purrr)
library(tibble)

map_df(data, as_tibble_row, .name_repair = ~paste0("value", seq(.x)))

# A tibble: 3 x 5
  value1 value2 value3 value4 value5
  <chr>  <chr>  <chr>  <chr>  <chr> 
1 A      1      4      7      12    
2 B      2      5      8      15    
3 C      3      6      9      18
Ritchie Sacramento
1

I think this could be added to an already complete set of very good answers to this question:

library(rlang) # Or purrr

data %>%
  exec(rbind, !!!.) %>%
  as_tibble() %>%
  set_names(~ letters[seq_along(.)])

# A tibble: 3 x 5
  a     b     c     d     e    
  <chr> <chr> <chr> <chr> <chr>
1 A     1     4     7     12   
2 B     2     5     8     15   
3 C     3     6     9     18  
Anoushiravan R