Most efficient way to bind data frames (over 10^8 columns) based on column names

Question

What is the most efficient way to rbind data frames based on column names? Not all of the data frames have the same column names, so I expect NA values to be introduced in the process.

Here is a reproducible example of what I am talking about, but keep in mind that the data frame size is 1 row by ~10^8 columns for each data frame. I have a list of 100 data frames like this.

a <- as.data.frame(t(as.data.frame(c(1, 4, 5, 3, 7, 3, 5))))
rownames(a) <- NULL
colnames(a) <- c("AA", "DD", "CD", "KD", "DSF", "DFS", "RF")

b <- as.data.frame(t(as.data.frame(c(4, 7, 3, 2, 7, 3))))
rownames(b) <- NULL
colnames(b) <- c("AA", "DFS", "CD", "UF", "KD", "DD")


c <- as.data.frame(t(as.data.frame(c(2, 4, 7, 3))))
rownames(c) <- NULL
colnames(c) <- c("AA", "NF", "CD", "UF")

list <- list(a, b, c)

Thanks!

`bind_rows` will definitely do the job, but a data frame with 100 million columns seems prohibitively large. Are you sure the data absolutely must be stored that way? — jdobres, Apr 08 '18 at 15:48
Yes, I have over 100 million features that I then need to run feature selection on. Any suggestions for feature selection on a dataset of this size would also be appreciated, though that is outside the scope of this question. — Keshav M, Apr 08 '18 at 15:52
Data frames carry a lot of computational overhead. I would recommend storing these data in a matrix instead. — jdobres, Apr 08 '18 at 15:58
@jdobres Neither of the below solutions seem to work for matrices. Do you have any suggestions that would work on a matrix? — Keshav M, Apr 08 '18 at 16:07

score 2 · Accepted Answer · answered Apr 08 '18 at 15:46

We can use `bind_rows` from dplyr, which matches columns by name and fills the missing ones with NA:

library(dplyr)
bind_rows(list)

Or `rbindlist` from data.table, where `fill = TRUE` fills the missing columns with NA:

library(data.table)
rbindlist(list, fill = TRUE)
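
To make the expected result concrete, here is a minimal base-R sketch of the same by-name binding on two small frames. The helper name `bind_by_name` is hypothetical (it is not part of dplyr or data.table), and the sketch only illustrates where the NA values end up; it is not meant to be fast at 10^8 columns.

```r
# Hypothetical helper: row-bind data frames by column name, filling gaps with NA.
bind_by_name <- function(dfs) {
  # Union of all column names, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    # Add each column this frame lacks, filled with NA
    for (m in setdiff(all_cols, names(d))) d[[m]] <- NA_real_
    d[all_cols]  # reorder so every frame has the same column layout
  })
  do.call(rbind, padded)
}

x <- data.frame(AA = 1, DD = 4)
y <- data.frame(AA = 2, UF = 3)
res <- bind_by_name(list(x, y))
print(res)
```

Each input frame contributes one row, and a column that a frame does not have shows up as NA in that row.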

answered Apr 08 '18 at 15:46

akrun

Just want to add a bit more to this: `rbindlist` can be much faster than `bind_rows` and `rbind`; see http://www.win-vector.com/blog/2015/07/efficient-accumulation-in-r/ – Tung Apr 08 '18 at 15:57
This wasn't in the question, but as matrices were suggested, do you have any solutions that would work for matrices? The above ones seem to not work with matrices. Thanks! – Keshav M Apr 08 '18 at 16:10
@KeshavM: read the link I posted. That option is there – Tung Apr 08 '18 at 16:51
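
For the matrix case raised in the comments, one possible approach is to preallocate a single NA-filled matrix over the union of names and fill it row by row with `match()`. This is only a sketch under assumptions: the helper name `bind_named_vectors` is hypothetical, each input is assumed to be a named numeric vector (for a one-row data frame, `unlist()` produces one), and it is not necessarily the method described in the linked post.

```r
# Hypothetical helper: stack named numeric vectors into one matrix by name.
bind_named_vectors <- function(vecs) {
  # Union of all names, in order of first appearance
  all_cols <- unique(unlist(lapply(vecs, names)))
  # Preallocate the full result once, filled with NA
  out <- matrix(NA_real_, nrow = length(vecs), ncol = length(all_cols),
                dimnames = list(NULL, all_cols))
  for (i in seq_along(vecs)) {
    # match() gives the target column position for each name in this vector
    out[i, match(names(vecs[[i]]), all_cols)] <- vecs[[i]]
  }
  out
}

v1 <- c(AA = 1, DD = 4, CD = 5)
v2 <- c(AA = 4, UF = 2, DD = 3)
m <- bind_named_vectors(list(v1, v2))
print(m)
```

Preallocating avoids growing the result one row at a time. Note that a dense double matrix of 100 rows by 10^8 columns needs roughly 80 GB, so at the size described in the question memory is likely to be the limiting factor.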