I'm trying to merge 80+ data frames in R. Combined, they have about 4 million rows and 413 columns.
Is this feasible on a 12 core 16GB machine?
In the following code I attempt to merge the data frames with an inner join. When I run it, RStudio's memory usage grows past 16GB and the session crashes. Is there a more efficient way to merge? I'm sure this Reduce approach is not great.
Note: dbdataset is a package that wraps this database and exposes its roughly 80 tables as data frames. To get the data frame names I used vcdExtra::datasets, which lists the variable names of the data frames in the package as strings.
The quickest/easiest way I could think of to turn those name strings back into objects was to send them through parse(text = ...) -> eval(), shown below. But this crashes RStudio.
# devtools::install_github("MohammedFCIS/dbdataset")
library(dbdataset)
library(vcdExtra)
library(magrittr)
data <- vcdExtra::datasets("dbdataset")
df <- data$Item %>%
  # turn each name string back into the data frame object it refers to
  lapply(function(i) i %>% parse(text = .) %>% eval()) %>%
  # pairwise inner join across all 80+ tables
  Reduce(function(x, y) merge(x, y), .)
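(As an aside, for a bare object name eval(parse(text = name)) is the same lookup as get(name), so the fetching step can be written more simply; this is only a readability tweak and does not change the memory behaviour.)
# fetch each data frame by its name string; relies on the same lazy-loaded
# package objects the parse()/eval() version uses
dfs <- lapply(data$Item, get)

# same pairwise inner join as above
df <- Reduce(merge, dfs)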
Counting columns and rows:
library(stringr)
data <- vcdExtra::datasets("dbdataset")
data$dim %>% str_split("x") %>% lapply(function(x) x[[2]]) %>% unlist %>% as.numeric %>% sum
# 413 total columns
data$dim %>% str_split("x") %>% lapply(function(x) x[[1]]) %>% unlist %>% as.numeric %>% sum
# 4341477 rows
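For context on the 16GB question, a very rough back-of-envelope estimate (crudely assuming 8 bytes per cell and that the merged result keeps roughly the combined row count):
# one flat 4,341,477 x 413 table of 8-byte values, before counting any
# join intermediates or the copies merge()/full_join() make along the way
4341477 * 413 * 8 / 1024^3
# ~13.4 GiB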
Edit: Thanks to M.Viking for pointing out that some of the tables are linked by different ids. For now I've updated my code with some of the suggestions below, but it still crashes, even though this version only joins 9 tables. I also changed to a full join.
# flag the tables that actually contain the drugbank-id key
subset <- data$Item %>%
  lapply(function(i) i %>% parse(text = .) %>% eval()) %>%
  lapply(function(i) 'drugbank-id' %in% names(i)) %>%
  unlist()

# full-join only those tables on drugbank-id
X <- data$Item[subset] %>%
  lapply(function(i) i %>% parse(text = .) %>% eval()) %>%
  purrr::reduce(function(x, y) dplyr::full_join(x, y, by = "drugbank-id"))
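For reference, a minimal sketch of the same keyed full join using data.table, which I understand tends to be more memory-frugal for large merges (untested at this scale; it assumes every table selected by subset really does carry the drugbank-id column, and that the tables can be fetched by name with get() as above):
library(data.table)

# convert the selected tables once, up front
tabs <- lapply(data$Item[subset], function(nm) as.data.table(get(nm)))

# pairwise full outer join on the shared key
X <- Reduce(
  function(x, y) merge(x, y, by = "drugbank-id", all = TRUE),
  tabs
)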