
I'm trying to merge 80+ dataframes in R. These data frames combined have about 4 million rows and 413 columns.

Is this feasible on a 12 core 16GB machine?

In the following code, I attempt to merge the data frames with an inner join. When I run it, RStudio's memory usage grows past 16GB and RStudio crashes. Is there a more efficient way to merge? I suspect this Reduce call is not the best approach.

Note: dbdataset is a package that wraps the DrugBank database as approximately 80 tables. To get the data frame names I used vcdExtra::datasets, which lists the names of all the data frames in a package as strings.

The quickest/easiest way I could think of to do this was to pass these names through parse(text = ...) and then eval(), as shown below. But this crashes RStudio.

# devtools::install_github("MohammedFCIS/dbdataset")
library(dbdataset)
library(vcdExtra)
library(magrittr)
data <- vcdExtra::datasets("dbdataset")
df <- data$Item %>%
    # look each dataset up by name (parse the name string, then evaluate it)
    lapply(function(i) i %>% parse(text = .) %>% eval()) %>%
    # merge pairwise; merge() defaults to an inner join on all shared columns
    Reduce(function(x, y) merge(x, y), .)
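
An alternative that avoids the parse/eval round trip is to load the datasets by name with data() and collect them with mget(). This is only a sketch, and it assumes every string in data$Item names a dataset shipped by dbdataset:

# Load the datasets by name into the global environment, then gather them
# into a named list (assumes all names in data$Item are dbdataset datasets).
data(list = data$Item, package = "dbdataset")
dfs <- mget(data$Item, envir = globalenv())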

Counting columns and rows:

library(stringr)
data <- vcdExtra::datasets("dbdataset")
data$dim %>% str_split("x") %>% lapply(function(x) x[[2]]) %>% unlist %>% as.numeric %>% sum
# 413 total columns
data$dim %>% str_split("x") %>% lapply(function(x) x[[1]]) %>% unlist %>% as.numeric %>% sum
# 4341477 rows

Edit: Thanks to M.Viking for pointing out that some of the tables are linked by different ids. For now I have updated my code with some of the suggestions from the comments below, but it still crashes, even though this version only joins 9 tables. I also switched to a full join.


# logical vector: which datasets contain a "drugbank-id" column?
subset <- data$Item %>%
    lapply(function(i) i %>% parse(text = .) %>% eval()) %>%
    lapply(function(i) 'drugbank-id' %in% names(i)) %>%
    unlist()

# full join the tables that share "drugbank-id"
X <- data$Item[subset] %>%
    lapply(function(i) i %>% parse(text = .) %>% eval()) %>%
    purrr::reduce(function(x, y) dplyr::full_join(x, y, by = "drugbank-id"))
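
To work out which other key columns might link the remaining tables, a rough sketch is to tabulate the column names across all of the data frames (this reloads the full list of tables the same way as above):

# Count how often each column name occurs across the tables, to spot
# candidate join keys beyond "drugbank-id".
dfs <- data$Item %>%
    lapply(function(i) i %>% parse(text = .) %>% eval())
key_counts <- sort(table(unlist(lapply(dfs, names))), decreasing = TRUE)
head(key_counts, 10)
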
  • A possible solution is using some sort of SQL database and `dbplyr`. See: https://dbplyr.tidyverse.org/ – csgroen May 17 '21 at 15:52
  • But first: don't use merge, try the `dplyr` join functions and `purrr::reduce`. – csgroen May 17 '21 at 15:54
  • Or, alternatively, switch to `data.table`. – slamballais May 17 '21 at 15:56
  • @slamballais does using `data.table` help with memory usage? – bird May 17 '21 at 15:58
  • 16GB sounds _very_ little to me. I can point to my related [analysis](https://stackoverflow.com/q/55077668/6574038) and wouldn't point to any package at haphazard. If RAM is exhausted, it will probably crash and waste your time. SQL would be worth thinking about, or ask a computer scientist to calculate for you the needed amount of RAM. – jay.sf May 17 '21 at 16:03
  • I looked at the source data, and I do not think this merge approach will work. For example, two randomly selected tables, `AHFS_Codes_Drug` and `PFAMS_Polypeptides_Enzyme_Drug`, do not share a common key. You may need to code a deliberate approach to merging these datasets. – M.Viking May 17 '21 at 16:13
  • @bird It can, see e.g. https://stackoverflow.com/questions/56013991/efficiently-merging-large-data-tables – slamballais May 17 '21 at 16:18
  • @M.Viking You're right. I'm going to have to edit the join. I'll update the question tomorrow. Checking for the key I get only 15 dataframes. But there are other keys in the main Drugs dataframe that should link the others. ```data$Item %>% lapply( function(i) i %>% parse(text = .) %>% eval()) %>% lapply(function(i) 'drugbank_id' %in% names(i) | 'drugbank-id' %in% names(i)) %>% unlist %>% sum``` # = 15 – Frank May 17 '21 at 16:31
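
Following up on csgroen's dbplyr suggestion in the comments, here is a minimal sketch of pushing the keyed tables into an on-disk SQLite file and joining there, so the intermediate results do not have to fit in RAM. The file name and the use of inner_join are assumptions for illustration, not part of the original question, and the dbplyr and RSQLite packages must be installed:

# Sketch only: write each table into an on-disk SQLite database and let
# the database perform the joins; only the final result is pulled into R.
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), "dbdataset.sqlite")  # hypothetical file name

for (nm in data$Item[subset]) {
    dbWriteTable(con, nm, eval(parse(text = nm)), overwrite = TRUE)
}

tbls   <- lapply(data$Item[subset], function(nm) tbl(con, nm))
joined <- purrr::reduce(tbls, inner_join, by = "drugbank-id")
result <- collect(joined)   # materialise only the final joined table

dbDisconnect(con)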
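
And a sketch of the data.table route slamballais mentions: convert each table once, key it on "drugbank-id", and merge pairwise with data.table's merge method (again a full join over only the tables that contain that column):

# Sketch only: keyed data.table merges instead of base merge().
library(data.table)

dts <- lapply(data$Item[subset], function(nm) {
    dt <- as.data.table(eval(parse(text = nm)))
    setkeyv(dt, "drugbank-id")
    dt
})

merged <- Reduce(function(x, y) merge(x, y, by = "drugbank-id", all = TRUE), dts)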
