
I'd prefer to do the following in R, but am open to (easy to learn) other solutions.

I have multiple (let's say 99) tab-delimited files (call them S1.txt through S99.txt) containing tables, all with the exact same format. Each table is ~2,000,000 rows by 5 columns. Here's a toy example:

ID    Chr    Position    DP1    DP2
A1    1       123        1.5    2.0
A2    1       124        1.4    0.3

ID is by definition unique and always in the same order, and Chr and Position are always in the same order. The only columns that differ between input files are DP1 and DP2. I'd like the output table to be "collated" (I think that's the word). Here's an example of the output if there were ONLY 3 sample input files.

ID    Chr    Position    S1.DP1  S1.DP2    S2.DP1    S2.DP2  S3.DP1  S3.DP2
A1    1       123        1.5      2.0       1.2        2.0     1.5     2.1
A2    1       124        1.4      0.3       1.0        0.5     0.5     0.05

Notice that each input file contributes its own DP1 and DP2 column. ALSO, the column names are informative: they tell me which input file each column came from and which data point (DP) it is.

I've found questions for the case where the columns are different (R: merging a lot of data.frames). I'm also aware of merge, although I feel like you end up with strange column names (How to join (merge) data frames (inner, outer, left, right)?).

My other solution has been to initialize a dataframe and then load each file and add the data points, but this would use a loop and be incredibly slow and horrible. So, I need a more elegant solution. Thank you for your help.

Gaius Augustus

3 Answers


I re-read your question and thought of an even better solution.

First of all, I would not load all the .txt files into R at once. If your .txt files are 2e6 x 5 and there are 100 of them, you are likely to run out of RAM before you load them all. I would load them one at a time and iteratively merge them.

library(readr) # use this to load your data; it is much better than the base functions

f <- list.files(path = "path/to/file", pattern = "\\.txt$", full.names = TRUE)

d <- read_delim(f[1], delim = "\t")

idx <- c("ID", "Chr", "Position")

for (i in seq(2, length(f))) {

    d_temp <- read_delim(f[i], delim = "\t")

    d <- merge(d, d_temp, by = idx)

    rm(d_temp) # not necessary, but I like to include it to be explicit
}

Naming d

## expand.grid varies its first argument fastest, so this yields
## S1.DP1, S1.DP2, S2.DP1, ... -- the same order as the merged columns
n <- expand.grid(c("DP1", "DP2"), paste0("S", seq_along(f)))
names(d)[!names(d) %in% idx] <- paste(n[, 2], n[, 1], sep = ".")

Update

Ugh, I missed the obvious: if you truly have 100 .txt files of 2e6 x 5, you are probably not going to be able to use R for this task. I doubt it will be possible to store a 2e6 x 500 data frame in R; at 8 bytes per double, that is roughly 2e6 × 500 × 8 bytes ≈ 8 GB before any merge copies. Even if you are on a server with loads of RAM, computation time will be non-trivial. I think the most important question going forward is what you are trying to do with this data. Once you answer that, you might be able to use your data efficiently.

Jacob H
  • I am getting all of the samples into one file to format them for a program. It requires the files to be in the format I had. I was worried this would be too much for R, but don't know any other way to do it. I had an upstream program create the file per sample, but now need to put them all together for a downstream program. – Gaius Augustus Dec 16 '15 at 18:26
  • Is there another option for this that's relatively straightforward? I don't mind learning new things, but also need to get this done for a project. Thanks. – Gaius Augustus Dec 16 '15 at 18:32
  • 1
    @GaiusAugustus The first thing I would do is try one of the suggested approaches and see if it works. However Unix will be your fastest option. To do this use the `cut` command to remove unique identifiers (i.e. ID, Chr, Position) from S2.txt to S99.txt. Then `paste` together S1.txt through S99.txt. Unix will not only be faster, but it is never memory bound. – Jacob H Dec 16 '15 at 20:33
  • I didn't know about the paste command! Woah, that changes my world. Thank you! I'll follow this advice. – Gaius Augustus Dec 17 '15 at 04:12
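The cut/paste approach from the comments above can be sketched as follows. This is a minimal sketch run on toy inputs in a temporary directory; the file names and values are taken from the question's example, and a real run would loop over S2.txt through S99.txt.

```shell
# Toy inputs standing in for the real S1.txt .. S99.txt
cd "$(mktemp -d)"
printf 'ID\tChr\tPosition\tDP1\tDP2\nA1\t1\t123\t1.5\t2.0\nA2\t1\t124\t1.4\t0.3\n' > S1.txt
printf 'ID\tChr\tPosition\tDP1\tDP2\nA1\t1\t123\t1.2\t2.0\nA2\t1\t124\t1.0\t0.5\n' > S2.txt

# Keep S1.txt whole; strip the shared ID/Chr/Position columns (fields 1-3)
# from every other file, keeping only DP1 and DP2 (fields 4-5).
mkdir -p dp_only
for f in S2.txt; do            # S2.txt ... S99.txt in practice
    cut -f4,5 "$f" > "dp_only/$f"
done

# Glue the columns side by side. This relies on every file having its rows
# in the same order, which the question guarantees, and never loads a whole
# file into memory.
paste S1.txt dp_only/*.txt > collated.txt
```

Note that `paste` keeps the original headers, so the header line will just repeat DP1/DP2; producing the S1.DP1-style names from the question would take one extra step of rewriting the first line.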

I'm going to assume that all the files are stored in a single folder and that you want to load all the files with .txt extensions in that folder.

## List all the files in the current directory that end in .txt
files <- list.files(path = ".", pattern = "\\.txt$")

## Load them into a list called datlist and name each element after the file it came from
datlist <- lapply(files, read.table, sep = "\t", header = TRUE)
names(datlist) <- gsub("\\.txt$", "", files)

However for the purposes of a reproducible example I'm going to manually create a list of data frames like the one you showed.

S1 <- read.table(text = "ID    Chr    Position    DP1    DP2
A1    1       123        1.5    2.0
A2    1       124        1.4    0.3", header = TRUE)

S2 <- read.table(text = "ID    Chr    Position    DP1    DP2
A1    1       123        1.2    2.0
A2    1       124        1.0    0.5", header = TRUE)

S3 <- read.table(text = "ID    Chr    Position    DP1    DP2
A1    1       123        1.5    2.1
A2    1       124        0.5    0.05", header = TRUE)

datlist <- list(S1 = S1, S2 = S2, S3 = S3)

Now load the packages we're going to use

library("dplyr")
library("tidyr")

With a mix of dplyr and tidyr functions we can get the result you want:

## First, combine the list into a single data frame, adding a column to indicate
## which file each row came from
bind_rows(datlist, .id = "file") %>%
  ## Gather this into a longer format with DP1/DP2 as variables
  gather(key = col, value = value, which(!names(.) %in% c("ID", "Chr", "Position", "file"))) %>%
  ## Create a new column that combines the file name and DP1/DP2 -- this will be
  ## the final column names
  unite(newcol, file, col, sep = ".") %>%
  ## Spread the data so that each combination of file and DP1/DP2 is its own
  ## column
  spread(newcol, value)

End result:

## Source: local data frame [2 x 9]

##       ID   Chr Position S1.DP1 S1.DP2 S2.DP1 S2.DP2 S3.DP1 S3.DP2
##   (fctr) (int)    (int)  (dbl)  (dbl)  (dbl)  (dbl)  (dbl)  (dbl)
## 1     A1     1      123    1.5    2.0    1.2    2.0    1.5   2.10
## 2     A2     1      124    1.4    0.3    1.0    0.5    0.5   0.05
Kara Woo

A one-liner with base R:

l = list(S1=S1, S2=S2, S3=S3)

idx = c("ID","Chr","Position")

d <- Reduce(function(x, y) merge(x, y, by = idx), l)

Update

Forgot the variable names. This might be a bit excessive, but it is the best way I can think of to avoid hard-coding the names.

## expand.grid varies its first argument fastest, so this yields
## S1.DP1, S1.DP2, S2.DP1, ... -- the same order as the merged columns
n <- expand.grid(setdiff(names(S1), idx), names(l))
names(d)[!names(d) %in% idx] <- paste(n[, 2], n[, 1], sep = ".")
Jacob H