
I am using a for loop to merge multiple files with another file:

library(data.table)  # for fread

files <- list.files("path", pattern=".TXT", ignore.case=T)

for(i in 1:length(files))
{
  data <- fread(files[i], header=T)

  # Merge
  mydata <- merge(mydata, data, by="ID", all.x=TRUE)

  rm(data)
}

"mydata" looks as follows (simplified):

ID  x1  x2
1   2   8
2   5   5
3   4   4
4   6   5
5   5   8

"data" looks as follows (around 600 files, in total 100GB). Example of 2 (seperate) files. Integrating all in 1 would be impossible (too large):

ID  x3
1   8
2   4

ID  x3
3   4
4   5
5   1

When I run my code I get the following dataset:

ID  x1  x2  x3.x    x3.y
1   2   8   8       NA
2   5   5   4       NA
3   4   4   NA      4
4   6   5   NA      5
5   5   8   NA      1

What I would like to get is:

ID  x1  x2  x3
1   2   8   8
2   5   5   4
3   4   4   4
4   6   5   5
5   5   8   1

IDs are unique (never duplicated across the 600 files).

Any idea on how to achieve this as efficiently as possible would be much appreciated.

research111
  • Are you trying to merge all the text files into a single text file? If the IDs are unique, why not use `rbind` or `cbind` to join them instead of merging? – user5249203 Mar 04 '16 at 15:03
  • I don't think there is a function that does 'merging' the way you want it for your data structure. We might have to write one. Your `data` file does not necessarily always only contain the column 'x3' right? – Vlo Mar 04 '16 at 15:25
  • data is always exactly the same: 1 column "ID", 1 column "x3". I only have more IDs in the separate data files than in mydata – research111 Mar 04 '16 at 15:54

1 Answer


This is better suited as a comment, but I can't comment yet.

Would it not be better to rbind instead of merge? This seems to be what you want to accomplish.

Set the fill argument to TRUE to take care of differing numbers of columns:

asd <- data.table(x1 = c(1, 2), x2 = c(4, 5))
a <- data.table(x2 = 5)
rbind(asd, a, fill = TRUE)

   x1 x2
1:  1  4
2:  2  5
3: NA  5

Do this with data and then merge into mydata by ID.
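In other words, once all the "data" files are bound into one table (call it binded.data, a placeholder name here), a single left merge should give the desired x3 column. A minimal sketch:

mydata <- merge(mydata, binded.data, by = "ID", all.x = TRUE)  # all.x = TRUE keeps every row of mydata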

Update for comment

files <- list.files("path", pattern=".TXT", ignore.case=T)

ff <- function(input){
  data <- fread(input) 
}

a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))

This creates a function to read the files and passes it to lapply, so you get a list containing all your data files, each in its own data frame.

With ldply from plyr, rbind all the data frames into one data frame.

Don't touch mydata yet.

binded.data <- data.table(binded.data, key = "ID")

Depending on what your mydata looks like, you will need a different merge command. See: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
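For example, if mydata fits in memory and is also turned into a data.table keyed on ID, a keyed left join could look like this (a rough sketch, not the only option):

mydata <- data.table(mydata, key = "ID")
# binded.data[mydata] returns one row per row of mydata,
# with x3 filled in where the ID was found and NA otherwise
result <- binded.data[mydata]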

Update 2

files <- list.files("path", pattern=".TXT", ignore.case=T)

ff <- function(input){
  data <- fread(input)
  # This keeps only the rows of 'data' whose ID matches an ID in 'mydata'
  data <- data[ID %in% mydata[, ID]]
}

a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))

Update 3

You can add cat to see which file the function is reading at the moment. That way you can see after which file you run out of memory, which points you towards how many files you can read in one go.

ff <- function(input){
  # This will print the name of the file it is reading now
  cat(input, "\n")
  data <- fread(input)
  # This keeps only the rows of 'data' whose ID matches an ID in 'mydata'
  data <- data[ID %in% mydata[, ID]]
}
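
If even that runs out of memory, something you could try (also suggested in the comments below) is splitting 'files' into smaller chunks and merging each chunk into mydata before reading the next one. A rough sketch, assuming mydata fits in memory and with chunk_size as a value you tune yourself:

library(data.table)

mydata <- as.data.table(mydata)                  # needed for the := syntax below
chunk_size <- 100                                # hypothetical value, tune to your RAM
chunks <- split(files, ceiling(seq_along(files) / chunk_size))

for (chunk in chunks) {
  # read and stack one chunk, keeping only IDs that occur in mydata
  part <- rbindlist(lapply(chunk, fread), fill = TRUE)
  part <- part[ID %in% mydata$ID]

  # merge this chunk's x3 into mydata, then collapse the duplicated x3 columns
  mydata <- merge(mydata, part, by = "ID", all.x = TRUE)
  if ("x3.x" %in% names(mydata)) {
    mydata[, x3 := ifelse(is.na(x3.x), x3.y, x3.x)]
    mydata[, c("x3.x", "x3.y") := NULL]
  }
  rm(part); gc()
}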
Jav
  • Sorry, I'm not sure whether I understand correctly. Do you mean rbind-ing all the individual "data" tables? I tried to rbind mydata and data in the loop, but this doesn't work – research111 Mar 06 '16 at 21:27
  • It works on the example, but it does not work on my data. It is not possible to get 1 data table for all my 'data' files (= 100 GB); I run into memory issues. Would it be possible to merge sequentially (so each "data" with "mydata"), as mydata is substantially smaller? Or to add only observations that match "mydata" to this binded.data? – research111 Mar 07 '16 at 17:26
  • Are you able to create the list, the "a <- lapply(files, ff)" part? Or do you run into memory issues with that one as well? Also see update 2, which adds only the rows whose ID matches an ID of mydata – Jav Mar 07 '16 at 19:57
  • I run into memory issues with "a <- lapply(files, ff)" as well. Also when I use update 2 I run into memory issues there... – research111 Mar 08 '16 at 13:15
  • I have a feeling that even the final merged file will be too big to handle :/ Something you can try for now is dividing 'files' into smaller chunks. So not all 600 files in one go, but maybe 6 runs of 100 files? – Jav Mar 08 '16 at 15:34