
I know that the 'large matrix issue' is a recurring topic here, but I would like to describe my specific problem with large matrices in detail.

Strictly speaking, I want to cbind several large matrices whose file names follow a specific pattern in R. The code below shows my best attempt so far.

First, let's produce files that mimic my real matrices:

# The df1
df1 <- '######## infx infx infx
######## infx infx infx
probeset_id sample1 sample2 sample3
PR01           1       2       0
PR02           -1      2       0
PR03            2      1       1
PR04           1       2       1
PR05           2       0       1'
df1 <- read.table(text=df1, header=T, skip=2)
write.table(df1, "df1.txt", col.names=T, row.names=F, quote=F, sep="\t")

# The df2 
df2 <- '######## infx infx infx
######## infx infx infx
probeset_id sample4 sample5 sample6
PR01           2       2       1
PR02           2      -1       0
PR03            2      1       1
PR04           1       2       1
PR05           0       0       1'
df2 <- read.table(text=df2, header=T, skip=2)
write.table(df2, "df2.txt", col.names=T, row.names=F, quote=F, sep="\t")

# The dfn 
dfn <- '######## infx infx infx
######## infx infx infx
probeset_id samplen1 samplen2 samplen3
PR01           2       -1       1
PR02           1      -1       0
PR03            2      1       1
PR04           1       2       -1
PR05           0       2       1'
dfn <- read.table(text=dfn, header=T, skip=2)
write.table(dfn, "dfn.txt", col.names=T, row.names=F, quote=F, sep="\t")

Then import them into R and write my expected output file:

### Importing and excluding duplicated 'probeset_id' column
calls = list.files(pattern="*.txt")
library(data.table)
calls = lapply(calls, fread, header=T)
mycalls <- as.data.frame(calls)
probenc <- as.data.frame(mycalls[,1])
mycalls <- mycalls[, -grep("probe", colnames(mycalls))]
output <- cbind(probenc, mycalls)
names(output)[1] <- "probeset_id"
write.table(output, "output.txt", col.names=T, row.names=F, quote=F, sep="\t")
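As an aside, the column-dropping step above can be written more compactly with data.table subsetting. A minimal sketch of the same idea, built from small in-memory tables rather than the files above:

```r
library(data.table)

# two small tables standing in for the files read with fread
tabs <- list(
  fread("probeset_id sample1 sample2\nPR01 1 2\nPR02 -1 2"),
  fread("probeset_id sample3 sample4\nPR01 2 0\nPR02 2 1")
)

# drop the duplicated probeset_id column from every table except the first
tabs[-1] <- lapply(tabs[-1], function(x) x[, !"probeset_id"])
output <- do.call(cbind, tabs)
names(output)  # "probeset_id" "sample1" "sample2" "sample3" "sample4"
```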

This is what the output looks like:

> head(output)
  probeset_id sample1 sample2 sample3 sample4 sample5 sample6 samplen1 samplen2 samplen3
1        PR01       1       2       0       2       2       1        2       -1        1
2        PR02      -1       2       0       2      -1       0        1       -1        0
3        PR03       2       1       1       2       1       1        2        1        1
4        PR04       1       2       1       1       2       1        1        2       -1
5        PR05       2       0       1       0       0       1        0        2        1

This code works perfectly for what I want to do; however, with my real data I hit R's well-known memory limitation (more than 30 "df" objects, each around 1.3 GB with ~600k rows by 100 columns).
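One way to sidestep the RAM limit without SQL is to stream the files in row blocks: read a block of rows from each file, cbind the blocks, and append the result to the output file, so no full matrix is ever in memory. A hedged sketch, assuming all files list the probesets in the same row order; the chunk size and the toy input files are illustrative:

```r
# toy inputs standing in for the real files
write.table(data.frame(probeset_id = c("PR01", "PR02"), sample1 = c(1, -1)),
            "df1.txt", row.names = FALSE, quote = FALSE, sep = "\t")
write.table(data.frame(probeset_id = c("PR01", "PR02"), sample2 = c(2, 0)),
            "df2.txt", row.names = FALSE, quote = FALSE, sep = "\t")

files <- c("df1.txt", "df2.txt")
chunk <- 10000L                      # rows per block; tune to your RAM
cons  <- lapply(files, file, open = "r")
# read each header line once; read.table then continues from the connection
hdrs  <- lapply(cons, function(con) strsplit(readLines(con, n = 1), "\t")[[1]])

first <- TRUE
repeat {
  blocks <- lapply(seq_along(cons), function(i)
    tryCatch(read.table(cons[[i]], nrows = chunk, col.names = hdrs[[i]]),
             error = function(e) NULL))       # error signals end of file
  if (any(vapply(blocks, is.null, logical(1)))) break
  blocks[-1] <- lapply(blocks[-1], function(x) x[, -1, drop = FALSE])
  write.table(do.call(cbind, blocks), "output.txt", append = !first,
              col.names = first, row.names = FALSE, quote = FALSE, sep = "\t")
  first <- FALSE
}
invisible(lapply(cons, close))
```

This only ever holds `chunk` rows per file in memory, at the cost of assuming the files are row-aligned.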

I read about a very interesting SQL approach (R: how to rbind two huge data-frames without running out of memory), but I am inexperienced in SQL and have not found a way to adapt it to my case.

Cheers,

user2120870

1 Answer


I had misunderstood the question previously; your comment has now made it clear. What you need is a package like ff, which lets you work with files on your hard disk rather than loading them into RAM. This looks like a solution to your problem, since you mention that your RAM is not enough to load all the files.

First, load the files with read.table.ffdf, and then cbind them together as follows:

#load files in R
library(ff)

df1 <- read.table.ffdf(file="df1.txt", header=TRUE)
df2 <- read.table.ffdf(file="df2.txt", header=TRUE)
dfn <- read.table.ffdf(file="dfn.txt", header=TRUE)
# add skip=2 if your raw files still contain the two '########' comment lines

And then merge like this:

mergedf <- do.call('ffdf', c(physical(df1), physical(df2), physical(dfn)))

Unfortunately, I cannot run your example directly, as read.table.ffdf does not support the text argument, but the above should work. The ff package has its own (not very complex) syntax that you may need to familiarize yourself with, since it works with files off your hard disk. For example, apply operations are done with the ffapply function in much the same way as with apply.
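Since the text argument is not available, the merge pattern itself can at least be checked with small in-memory ffdf objects via as.ffdf (a sketch; the column names are illustrative):

```r
library(ff)

# small disk-backed data frames standing in for the real files
a <- as.ffdf(data.frame(x = 1:3))
b <- as.ffdf(data.frame(y = 4:6))

# combine the physical (on-disk) columns of both objects into one ffdf
merged <- do.call('ffdf', c(physical(a), physical(b)))
dim(merged)  # 3 rows, 2 columns
```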

Have a look here, here and here for some basic tutorials on package ff.

You can also list the functions inside the package with ls("package:ff") and use the built-in help to explore them.

LyzandeR
  • The main problem is that my memory crashes before all my big matrices are read into the R environment. That's why I wrote the code to produce those 3 "df" text files. My real question is how to create a concatenated file (like my output.txt) without running out of memory. – user2120870 Sep 22 '15 at 11:45
  • I have updated the answer with a recommended solution now that I have understood the problem. – LyzandeR Sep 22 '15 at 13:55