
I am using R to cbind ~11,000 files using:

dat <- do.call('bind_cols',lapply(lfiles,read.delim))

This is unbelievably slow. I am using R because my downstream processing (creating plots, etc.) is in R. What are some fast alternatives for concatenating thousands of files by column?

I have three types of files for which I want this done. They look like this:

[centos@ip data]$ head C021_0011_001786_tumor_RNASeq.abundance.tsv
target_id   length  eff_length  est_counts  tpm
ENST00000619216.1   68  26.6432 10.9074 5.69241
ENST00000473358.1   712 525.473 0   0
ENST00000469289.1   535 348.721 0   0
ENST00000607096.1   138 15.8599 0   0
ENST00000417324.1   1187    1000.44 0.0673096   0.000935515
ENST00000461467.1   590 403.565 3.22654 0.11117
ENST00000335137.3   918 731.448 0   0
ENST00000466430.5   2748    2561.44 162.535 0.882322
ENST00000495576.1   1319    1132.44 0   0

[centos@ip data]$ head C021_0011_001786_tumor_RNASeq.rsem.genes.norm_counts.hugo.tab
gene_id C021_0011_001786_tumor_RNASeq
TSPAN6  1979.7185
TNMD    1.321
DPM1    1878.8831
SCYL3   452.0372
C1orf112    203.6125
FGR 494.049
CFH 509.8964
FUCA2   1821.6096
GCLC    1557.4431

[centos@ip data]$ head CPBT_0009_1_tumor_RNASeq.rsem.genes.norm_counts.tab
gene_id CPBT_0009_1_tumor_RNASeq
ENSG00000000003.14  2005.0934
ENSG00000000005.5   5.0934
ENSG00000000419.12  1100.1698
ENSG00000000457.13  2376.9100
ENSG00000000460.16  1536.5025
ENSG00000000938.12  443.1239
ENSG00000000971.15  1186.5365
ENSG00000001036.13  1091.6808
ENSG00000001084.10  1602.7165

Thanks!

Sunny Patel
Komal Rathi
  • Try profiling your code. My bet is that `read.delim` is your bottleneck. Try `data.table::fread` instead. Also search for fast ways of reading data sets into R – David Arenburg Aug 08 '16 at 18:17
  • http://stackoverflow.com/questions/19697700/how-to-speed-up-rbind – abhiieor Aug 08 '16 at 18:17
  • Try reading with `fread` from `data.table` i.e. `lapply(lfiles, fread)` and instead of having many columns, it may be better to `rbind` with a grouping variable, i.e. `rbindlist(lapply(lfiles, fread), idcol=TRUE)` (though the post contains very little info to give any kind of solution) – akrun Aug 08 '16 at 18:18
  • are you looking to `cbind` or `rbind`? `rbindlist` from `data.table` is very fast – Chris Aug 08 '16 at 18:19
  • @Chris am trying column binding - `cbind`. – Komal Rathi Aug 08 '16 at 18:22
  • @akrun there are different types of files that I am trying to concatenate, I was just trying to figure out the best way to concatenate large number of files by columns. – Komal Rathi Aug 08 '16 at 18:32
  • @akrun I would like to accept your comment as answer - can you move it to an answer? – Komal Rathi Aug 09 '16 at 13:44
  • @KomalRathi Thanks, posted that as an answer. – akrun Aug 09 '16 at 17:48
  • I wonder whether it's worth looking into the Unix [paste function](http://stackoverflow.com/questions/16910057/how-to-paste-columns-from-separate-files-using-bash) ...? – Ben Bolker Aug 11 '16 at 14:54

2 Answers


For fast reading of files, we can use fread from data.table, then row-bind the resulting list of data.tables with rbindlist, specifying idcol=TRUE to add a grouping variable that identifies each dataset. Note that this stacks the files by row (long format) rather than binding them by column.

library(data.table)
DT <- rbindlist(lapply(lfiles, fread), idcol=TRUE)
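
If a wide table with one column per sample is still needed downstream, the long result can be reshaped afterwards. The following is only a sketch under assumptions: the toy files, the sample names, and the generic "value" column name are made up for illustration, and the layout mimics the two-column norm_counts files from the question.

```r
library(data.table)

# Hypothetical stand-ins for the per-sample files in the question
tmp <- tempfile("samples_")
dir.create(tmp)
for (s in c("sampleA", "sampleB", "sampleC")) {
  toy <- data.table(gene_id = c("TSPAN6", "TNMD", "DPM1"),
                    value   = c(1.5, 2.5, 3.5))
  setnames(toy, "value", s)  # header mimics the question: gene_id <sample name>
  fwrite(toy, file.path(tmp, paste0(s, ".tab")), sep = "\t")
}

lfiles <- list.files(tmp, pattern = "\\.tab$", full.names = TRUE)
names(lfiles) <- sub("\\.tab$", "", basename(lfiles))

# Give every file the same column names so the tables stack cleanly;
# because the list is named, idcol is filled with the sample names
long <- rbindlist(
  lapply(lfiles, fread, col.names = c("gene_id", "value")),
  idcol = "sample"
)

# Reshape to one row per gene and one column per sample
wide <- dcast(long, gene_id ~ sample, value.var = "value")
```

Unlike a plain column bind, the reshape matches rows by gene_id, so files whose rows are in a different order are still aligned correctly.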
akrun

If you have all numerical data, you can convert to matrix first, which can be quite a bit faster than data frames:

> microbenchmark(
do.call(cbind, rep(list(sleep), 1000)),
do.call(cbind, rep(list(as.matrix(sleep)), 1000))
)
Unit: microseconds
                                              expr      min       lq       mean
            do.call(cbind, rep(list(sleep), 1000)) 6978.635 7496.690 8038.21531
 do.call(cbind, rep(list(as.matrix(sleep)), 1000))  636.282  722.814  862.01125
   median        uq       max neval
 7864.180 8397.8595 12213.473   100
  744.647  793.0695  7416.430   100
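
Applied to files like the norm_counts ones in the question, the matrix idea might look like the sketch below. The toy files and the read_values helper are hypothetical, and the approach assumes every file lists the same ids in the same order, which the helper checks.

```r
library(data.table)

# Hypothetical toy files: an id column plus one numeric column per sample
tmp <- tempfile("norm_counts_")
dir.create(tmp)
genes <- c("TSPAN6", "TNMD", "DPM1")
for (i in 1:3) {
  s <- paste0("sample", i)
  toy <- data.table(gene_id = genes, x = c(10, 20, 30) * i)
  setnames(toy, "x", s)
  fwrite(toy, file.path(tmp, paste0(s, ".tab")), sep = "\t")
}
lfiles <- list.files(tmp, pattern = "\\.tab$", full.names = TRUE)

# Read the ids once, from the first file only
ids <- fread(lfiles[1], select = 1L)[[1]]

# Hypothetical helper: return the numeric column of one file
read_values <- function(f) {
  d <- fread(f)
  # Binding by column is only valid if the rows line up across files
  stopifnot(identical(d[[1]], ids))
  as.numeric(d[[2]])  # fread may return integers; vapply below expects doubles
}

# vapply fills a numeric matrix directly: one column per file
mat <- vapply(lfiles, read_values, numeric(length(ids)))
dimnames(mat) <- list(ids, sub("\\.tab$", "", basename(lfiles)))
```

A numeric matrix avoids the per-column overhead of a data frame, and as.data.frame(mat) can still be called once at the end if a data frame is needed for plotting.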

Alternatively, if you want a data frame, you can cheat by using unlist and then setting the class manually:

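# unlist() drops row names; restore them or the result reports zero rows
df <- unlist(rep(list(sleep), 1000), recursive=FALSE)
attr(df, 'row.names') <- .set_row_names(nrow(sleep))
class(df) <- 'data.frame'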
Neal Fultz