
I have a script that calculates copy number variation and saves the data into an existing file named "resultCNV.txt", matching rows on the first column. Here is my script:

setwd("./Data")
library(GenomicRanges)
library(dplyr)
library("scales")
require(tidyverse)
# Create annotation/reference table
genes <- read.table("./Basefile/genes.txt", sep="\t", stringsAsFactors=FALSE, header=TRUE)
genes$chromosome_name <- gsub('X', '23', genes$chromosome_name)
genes$chromosome_name <- gsub('Y', '24', genes$chromosome_name)
colnames(genes) <- c("GeneSymbol","Chr","Start","End")
genes_GR <- makeGRangesFromDataFrame(genes,keep.extra.columns = TRUE)
# Files to be analyzed (3 steps: preprocessing, comparison with the reference/annotation, and post-processing)
for(i in 1:36){
  df<- read.table(paste0("BRCA", i, ".txt"), sep="\t", stringsAsFactors=FALSE, header=TRUE)
  df$Chromosome <- gsub('X', '23', df$Chromosome)
  df$Chromosome <- gsub('Y', '24', df$Chromosome)
  colnames(df) <- c("Barcode", "Chr", "Start", "End", "extra1", "extra2")
  cnv <-  makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
  hits <- findOverlaps(genes_GR, cnv, type="within")
  df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
  df_ann <- unique(df_ann)
  df_ann <- df_ann[ , c("GeneSymbol", "Chr", "extra2")]
  colnames(df_ann) <- c("Ensembl_ID","Chr","Seg_value")
  df_ann$Seg_value2 <- abs(df_ann$Seg_value)
  df_ann$Seg_value2 <- 2^df_ann$Seg_value2
  df_ann$Seg_value2 <- df_ann[, 4] - 1
  df_ann$Seg_value2 <- df_ann[, 4] * 2
  df_ann$Seg_value2 <- with(df_ann, sign(Seg_value) * Seg_value2)
  df_ann <- df_ann[ , c("Ensembl_ID", "Seg_value")]
  df_ann$Seg_value <- rescale(df_ann$Seg_value, to = c(-1, 1))
  df_ann1 <- read.table("/Basefile/genesforcomp.txt", sep="\t", stringsAsFactors=FALSE, header=TRUE)
  df <- rbind.data.frame(df_ann, df_ann1)
  df <- df[!duplicated(df$Ensembl_ID),]
  # saving the results into the existing file, matching on first-column values
  df1 <- read.delim("resultCNV.txt", check.names=FALSE, stringsAsFactors=FALSE)
  lst <- list(data.frame(df1), data.frame(df))
  df2 <- reduce(lst, full_join, by = "Ensembl_ID") %>% replace(., is.na(.), 0);
  write.table(df2, file="resultCNV.txt", quote = FALSE, sep = "\t", row.names = FALSE)
}

Here is my data for testing: Link. It has two folders: "Basefile" (files read once) and "Data" (the files to analyze).

In the last 4 lines, I use the full_join function from tidyverse to add the newly analyzed column into the last saved output, matching on the first column (Ensembl_ID). I run ~200 files at a time, which takes almost 2 hours, while running 100 files takes just 30 minutes, so time grows much faster than linearly with the number of loop iterations. The output file starts around 900 KB and grows with each cycle, e.g. to 5 MB, then 11 MB, and so on.
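For illustration, here is a toy version of that merge step (hypothetical gene IDs and values), showing how each pass adds a Seg_value column:

library(dplyr)

# previously saved output and one newly analyzed file (made-up values)
saved  <- data.frame(Ensembl_ID = c("g1", "g2"), Seg_value = c(0.5, -0.2))
newcol <- data.frame(Ensembl_ID = c("g2", "g3"), Seg_value = c(0.9, 0.1))

# full_join keeps all IDs from both tables; unmatched entries become NA, then 0
merged <- full_join(saved, newcol, by = "Ensembl_ID") %>% replace(., is.na(.), 0)
merged
#   Ensembl_ID Seg_value.x Seg_value.y
# 1         g1         0.5         0.0
# 2         g2        -0.2         0.9
# 3         g3         0.0         0.1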

Is it possible to reduce the time, i.e., avoid re-reading the last saved output on every iteration and just merge the new column in based on the first column? Any suggestions or ideas on how to restructure the loop would be appreciated.

Thanks in advance!

SUMIT
  • Could you make your example [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Aurèle Jun 01 '21 at 14:35
  • The loop overwrites the output file in each iteration. – Roland Jun 01 '21 at 14:36
  • There are lots of little things you can do - switching to `data.table` would speed up all your operations. Using the `readr` package for reading and writing instead of `read.table` and `write.table` would speed up those operations. Probably there is a way to keep things in memory and skip the intermediate read/write step as you suggest - but it's very hard to help you make those changes without a reproducible example to test on. – Gregor Thomas Jun 01 '21 at 14:38
  • For reading and writing tables, instead of `read.table` and `write.table`, you can switch to `data.table::fread` and `data.table::fwrite` – Jonas Jun 01 '21 at 14:39
  • @GregorThomas Actually it is not necessary to overwrite the file each time; I would only like to save the single-column output into the last saved file based on the first-column IDs. – SUMIT Jun 01 '21 at 14:50
  • @Jonas and @GregorThomas I will check if I can do it. Thanks to both of you. Have a good day – SUMIT Jun 01 '21 at 14:52
  • Probable easy gain: `df_ann1 <- read.table("/home/sumit/Academic/DHR/TCGA/Gene List/Final1/genesbase.txt", sep="\t", stringsAsFactors=FALSE, header=TRUE)` is now performed >4000 times; just execute it once, outside the loop. – Wimpel Jun 01 '21 at 14:55
  • @GregorThomas I updated the question and also added the data file. Please look into it. – SUMIT Jun 01 '21 at 15:27
  • @Aurèle I updated the question and also added the data file. Please look into it. – SUMIT Jun 01 '21 at 15:27
  • @Jonas Yes, data.table does the job. It took 27 sec vs 145 sec with read.table/write.table when running 36 files. Thank you – SUMIT Jun 01 '21 at 15:50
  • @GregorThomas `data.table::fread` and `fwrite` (27 sec) are better than readr (93 sec) or base `read.delim` (145 sec), saving at least 80% of the time. I think it is acceptable. Thanks for your help and insight. – SUMIT Jun 01 '21 at 16:11
  • @GregorThomas I see. But that is one of the main performance issues: file read/write operations are slow, and they are absolutely not necessary here (except perhaps writing to the file once at the end). – Roland Jun 02 '21 at 07:28
  • @Roland Actually data.table::fread and fwrite are good, as it is not overwriting the file but just merging the data based on the first column. It takes the larger file as a base and just pastes the small file in with matching rows. Now there is no more performance issue. Thanks for your help. – SUMIT Jun 02 '21 at 08:32

2 Answers


When I think my loops are too slow, I use an apply method instead. In your case it would be something like this:

e = function(i){
  df<- read.table(paste0("BRCA", i, ".txt"), sep="\t", stringsAsFactors=FALSE, header=TRUE)
  df$Chromosome <- gsub('X', '23', df$Chromosome)
  df$Chromosome <- gsub('Y', '24', df$Chromosome)
  colnames(df) <- c("Barcode", "Chr", "Start", "End", "extra1", "extra2")
  cnv <-  makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
  hits <- findOverlaps(genes_GR, cnv, type="within")
  df_ann <- cbind(df[subjectHits(hits),],genes[queryHits(hits),])
  df_ann <- unique(df_ann)
  df_ann <- df_ann[ , c("GeneSymbol", "Chr", "extra2")]
  colnames(df_ann) <- c("Ensembl_ID","Chr","Seg_value")
  df_ann$Seg_value2 <- abs(df_ann$Seg_value)
  df_ann$Seg_value2 <- 2^df_ann$Seg_value2
  df_ann$Seg_value2 <- df_ann[, 4] - 1
  df_ann$Seg_value2 <- df_ann[, 4] * 2
  df_ann$Seg_value2 <- with(df_ann, sign(Seg_value) * Seg_value2)
  df_ann <- df_ann[ , c("Ensembl_ID", "Seg_value")]
  df_ann$Seg_value <- rescale(df_ann$Seg_value, to = c(-1, 1))
  df_ann1 <- read.table("/home/sumit/Academic/DHR/TCGA/Gene List/Final1/genesbase.txt", sep="\t", stringsAsFactors=FALSE, header=TRUE)
  df <- rbind.data.frame(df_ann, df_ann1)
  df <- df[!duplicated(df$Ensembl_ID),]
  # saving the results into the existing file, matching on first-column values
  df1 <- read.delim("genesforcomp1", check.names=FALSE, stringsAsFactors=FALSE)
  lst <- list(data.frame(df1), data.frame(df))
  df2 <- reduce(lst, full_join, by = "Ensembl_ID") %>% replace(., is.na(.), 0);
  write.table(df2, file="genesforcomp1", quote = F, sep = "\t", row.names = F)
}

lapply(1:4376, e)

In many of my analyses this has saved a lot of time; I hope it works as well with yours.

As a little bonus, to get a progress bar and a time estimate for the lapply call, you can use pblapply() from the pbapply package instead, for example:
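# drop-in replacement for lapply() that prints a progress bar and an ETA
library(pbapply)
pblapply(1:4376, e)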

I hope this helps you.

elielink
  • I am testing it and will report back. Thanks – SUMIT Jun 01 '21 at 14:51
  • The `apply` family won't be any faster in this case... sometimes it's faster at creating an output structure, but there is no output structure here. What's slow is the code inside the loop, not whether the looping mechanism is `for` or `lapply`. – Gregor Thomas Jun 01 '21 at 15:02
  • @GregorThomas Exactly; in fact it increased the time from 47 sec to 53/50 sec when I ran 16 files. – SUMIT Jun 01 '21 at 15:28
  • @elielink Some other users suggested approaches; I will check whether they work. Anyway, thanks – SUMIT Jun 01 '21 at 15:30

The data.table functions fread and fwrite are good in this context; they reduce the run time by ~80%. The overall performance of data.table::fread / fwrite (27 sec) is better than readr (93 sec) or read.delim (145 sec). I think that is acceptable.
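For reference, a minimal sketch of the swap, using the file names from the question (only the read and write calls change; data.table = FALSE keeps plain data frames so the rest of the loop body stays untouched):

library(data.table)

# inside the loop: replace read.delim() with the much faster fread()
df1 <- fread("resultCNV.txt", sep = "\t", data.table = FALSE)

# ... merge as before ...

# and replace write.table() with fwrite()
fwrite(df2, file = "resultCNV.txt", sep = "\t", quote = FALSE)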

SUMIT
  • Your interpretation is incorrect - when you use `read.table` or `read.delim` or `fread`, you read the entire file into R. And when you use `write.table` or `fwrite` you overwrite the entire file. It's just that `fread` and `fwrite` are **much** faster at reading and writing. You could still speed up your code much more by following Roland's advice in the comments to refactor the code to not read/write inside the loop at all (see the sketch after these comments). – Gregor Thomas Jun 02 '21 at 13:12
  • @GregorThomas Thanks for correcting me; I am new, so I don't have much of an idea about this. I will follow your advice and let you know, and I will also modify my answer accordingly. Thanks for giving your precious time. – SUMIT Jun 02 '21 at 16:02
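For reference, a minimal, untested sketch of the refactor suggested in the comments: read the static files once, keep the accumulating result in memory, and write a single file at the end. File names are taken from the question; the per-file column rename is a hypothetical addition to avoid duplicated Seg_value names.

library(data.table)
library(dplyr)
library(purrr)

# static inputs: read once, outside the loop
df_ann1 <- fread("./Basefile/genesforcomp.txt", sep = "\t", data.table = FALSE)
result  <- fread("resultCNV.txt", sep = "\t", data.table = FALSE)

# process each file into a two-column data frame, kept in memory
per_file <- lapply(1:36, function(i) {
  df <- fread(paste0("BRCA", i, ".txt"), sep = "\t", data.table = FALSE)
  # ... same per-file processing as in the question, ending in a data frame
  # with columns Ensembl_ID and Seg_value ...
  names(df)[2] <- paste0("Seg_value_", i)  # hypothetical: unique name per file
  df
})

# one cumulative merge and a single write, instead of a read+write per iteration
result <- reduce(c(list(result), per_file), full_join, by = "Ensembl_ID")
result <- replace(result, is.na(result), 0)
fwrite(result, file = "resultCNV.txt", sep = "\t", quote = FALSE)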