
I have written code that works fine. However, for practical reasons (and because I also want to learn more), it would be ideal to have a shorter way of doing things. Here is an example of a text file I am reading:

Analysis Date: Tue Oct 16 09:39:06 EDT 2018
Input file(s): 012-915-8-rep1.fastq
Output file(s): 012-915-8-rep1.vdjca
Version: 2.1.12; built=Wed Aug 22 08:47:36 EDT 2018; rev=99f9cc0; lib=repseqio.v1.5
Command line arguments: align -c IGH -r report012-915-8-rep1.txt 012-915-8-rep1.fastq 012-915-8-rep1.vdjca
Analysis time: 45.45s
Total sequencing reads: 198274
Successfully aligned reads: 167824 (84.64%)
Alignment failed, no hits (not TCR/IG?): 12122 (6.11%)
Alignment failed because of absence of J hits: 18235 (9.2%)
Alignment failed because of low total score: 93 (0.05%)
Overlapped: 0 (0%)
Overlapped and aligned: 0 (0%)
Alignment-aided overlaps: 0 (?%)
Overlapped and not aligned: 0 (0%)
IGH chains: 167824 (100%)
======================================
Analysis Date: Tue Oct 16 09:39:52 EDT 2018
Input file(s): 012-915-8-rep1.vdjca
Output file(s): 012-915-8-rep1.clns
Version: 2.1.12; built=Wed Aug 22 08:47:36 EDT 2018; rev=99f9cc0; lib=repseqio.v1.5
Command line arguments: assemble -OaddReadsCountOnClustering=true -r report012-915-8-rep1.txt 012-915-8-rep1.vdjca 012-915-8-rep1.clns
Analysis time: 7.50s
Final clonotype count: 1227
Average number of reads per clonotype: 124.77
Reads used in clonotypes, percent of total: 153096 (77.21%)
Reads used in clonotypes before clustering, percent of total: 153096 (77.21%)
Number of reads used as a core, percent of used: 113699 (74.27%)
Mapped low quality reads, percent of used: 39397 (25.73%)
Reads clustered in PCR error correction, percent of used: 14522 (9.49%)
Reads pre-clustered due to the similar VJC-lists, percent of used: 0 (0%)
Reads dropped due to the lack of a clone sequence: 8958 (4.52%)
Reads dropped due to low quality: 0 (0%)
Reads dropped due to failed mapping: 5770 (2.91%)
Reads dropped with low quality clones: 0 (0%)
Clonotypes eliminated by PCR error correction: 5550
Clonotypes dropped as low quality: 0
Clonotypes pre-clustered due to the similar VJC-lists: 0
======================================

I basically want just lines 7, 8 and 26, which are "Total sequencing reads", "Successfully aligned reads", and "Reads used in clonotypes, percent of total". Everything else can be eliminated. My code to do this, for several text files, is as follows:

> # Put in your actual path where the text files are saved 
> mypath = "C:/Users/ME/Desktop/REPORTS/text files/" 
> setwd(mypath)
> #############################################################
> #Functional Code
> #Establish the dataframe 
> data <- data.frame("Total seq Reads"=integer(), "Successful Reads"=integer(), "Clonotypes"=integer())
> 
> # This should be a loop, I think; the same action repeats, I just don't know how to write it
> 
> wow <- readLines("C:/Users/ME/Desktop/REPORTS/text files/report012-915-8-rep1.txt") 
> woah <- wow[-c(1:6,9:25,27:39)] 
> blah <- as.numeric(gsub("\\D", "", gsub("\\(.*\\)", "", woah)))
> data[nrow(data)+1,] <- blah
> 
> wow <- readLines("C:/Users/ME/Desktop/REPORTS/text files/report012-915-8-rep2.txt") 
> woah <- wow[-c(1:6,9:25,27:39)] 
> blah <- as.numeric(gsub("\\D", "", gsub("\\(.*\\)", "", woah)))
> data[nrow(data)+1,] <- blah
>
> row.names(data) <- c("012-915-8-rep1","012-915-8-rep2")
>
># Write CSV in R
> write.csv(data, file = "Report_Summary.csv")

Is there a more efficient way of doing this? I only placed 2 files as examples here, but in reality I am working with around 20-80 files, which means I would have to repeat this block manually for each one. Any help would be appreciated! Thank you!

Lasarus9

1 Answer


You can make it a function and loop it over your files. One thing you should be aware of is growing vectors/data.frames row by row, like with this data[nrow(data)+1,] <- blah. It's generally inefficient because R reallocates the object on every iteration, so either start with an object of the desired size and write output into it, or build the pieces first and bind/reshape them at the end. For a small number of rows you may not notice it, but you will the more rows you have. If interested, read up on vectorization.
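
For illustration, here is a minimal sketch of the difference, using a hypothetical get_three_numbers() stand-in (not your actual parsing code) and hypothetical file names:

# Hypothetical stand-in for the per-file extraction, purely for illustration
get_three_numbers <- function(f) c(1, 2, 3)
files <- c("rep1.txt", "rep2.txt")   # hypothetical file names

# Growing the data frame one row at a time (what your script does):
slow <- data.frame(a = integer(), b = integer(), c = integer())
for (f in files) slow[nrow(slow) + 1, ] <- get_three_numbers(f)

# Building all rows first and binding once usually scales much better:
fast <- as.data.frame(do.call(rbind, lapply(files, get_three_numbers)))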

textfunction <- function(x) {
  wow <- readLines(x)
  woah <- wow[c(9:10, 29)]  # I think these are the lines you are referencing
  # Drop the "(84.64%)"-style percentages first, then strip everything but the digits
  blah <- as.numeric(gsub("\\D", "", gsub("\\(.*\\)", "", woah)))
  blah  # return the three counts explicitly
}
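
If the line positions ever shift between reports, a variant that greps for the labels instead of hard-coding the indices should be a bit more robust. This is just a sketch based on the label text in your sample; I haven't run it against your actual files:

textfunction2 <- function(x) {
  wow <- readLines(x)
  # Label text copied from the sample report in the question
  labels <- c("Total sequencing reads:",
              "Successfully aligned reads:",
              "Reads used in clonotypes, percent of total:")
  # Take the first line matching each label, then extract the count as before
  woah <- sapply(labels, function(lbl) grep(lbl, wow, fixed = TRUE, value = TRUE)[1])
  as.numeric(gsub("\\D", "", gsub("\\(.*\\)", "", woah)))
}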

Then get your directory, get the filenames, apply your function, and transpose/rename.

library(data.table)
dir <- "C:/Users/ME/Documents/"
# list.files() expects a regular expression, so match files ending in .txt
filenames <- list.files(path = dir, pattern = "\\.txt$", full.names = FALSE)
# Read each report from dir, but keep the bare filenames for the row names
textreads <- lapply(file.path(dir, filenames), textfunction)
data <- as.data.frame(data.table::transpose(textreads),
                      col.names = c("Total seq Reads", "Successful Reads", "Clonotypes"),
                      row.names = filenames)

data
          Total.seq.Reads Successful.Reads Clonotypes
text1.txt          198274           167824     153096
text2.txt          198274           167824     153096
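
If you still want the CSV from the end of your script, the same write.csv() call works on this data frame:

# Same final step as in your script; writes to the current working directory
write.csv(data, file = "Report_Summary.csv")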
Anonymous coward
  • Thank you so much! I will try this out with some more files! Thank you! – Lasarus9 Nov 05 '18 at 21:43
  • You're welcome. Don't forget on StackOverflow to [upvote any answers that are useful and accept answers that provide the best solution to your problem](https://stackoverflow.com/help/someone-answers). And in future questions [reproducible examples](https://stackoverflow.com/a/5963610/2359523) are helpful to others. – Anonymous coward Nov 09 '18 at 16:32