I have written a code and it works fine. However, for practical reasons (and I also want to learn more) it is ideal if I have a shorter way of doing things. Here is an example of a text file I am reading:
Analysis Date: Tue Oct 16 09:39:06 EDT 2018
Input file(s): 012-915-8-rep1.fastq
Output file(s): 012-915-8-rep1.vdjca
Version: 2.1.12; built=Wed Aug 22 08:47:36 EDT 2018; rev=99f9cc0;
lib=repseqio.v1.5
Command line arguments: align -c IGH -r report012-915-8-rep1.txt 012-915-8-
rep1.fastq 012-915-8-rep1.vdjca
Analysis time: 45.45s
Total sequencing reads: 198274
Successfully aligned reads: 167824 (84.64%)
Alignment failed, no hits (not TCR/IG?): 12122 (6.11%)
Alignment failed because of absence of J hits: 18235 (9.2%)
Alignment failed because of low total score: 93 (0.05%)
Overlapped: 0 (0%)
Overlapped and aligned: 0 (0%)
Alignment-aided overlaps: 0 (?%)
Overlapped and not aligned: 0 (0%)
IGH chains: 167824 (100%)
======================================
Analysis Date: Tue Oct 16 09:39:52 EDT 2018
Input file(s): 012-915-8-rep1.vdjca
Output file(s): 012-915-8-rep1.clns
Version: 2.1.12; built=Wed Aug 22 08:47:36 EDT 2018; rev=99f9cc0; lib=repseqio.v1.5
Command line arguments: assemble -OaddReadsCountOnClustering=true -r
report012-915-8-rep1.txt 012-915-8-rep1.vdjca 012-915-8-rep1.clns
Analysis time: 7.50s
Final clonotype count: 1227
Average number of reads per clonotype: 124.77
Reads used in clonotypes, percent of total: 153096 (77.21%)
Reads used in clonotypes before clustering, percent of total: 153096 (77.21%)
Number of reads used as a core, percent of used: 113699 (74.27%)
Mapped low quality reads, percent of used: 39397 (25.73%)
Reads clustered in PCR error correction, percent of used: 14522 (9.49%)
Reads pre-clustered due to the similar VJC-lists, percent of used: 0 (0%)
Reads dropped due to the lack of a clone sequence: 8958 (4.52%)
Reads dropped due to low quality: 0 (0%)
Reads dropped due to failed mapping: 5770 (2.91%)
Reads dropped with low quality clones: 0 (0%)
Clonotypes eliminated by PCR error correction: 5550
Clonotypes dropped as low quality: 0
Clonotypes pre-clustered due to the similar VJC-lists: 0
======================================
I basically want just line 7,8 and 26, which are: "Total Sequencing Reads", "Successfully Aligned Reads", and "Reads used in clonotypes, percent of total". Everything else can be eliminated. My code to do this, for several text files is as such:
> # Put in your actual path where the text files are saved
> mypath = "C:/Users/ME/Desktop/REPORTS/text files/"
> setwd(mypath)
> #############################################################
> #Functional Code
> #Establish the dataframe
> data <- data.frame("Total seq Reads"=integer(), "Successful Reads"=integer(), "Clonotypes"=integer())
>
> #this should be a loop, I think, same action repeats, I just dont know how to format
>
> wow <- readLines("C:/Users/ME/Desktop/REPORTS/text files/report012-915-8-rep1.txt")
> woah <- wow[-c(1:6,9:25,27:39)]
> blah <- as.numeric(gsub("\\D", "", gsub("\\(.*\\)", "", woah)))
> data[nrow(data)+1,] <- blah
>
> wow <- readLines("C:/Users/ME/Desktop/REPORTS/text files/report012-915-8-rep2.txt")
> woah <- wow[-c(1:6,9:25,27:39)]
> blah <- as.numeric(gsub("\\D", "", gsub("\\(.*\\)", "", woah)))
> data[nrow(data)+1,] <- blah
>
> row.names(data) <- c("012-915-8-rep1","012-915-8-rep2")
>
># Write CSV in R
> write.csv(data, file = "Report_Summary.csv")
Is there a more efficient way of doing this? I only placed 2 files as examples here, but in reality I am utilizing around 20-80 files, which means that this process I would have to do manually. Any help would be appreciated! Thank you!