Is there a way to produce a single output file in which each column holds the values computed from a different input file? My input (FASTQ) looks like this:
@SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
AGTAAAGGGACTCGGTCTCCTTCCATTGGAGGTTGTTTTCTAGGCTCAACAC
+SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
?;=ADDDDF@C3ACE:E?FED+CF>AABGFFB:?10?:BDDFB?@3BFFEEF
@SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
TTGATAGGGGAGATGCTAGCAAAAAGGTGTACTTCTCAGCGGAGCAGAAAGA
+SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
CCCFFFFFHHHHHIHIGHIIIGGIHII?DGHIIIIIIEHCHIIIIIIHIHHI
@SRR1544694.3 Run0199_AC237YACXX_L2_T1101_C54 length=52
TTTTTGGGGGGGAATTCTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCA
The aim is to compute the percentage of G and C bases in each sequence line (the ATGC lines, i.e. the second line of every four-line record). The real files will have millions of lines. The expected output is:
File1 File2
48.0769 48.0769
46.1538 46.1538
42.3077 42.3077
32.6923 32.6923
51.9231 51.9231
42.3077 42.3077
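As a check on the arithmetic behind these values, here is a minimal sketch of the per-read calculation. The function name gc_percent is hypothetical and not part of my script. The first read in the sample contains 25 G/C bases out of 52, so it should give 25/52*100 = 48.0769 with awk's default output format.

```shell
# Hypothetical helper (illustration only): GC percentage of one sequence string.
gc_percent() {
    printf '%s\n' "$1" | awk '{
        n  = length($0)             # read length, taken before gsub changes $0
        gc = gsub(/[GCgc]/, "")     # gsub returns the number of replacements made
        if (n > 0) print gc / n * 100
    }'
}

# First read from the sample input: prints 48.0769
gc_percent AGTAAAGGGACTCGGTCTCCTTCCATTGGAGGTTGTTTTCTAGGCTCAACAC
```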
I have tried the code below. It runs the calculation on the sequence lines and writes the results to one output file per input file. If the output redirection is dropped, all values are printed as a single column.
awk '
FNR==1{                                        # first record of an input file?
    if(o)close(o)                              # close the previous output file, if any
    o=FILENAME;sub(/\.fastq$/,"_sorted.txt",o) # new output file name
}
FNR%4==2{                                      # sequence line: 2nd line of each 4-line record
    n=length($1); gc=gsub(/[gcGC]/,"",$1)      # read length, then number of G/C bases
    if(n)print gc/n*100 >o                     # GC percentage; skip empty lines
}
' *.fastq
Is there a way, using awk (I mainly want to learn the tool), to write all the calculations to a single file, with one column per input file?
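To make the goal concrete, here is a rough sketch of one way this might be done in a single awk pass. It is only an illustration, not a tested solution for real data. The assumptions are mine: the function name gc_columns is hypothetical, column headers are taken from the file names with .fastq removed, and every value is held in memory until the END block, which could be a problem with millions of reads across many files.

```shell
# Hypothetical wrapper (illustration only): one GC% column per input file.
gc_columns() {
    awk '
    FNR == 1 {                        # first line of each input file
        nfiles++
        name = FILENAME
        sub(/\.fastq$/, "", name)     # header: file name without the extension
        header[nfiles] = name
    }
    FNR % 4 == 2 {                    # sequence line of each 4-line record
        n   = length($0)              # read length, taken before gsub changes $0
        gc  = gsub(/[GCgc]/, "")      # number of G/C bases
        row = (FNR + 2) / 4           # 1-based read index within the current file
        cell[row, nfiles] = (n ? gc / n * 100 : 0)
        if (row > nrows) nrows = row
    }
    END {
        # header row, then one row per read index; fields separated by OFS
        for (f = 1; f <= nfiles; f++)
            printf "%s%s", header[f], (f < nfiles ? OFS : ORS)
        for (r = 1; r <= nrows; r++)
            for (f = 1; f <= nfiles; f++)
                printf "%s%s", cell[r, f], (f < nfiles ? OFS : ORS)
    }' "$@"
}
```

It would be called as gc_columns *.fastq > gc_all.txt (the output name is just an example). If the files contain different numbers of reads, the shorter columns are left empty in the extra rows. Storing the values in cell[row, file] is what allows the columns to be printed side by side, since awk reads the files one after another rather than in parallel.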