1

I would like to know if there is a way to have as output a single file where each column has some computation numbers taken from multiple files. My input is:

@SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
AGTAAAGGGACTCGGTCTCCTTCCATTGGAGGTTGTTTTCTAGGCTCAACAC
+SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52 
?;=ADDDDF@C3ACE:E?FED+CF>AABGFFB:?10?:BDDFB?@3BFFEEF 
@SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
TTGATAGGGGAGATGCTAGCAAAAAGGTGTACTTCTCAGCGGAGCAGAAAGA
+SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
CCCFFFFFHHHHHIHIGHIIIGGIHII?DGHIIIIIIEHCHIIIIIIHIHHI 
@SRR1544694.3 Run0199_AC237YACXX_L2_T1101_C54 length=52
TTTTTGGGGGGGAATTCTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCA

The aim is to count the percentage of G and C elements in the lines in the ATGC lines (second row and every 4 rows). The real files will have millions of lines. The expected output should be:

File1   File2
48.0769 48.0769   
46.1538 46.1538 
42.3077 42.3077 
32.6923 32.6923  
51.9231 51.9231  
42.3077 42.3077    

I have tried the code below. It outputs the calculations done in specific lines, to a single file matching each original file. If the output is not defined, it will print a single column.

awk '
FNR==1{   # first record of an input file?
  if(o)close(o);   # was previous output file? close it
  o=FILENAME;sub(/\.fastq/,"_sorted.txt",o)   # new output file name
}
{
  if(NR%4==2){n=length($1); gc=gsub("[gcGC]", "", $1); print gc/n*100 >o}
}
' *.fastq 

I would like to know if there is a way, using awk (especially to learn the tool) to have all the calculations in a single file, column separated.

Pestana
  • 23
  • 7
  • Welcome to SO, could you please post sample Input and expected output in your post and let us know then. – RavinderSingh13 Oct 22 '19 at 14:31
  • Yes there is a way. Please [edit] your question to provide concise, testable sample input and expected output if you need any more information than that. – Ed Morton Oct 22 '19 at 15:05

1 Answers1

0

With the following I recreated the input files:

cat >1.fastq <<EOF
@SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
AGTAAAGGGACTCGGTCTCCTTCCATTGGAGGTTGTTTTCTAGGCTCAACAC
+SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52 
?;=ADDDDF@C3ACE:E?FED+CF>AABGFFB:?10?:BDDFB?@3BFFEEF 
@SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
TTGATAGGGGAGATGCTAGCAAAAAGGTGTACTTCTCAGCGGAGCAGAAAGA
+SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
CCCFFFFFHHHHHIHIGHIIIGGIHII?DGHIIIIIIEHCHIIIIIIHIHHI 
@SRR1544694.3 Run0199_AC237YACXX_L2_T1101_C54 length=52
TTTTTGGGGGGGAATTCTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCA
EOF

cat >2.fastq <<EOF
@SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
AGTAAAGGGACTCGGTCTCCTTCCATTGGAGGTTGTTTTCTAGGCTCAACAC
+SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
?;=ADDDDF@C3ACE:E?FED+CF>AABGFFB:?10?:BDDFB?@3BFFEEF
@SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
TTGATAGGGGAGATGCTAGCAAAAAGGTGTACTTCTCAGCGGAGCAGAAAGA
+SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
CCCFFFFFHHHHHIHIGHIIIGGIHII?DGHIIIIIIEHCHIIIIIIHIHHI
@SRR1544694.3 Run0199_AC237YACXX_L2_T1101_C54 length=52
TTTTTGGGGGGGAATTCTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCA
EOF

The following script with comments:

# print the headers
printf "%s\n" *.fastq | paste -sd' '
# merge all files
paste *.fastq |
# for each line for each the field print the counts
awk 'NR % 4 == 2{
        for (i = 1; i <= NF; ++i) {
                n = length($i); gc = gsub("[gcGC]", "", $i); res = gc/n*100
                printf res (i == NF ? "\n" : "\t")
        }
}' "$i"

will generate the following output (both files are the same, it's just to illustrate it works):

1.fastq 2.fastq
48.0769 48.0769
46.1538 46.1538
42.3077 42.3077

I didn't modify the logic behind calculating the percentages, only modified how the results are merged.

I leave my other tries to do it below:

Alternatively with the following script, that folds the result using paste not inside awk:

output=""
for i in *.fastq; do
        output=$(
                printf "%s\n" "$output" | 
                # merge line-wise output with another results
                paste - <(
                        echo "$i"
                        awk '{ if(NR%4==2){n=length($1); gc=gsub("[gcGC]", "", $1); print gc/n*100} }' "$i"
                )
        )
done
# in output theere will be leading tabs
# remove them
output=$(printf "%s" "$output" | cut -f2-)

echo "$output"

If there are really million of lines and files, using temporary file for paste output instead of storing it in a variable will save ton of memory.

Alternatively it to generate all results line-wise and then for example transpose the output using awk script from stackoverflow:

for i in *.fastq; do
        printf "%s" "$i"
        awk '{ if(NR%4==2){n=length($1); gc=gsub("[gcGC]", "", $1); printf " " gc/n*100} }' "$i"
        echo
done |
awk '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}'
KamilCuk
  • 120,984
  • 8
  • 59
  • 111
  • Hi Kamil! So far I have been testing your second solution, placing awk inside paste command. It works really well!! I will try the last one as well and will update asap. Shouldn't storing in a variable work faster than with a temp file? I do not have any memory or cpu/disk constraint. – Pestana Oct 28 '19 at 08:09
  • `storing in a variable work faster than with a temp file` - faster, yes. More memory consuming - also yes. The file can be big enough, to drain you memory. So for very big files, you can use a temp file. – KamilCuk Oct 28 '19 at 09:20