Efficient way to TRANSLATE every Nth string in bash or R

Question

Thank you for taking the time to look at this.

I have a FASTQ file and I want to convert every read to its complement (not the reverse complement), something like this:

@Some header example:1:
ACTGAGACTCGATCA
+
S0m3_Qu4l1t13s&

Translated to

@Some header example:1:
TGACTCTGAGCTAGT
+
S0m3_Qu4l1t13s&

And the code I used is:

awk '{
  if(NR==100000){break} 
  else if((NR+2) % 4 ==0 ){ system("echo " $0 "| tr ATGC TACG") }
  else print $0}' MyFastqFyle.fastq > MyDesiredFile.fastq

And it works! But this approach is very slow, even with small files (250M). Is there a faster way to get this done? It doesn't matter whether it is in R, bash, or something similar.

(I looked at Biostrings, but I only found the reverse complement function, and there are some issues with the header starting with "@" instead of ">".)

`chartr("TAGC", "ATCG", "ACTGAGACTCGATCA")` in plain R code – Rich Scriven Apr 08 '15 at 21:34
How do I apply that to the whole file (only for every 4th row)? – Edahi Apr 08 '15 at 22:05

score 3 · Accepted Answer · answered Apr 08 '15 at 21:28

This is slow because system() spawns a shell, and a tr process inside it, for every sequence line. Just do it with sed:

sed '2~4 y/ATGC/TACG/' MyFastqFyle.fastq > MyDesiredFile.fastq

The 2~4 address (every 4th line, starting from line 2) is a GNU sed extension, so I hope you're not on Mac OS X. If you are,

sed 'n; y/ATGC/TACG/; n; n' MyFastqFyle.fastq > MyDesiredFile.fastq

should work.
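To try this on a small input before running it on a real file, the record from the question can be piped through the portable form. This is only a minimal sketch: it assumes a POSIX shell and strictly four-line FASTQ records (no wrapped sequence lines), and the function name complement_fastq is made up for illustration.

```shell
#!/bin/sh
# complement_fastq: hypothetical helper, not part of any existing tool.
# Reads FASTQ on stdin and writes it to stdout with the second line of
# every 4-line record complemented (A<->T, G<->C), without reversing it.
# Assumes unwrapped records: header, sequence, "+", qualities.
complement_fastq() {
  # "n" prints the current line and loads the next one, so each cycle is:
  # header -> n -> sequence (translated by y) -> n -> "+" -> n -> qualities
  sed 'n; y/ATGC/TACG/; n; n'
}

# The record from the question; the sequence line comes out as
# TGACTCTGAGCTAGT and the other three lines are unchanged.
printf '%s\n' '@Some header example:1:' 'ACTGAGACTCGATCA' '+' 'S0m3_Qu4l1t13s&' |
  complement_fastq
```

Restricting the translation to the sequence line matters because quality strings can themselves contain the letters A, T, G and C. Lowercase bases and N are left as they are, since y only maps the four characters listed.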

Wintermute

Thanks! That's it. I wanted to accept this answer but I need to wait 6 more minutes, ha – Edahi Apr 08 '15 at 21:31

score 1 · Answer 2 · answered Apr 08 '15 at 22:53

Here is the solution using Biostrings (and ShortRead):

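library(ShortRead)
# sread() extracts just the reads from the parsed FASTQ as a DNAStringSet
raw <- sread(readFastq("MyFastqFyle.fastq"))
# complement() swaps A<->T and C<->G without reversing the sequences
complemented <- complement(raw)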

Michael Lawrence
