3

My aim is to efficiently convert bam files containing bowtie2 mappings of short DNA sequencing reads to a simple table containing a the start of the mapping and the percentage identity. I am close to accomplishing this, however my solution is very slow and cannot handle an important exception. I will explain the situation step by step:

We start with a string like this

    FCC5G2YACXX:5:1101:1224:2059#NNNNNNNN   97  genome  96003934    24  118M4D11M   =   96004135    0   GCA....ACG  P\..GW^EO   AS:i:-28    XN:i:0  XM:i:2  XO:i:1  XG:i:4  NM:i:6  MD:Z:54G53T9^TACA11 YT:Z:UP

We take only the fourth column and the MD:Z:54G53T9^TACA11 part, the latter represents matches (numbers) and mismatches (letters). Except if letters are preceded by an '^' then letters are deletions in the reference sequence.

Then I compute this into a string like this 96003934 98.00

Where the first column is identical to the fourth column of the original input, and the second column is the matches, divided by the sum matches and mismatches.

Since I prefer bash/zsh scripts I did the following

    if_sam2tab() {
    tab=$(echo $1 | rev | cut -d '.' -f 2- | rev)
    if      [ ! -e ./bowtie_results/$tab.tab ]
    then    echo -e "rstart\tmatch\tmismatch" > ./bowtie_results/$tab.tab
            while   read -r l
            do      rstart=$(echo $l | cut -f 4 -d " "      )
                    t=$(echo $l | grep -o 'MD:Z:[0-9A-Z]*' )
                    match=$(echo ${t: 5} | tr '[a-zA-Z]' '+' | bc )
                    mismatch=$(echo ${t: 5} | tr -d '[0-9]\n' | wc -c )
                    sum=$(echo "$match + $mismatch" | bc )
                    id=$(echo "scale=2; $match / $sum *100"         | bc  )
                    echo -e "$rstart\t$id" >> ./bowtie_results/$tab.tab
            done < <(grep 'MD:Z' ./bowtie_results/$1 )
    fi
    }

This solution however is faulty since it regards deletions such as ^TCTAAG as mismatches.

Secondly it seems terribly slow to me, even when I run them in parallel five of these functions on a six cpu's.

Recapitulating, I aim to efficiently convert a string with mapping information, to identity percentages.

Thanks for your consideration

Laura
  • 105
  • 5
  • The terms `efficiently` and `shell scripting` do not go together. shell scripts are always terribly slow. Usually speed is not a concern when using the shell. If it is, then consider moving to a scripting language. – cel Sep 23 '15 at 15:41
  • bam files? like how sam files .... – Jose Ricardo Bustos M. Sep 23 '15 at 20:13
  • Thanks for having a look! Indeed I am aware that shell scripting is not the way to go if speed is essential. I prefer shell scripting, since it is the only kind I am acquainted with and usually is sufficiently fast to study some interesting biology. I chose not to specify this within the question to keep it concise and relevant. Once this paper is out, I hope to find some time to learn a scripting language. – Laura Sep 25 '15 at 15:40

2 Answers2

2

An addition to the previous answer by Jose: The MD string is not necessarily the 18th column. Thus I added a line in Jose's awk script to find the MD string regardless of location in the line. Undoubtedly, adding a regex step in the process will slow it down as a whole.

awk '{
 match($0, /MD:Z:[0-9A-Z\^]*/,m );
 split(m[0],v,/[\^:]/);
 nmatch = split(v[3],vmatch, /[^0-9]/);
 cmatch=0;
 for(i=1; i<=nmatch; i++) cmatch+=vmatch[i];
 printf("%s"OFS"%.2f\n", $4, cmatch*100/(cmatch+nmatch-1));
}' 

I chose to pipe bowtie output directly in this awk script without saving it on disk. In my limited testing, I have not found disk i/o to be limiting nor did I find that piping bowtie2 output directly into awk substantially reduces overall performance when run simultaniously.

Laura
  • 105
  • 5
1
  • Input and output are much faster using binary data (bam file), but you have a sam file (flat version)

  • As @cel already commented before, always a program made with compiled languages (C, C++, etc) is faster than script (ej, bash, awk, etc.)

I show a script in awk, for you to try

awk '{
    split($18,v,/[\^:]/); 
    nmatch = split(v[3],vmatch, /[^0-9]/); 
    cmatch=0; 
    for(i=1; i<=nmatch; i++) cmatch+=vmatch[i]; 
    printf("%s"OFS"%.2f\n", $4, cmatch*100/(cmatch+nmatch-1));
}' file.sam

you get

96003934 98.31

explanation

Column 18: MD:Z:54G53T9^TACA11

match = 54+53+9 = 116

mismatch = count_letter(54G53T9) = 2

id = 116 / (116+2) = 0.9830508474576272

Jose Ricardo Bustos M.
  • 8,016
  • 6
  • 40
  • 62
  • Your solution seems to work! Thanks for you time. I'll take some time later on to dissect what this code exactly does, at first sight this seems biologically fine. – Laura Sep 25 '15 at 16:09