My aim is to efficiently convert bam files containing bowtie2 mappings of short DNA sequencing reads to a simple table containing a the start of the mapping and the percentage identity. I am close to accomplishing this, however my solution is very slow and cannot handle an important exception. I will explain the situation step by step:
We start with a string like this
FCC5G2YACXX:5:1101:1224:2059#NNNNNNNN 97 genome 96003934 24 118M4D11M = 96004135 0 GCA....ACG P\..GW^EO AS:i:-28 XN:i:0 XM:i:2 XO:i:1 XG:i:4 NM:i:6 MD:Z:54G53T9^TACA11 YT:Z:UP
We take only the fourth column and the MD:Z:54G53T9^TACA11
part, the latter represents matches (numbers) and mismatches (letters). Except if letters are preceded by an '^' then letters are deletions in the reference sequence.
Then I compute this into a string like this 96003934 98.00
Where the first column is identical to the fourth column of the original input, and the second column is the matches, divided by the sum matches and mismatches.
Since I prefer bash/zsh scripts I did the following
if_sam2tab() {
tab=$(echo $1 | rev | cut -d '.' -f 2- | rev)
if [ ! -e ./bowtie_results/$tab.tab ]
then echo -e "rstart\tmatch\tmismatch" > ./bowtie_results/$tab.tab
while read -r l
do rstart=$(echo $l | cut -f 4 -d " " )
t=$(echo $l | grep -o 'MD:Z:[0-9A-Z]*' )
match=$(echo ${t: 5} | tr '[a-zA-Z]' '+' | bc )
mismatch=$(echo ${t: 5} | tr -d '[0-9]\n' | wc -c )
sum=$(echo "$match + $mismatch" | bc )
id=$(echo "scale=2; $match / $sum *100" | bc )
echo -e "$rstart\t$id" >> ./bowtie_results/$tab.tab
done < <(grep 'MD:Z' ./bowtie_results/$1 )
fi
}
This solution however is faulty since it regards deletions such as ^TCTAAG
as mismatches.
Secondly it seems terribly slow to me, even when I run them in parallel five of these functions on a six cpu's.
Recapitulating, I aim to efficiently convert a string with mapping information, to identity percentages.
Thanks for your consideration