I have quality controlled paired end environmental transcriptomic data that I want to map against a reference database of 8 cyanobacterial genomes. I made this reference by joining together the fasta files of each genome.
I performed this mapping with BWA v0.7.3a but noticed that when I tried to count the reads mapped to each gene in the reference with htseq-count, the SAM file output from BWA was corrupted.
Here is my BWA command:
bwa mem -k 10 -aM -t 128 Merged_reference Input_file1 Input_file2 > Out.sam
Here is my HTSeq command:
htseq-count -f sam -r pos -a 0 -m intersection-strict -s no Out.sam Merged_ref2.gtf2 > HTSeq_out
This is the error I see when running HTSeq:
[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1_sam] Parse error at line 8839134
Error occured when processing input (record #8809785 in file Out.sam):
truncated file
[Exception type: OSError, raised in libcalignmentfile.pyx:1876]
Here is what the SAM file looks like at lines 8839134 and 8839135:
A00521:380:HFKMLDSX5:4:1163:32081:25332 65 lcl|NC_013771.1_cds_WP_012953431.1_1 0 0 268435455S89S lcl|NC_013771.1_cds_WP_012954293.1_845 87 0 Sequence Quality NM:i:8388607 AS:i:73 XS:i:72
A00521:380:HFKMLDSX5:4:1163:32081:25332 129 lcl|NC_013771.1_cds_WP_012954293.1_845 87 0 2S15M72S lcl|NC_013771.1_cds_WP_012953431.1_1 0 0 Sequence Quality NM:i:0 AS:i:15 XS:i:15
I know that if I change these lines to the following and run htseq on a subset of my data up to the changed lines the error disappears:
A00521:380:HFKMLDSX5:4:1163:32081:25332 65 * 0 0 * * 0 0 Sequence Quality NM:i:8388607 AS:i:73 XS:i:72
A00521:380:HFKMLDSX5:4:1163:32081:25332 129 * 0 0 * * 0 0 Sequence Quality NM:i:0 AS:i:15 XS:i:15
While this change makes sense to me because the weird sequence causing this should be treated as an unmapped read, unfortunately I have noticed that this error occurs approximately every 4 million lines in my SAM file. Since I have ~115 million lines in the SAM file and 60+ SAM files I would like to get to the bottom of this to efficiently solve this problem.
Has anyone else seen this issue? I can't figure out why BWA would be making this mistake!
A few other notes, every time there is this error it is triggered by the lcl|NC_013771.1_cds_WP_012954293.1_845 sequence. This sequence is the first one in my reference file.