1

I have quality controlled paired end environmental transcriptomic data that I want to map against a reference database of 8 cyanobacterial genomes. I made this reference by joining together the fasta files of each genome.

I performed this mapping with BWA v0.7.3a but noticed that when I tried to count the reads mapped to each gene in the reference with htseq-count, the SAM file output from BWA was corrupted.

Here is my BWA command:

bwa mem -k 10 -aM -t 128 Merged_reference Input_file1 Input_file2 > Out.sam

Here is my HTSeq command:

htseq-count -f sam -r pos -a 0 -m intersection-strict -s no Out.sam Merged_ref2.gtf2 > HTSeq_out

This is the error I see when running HTSeq:

[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1_sam] Parse error at line 8839134
Error occured when processing input (record #8809785 in file Out.sam):
  truncated file
  [Exception type: OSError, raised in libcalignmentfile.pyx:1876]

Here is what the SAM file looks like at lines 8839134 and 8839135:

A00521:380:HFKMLDSX5:4:1163:32081:25332 65      lcl|NC_013771.1_cds_WP_012953431.1_1    0       0       268435455S89S   lcl|NC_013771.1_cds_WP_012954293.1_845  87      0       Sequence   Quality     NM:i:8388607    AS:i:73 XS:i:72

A00521:380:HFKMLDSX5:4:1163:32081:25332 129     lcl|NC_013771.1_cds_WP_012954293.1_845  87      0       2S15M72S        lcl|NC_013771.1_cds_WP_012953431.1_1    0       0      Sequence  Quality  NM:i:0  AS:i:15 XS:i:15

I know that if I change these lines to the following and run htseq on a subset of my data up to the changed lines the error disappears:

A00521:380:HFKMLDSX5:4:1163:32081:25332 65      *    0       0       *   *      0     0 Sequence   Quality     NM:i:8388607    AS:i:73 XS:i:72

A00521:380:HFKMLDSX5:4:1163:32081:25332 129     *    0       0       *   *      0     0      Sequence  Quality  NM:i:0  AS:i:15 XS:i:15

While this change makes sense to me because the weird sequence causing this should be treated as an unmapped read, unfortunately I have noticed that this error occurs approximately every 4 million lines in my SAM file. Since I have ~115 million lines in the SAM file and 60+ SAM files I would like to get to the bottom of this to efficiently solve this problem.

Has anyone else seen this issue? I can't figure out why BWA would be making this mistake!

A few other notes, every time there is this error it is triggered by the lcl|NC_013771.1_cds_WP_012954293.1_845 sequence. This sequence is the first one in my reference file.

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
  • There are weird things happening there. Did you tried to extract this read pair from the FastQ and align it alone to reproduce the problem? – Poshi Apr 07 '23 at 21:03
  • 2
    As the complaint says, a mapped read like this one should have a POS>0, and you have a POS==0. Also, the CIGAR does not make sense: 268435455S89S. What the hell that means? Two soft clips one after the other with no matches? And a soft clip of 268435455 bases? I don't believe your read have this length. Check the inputs. – Poshi Apr 07 '23 at 21:06
  • I extracted the pair with seqtk and these are the sequences: R1 @A00521:380:HFKMLDSX5:4:1163:32081:25332 1:N:0:ATTACTCG+ACGTCCTG AGGTGGGCTATCCTGGACTTGAACCAGGGACCTCACCCTTATCAGGGGTGCGCTCTAACCACCTGAGCTAATAGCCCATTCACCGAAC + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF,F – Clayton Tracey Apr 10 '23 at 19:02
  • R2 @A00521:380:HFKMLDSX5:4:1163:32081:25332 2:N:0:ATTACTCG+ACGTCCTG TGAGAAGGAAAAACCCTGGTCGAGCAAGGAAAAAGAAGAGAGTGAAGTGAACTTTCAAACTAGAGGTGAGGTTCGGTGAATGGGCTATT + FFFFFFFFF:FF:FFFFFFFFF:FF:FFFFFFFFFF:FFFFFFFFFFFF::FFF:FFF:FFFFFF:FFFFF,F,FFFFFFFFFFFFFFF – Clayton Tracey Apr 10 '23 at 19:03
  • Do you see anything unusual in the sequences? I didn't notice anything . . . Read mapping with just these sequences produced the same result. – Clayton Tracey Apr 10 '23 at 19:08
  • I cannot see anything strange with those sequences. At least, not at first glance. I cannot reproduce the issue because I don't know which reference you are using and because I don't have that BWA version at hand. Why are you not using the latest one? Your version is from March 2013... That could explain that behavior, that you are hitting a fixed bug. – Poshi Apr 10 '23 at 19:48
  • I was using the version already installed on my HPC and didn't realize it was so old. I updated to the newest version and now have no issues. Thank you! – Clayton Tracey Apr 11 '23 at 20:43

0 Answers0