Why is pileup() function in R over-counting reads?

Question

Pileup() function is over-counting the number of reads at variant location. (I inherited this code from someone.)

I've looked through the pileup parameters but none seem to cause over-counting. I visualized variants in IGV and saw the number of reads don't match with what pileup is reporting. Pileup is accurate when it reports read counts for variants with low number of reads (ie. 3, or 8). Pileup is inaccurate when it reports read counts for variants with high number of reads (ie. 20, 60).

res <- pileup( bf, snp, 
        scanBamParam=ScanBamParam( flag = scanBamFlag( hasUnmappedMate=F, isProperPair=T, isDuplicate=F ), which = snp ), 
        pileupParam=PileupParam( distinguish_strands=F, min_base_quality=10, max_depth=1e4 ) )

where bf is bam file, snp is grange object listing all snps found.

> bf
class: BamFile 
path: Sample_4.split.bam
index: Sample_4.split.bai
isOpen: FALSE 
yieldSize: NA 
obeyQname: FALSE 
asMates: FALSE 
qnamePrefixEnd: NA 
qnameSuffixStart: NA 

> head(snp)
GRanges object with 6 ranges and 5 metadata columns:
              seqnames    ranges strand | paramRangeID            REF
                 <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
   rs13475703     chr1   5124338      + |         <NA>              A
    rs6408157     chr1   9547068      + |         <NA>              A

                             ALT      QUAL      FILTER
              <DNAStringSetList> <numeric> <character>
   rs13475703                  G    233.77        PASS
    rs6408157                  G     52.77        PASS

  -------
  seqinfo: 66 sequences from mm10 genome

> res[66,]
   seqnames      pos nucleotide count            which_label
66     chr1 87942764          G    61 chr1:87942764-87942764

For the example above, for the G variant at chr1:87942764, pileup reports 61 reads. When I open the bam file in IGV, I only count 41 reads supporting the G variant.

There are no error messages. Just that the counts reported do not match with what I see in IGV.

We have no hope of helping you if we don't have the code for `pileup`. — Hugh, Aug 22 '19 at 15:15
It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Be sure to clearly list all the R packages you are using. `pipeup` isn't a base R function. — MrFlick, Aug 22 '19 at 15:32
Hugh: pileup() is from Rsamtools package https://www.rdocumentation.org/packages/Rsamtools/versions/1.24.0/topics/pileup MrFlick: I would need to create a bam file and somehow attach it to this post. I can try doing something like that. — user11885521, Aug 22 '19 at 20:05
Do you have IGV configured to show all reads (**View > Preferences... > Alignment Track Options > Downsample Reads**)? The default setting for IGV is to downsample reads, so it's possible the visualization is not showing all the actual reads. — merv, Aug 23 '19 at 02:18
I have seen many tools silently dropping low quality reads or poorly mapped. Try to get the pileup from the samtools utility (samtools mpileup -f ref.fasta -r chr1:1000-2000 file.bam), and get a real count. — xbello, Aug 23 '19 at 11:08
You might be interested to know that there is a bioinformatics-dedicated stackexchange: bioinformatics.stackexchange.com. — bli, Aug 25 '19 at 14:07
merv: Yes your comment was the solution. I have to uncheck "filter vendor failed reads". Thank you. I'm sorry I don't know how to mark your comment as the answer. I'm still new to stackexchange. bli: Thank you for letting me know. I will join and post there instead of here in the future. — user11885521, Aug 26 '19 at 16:02
If user @merv doesn't answer, you should copy-paste his comment and make an answer to your own question. Write something like "As user merv commented...", and accept your own answer. — xbello, Aug 27 '19 at 09:56

score 0 · Accepted Answer · answered Aug 27 '19 at 17:29

As user merv commented: Do you have IGV configured to show all reads (View > Preferences... > Alignment Track Options > Downsample Reads)? The default setting for IGV is to downsample reads, so it's possible the visualization is not showing all the actual reads.

When I uncheck "filter vendor failed reads", the read counts match what is reported by pileup. Thank you merv.

Why is pileup() function in R over-counting reads?

1 Answers1