Pileup() function is over-counting the number of reads at variant location. (I inherited this code from someone.)
I've looked through the pileup parameters but none seem to cause over-counting. I visualized variants in IGV and saw the number of reads don't match with what pileup is reporting. Pileup is accurate when it reports read counts for variants with low number of reads (ie. 3, or 8). Pileup is inaccurate when it reports read counts for variants with high number of reads (ie. 20, 60).
res <- pileup( bf, snp,
scanBamParam=ScanBamParam( flag = scanBamFlag( hasUnmappedMate=F, isProperPair=T, isDuplicate=F ), which = snp ),
pileupParam=PileupParam( distinguish_strands=F, min_base_quality=10, max_depth=1e4 ) )
where bf is bam file, snp is grange object listing all snps found.
> bf
class: BamFile
path: Sample_4.split.bam
index: Sample_4.split.bai
isOpen: FALSE
yieldSize: NA
obeyQname: FALSE
asMates: FALSE
qnamePrefixEnd: NA
qnameSuffixStart: NA
> head(snp)
GRanges object with 6 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID REF
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
rs13475703 chr1 5124338 + | <NA> A
rs6408157 chr1 9547068 + | <NA> A
ALT QUAL FILTER
<DNAStringSetList> <numeric> <character>
rs13475703 G 233.77 PASS
rs6408157 G 52.77 PASS
-------
seqinfo: 66 sequences from mm10 genome
> res[66,]
seqnames pos nucleotide count which_label
66 chr1 87942764 G 61 chr1:87942764-87942764
For the example above, for the G variant at chr1:87942764, pileup reports 61 reads. When I open the bam file in IGV, I only count 41 reads supporting the G variant.
There are no error messages. Just that the counts reported do not match with what I see in IGV.