Group or bin rows by bp position values in defined intervals and sum their counts (in R or python)

Question

I have a file of SNPs for two mouse strains (B6 and CAST) in this format:

Chr     Position  B6.SNP CAST.SNP  B6.Count  CAST.Count
chr17   1000      A      G         102       82
chr17   2000      A      G         91        76
chr17   5000      C      T         55        87
chr17   7500      G      A         70        36
chr17   10200     A      G         83        45
chr17   17000     C      T         34        91
chr17   20000     G      A         95        46

I would like to group the SNPs (i.e rows) by 'Chr' and 'Position' in 10,000 bp bins along the chromosome; in other words, group all SNPs that fall within 0-10,000bp, then 10,0001-20,000bp, and so forth). In addition, for all the SNPs that fall within each bin, I want to sum their B6 and CAST counts, as well as create a new column that includes the number of SNPs that fell within each bin.

So perhaps an output file (using the example above):

 Chr    Start   End     SNP.Count   B6.Count.Sum    CAST.Count.Sum
 chr17  0       10000   4           318             281
 chr17  10001   20000   3           212             182

Thanks in advance.

See if [this post](https://stackoverflow.com/questions/11963508/define-and-apply-custom-bins-on-a-dataframe) can help. — Rui Barradas, May 05 '21 at 14:34

Eric Davis · Answer 1 · 2021-05-05T23:48:55.890

Here is a start using GenomicRanges, TxDb, and plyranges in R:

## Read snp data
snps <- read.table(text = "
             Chr     Position  B6.SNP CAST.SNP  B6.Count  CAST.Count
  chr17   1000      A      G         102       82
  chr17   2000      A      G         91        76
  chr17   5000      C      T         55        87
  chr17   7500      G      A         70        36
  chr17   10200     A      G         83        45
  chr17   17000     C      T         34        91
  chr17   20000     G      A         95        46
  
             ", header = T)

## Convert to GRanges object
library(GenomicRanges)
gr <- GRanges(seqnames = snps$Chr, ranges = IRanges(start = snps$Position,
                                                    end = snps$Position))
mcols(gr) <- snps[,3:6]


## Use txdb to assign seqinfo
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene

names_txdb <- seqnames(seqinfo(txdb))
names_gr   <- seqnames(seqinfo(gr))

seqinfo(gr) <- seqinfo(txdb)[names_txdb[names_txdb %in% names_gr]]

## Make windows
windows <- tileGenome(seqinfo(gr), tilewidth = 10000, cut.last.tile.in.chrom = TRUE)

## Use plyranges to summarize by group overlap
library(plyranges)

df <- 
  windows %>%
  group_by_overlaps(gr) %>%
  summarise(B6.Count.Sum = sum(B6.Count),
            CAST.Count.Sum = sum(CAST.Count),
            SNP.Count = n())

binnedGR <- windows[df$query] %>% `mcols<-`(value = df[-1])

## Result as GRanges or as data.frame
binnedGR
as.data.frame(binnedGR)

Group or bin rows by bp position values in defined intervals and sum their counts (in R or python)

1 Answers1