
I have a dataset of SNPs that looks somewhat like this:

           Position   Gprime Score Gene Location     
    SNP 1   500         3.5            NA
    SNP 2   1200        1.2            NA

and a dataset of genes that looks like this:

    Name   Interval Start  Interval End  AVG Gprime
    Gene 1   400             1300          NA
    Gene 2   1100            1500          NA

The genes have overlapping intervals. One gene can contain multiple SNPs (e.g. both SNP 1 and SNP 2 fall within Gene 1), and one SNP can fall into multiple genes (e.g. SNP 2 is in both Gene 1 and Gene 2).

I want to write an ifelse statement that takes the average Gprime score of all SNPs that fall within a gene region and writes that score to the AVG Gprime column. I already have code that matches SNPs to the genes they fall into and writes the gene name to the SNP dataset. The problem is that the ifelse writes only one gene name per SNP, even though one SNP can fall into multiple genes:

    Genes$NAME = as.character(Genes$NAME)  ## required to return the name rather than a factor
    SNPs$Gene.Location = ifelse(sapply(SNPs$Position, function(p) any(Genes$Low.Interval <= p & Genes$High.Interval >= p)), Genes$NAME, "NO")
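
For illustration, here is a minimal base R sketch of the result I am after, built on the toy data above. It is not a definitive approach: the column names `Gprime` and `AVG.Gprime` are assumptions (the other names follow my code above), and I am treating the intervals as inclusive at both ends. It builds a SNP-by-gene membership matrix, so it may not scale well to very large datasets.

```r
# Toy data mirroring the tables in the question. Gprime is an assumed column
# name; Position, NAME, Low.Interval and High.Interval follow the code above.
SNPs <- data.frame(
  SNP      = c("SNP 1", "SNP 2"),
  Position = c(500, 1200),
  Gprime   = c(3.5, 1.2),
  stringsAsFactors = FALSE
)
Genes <- data.frame(
  NAME          = c("Gene 1", "Gene 2"),
  Low.Interval  = c(400, 1100),
  High.Interval = c(1300, 1500),
  stringsAsFactors = FALSE
)

# Logical matrix: rows are SNPs, columns are genes; TRUE where the SNP
# position lies inside the gene interval (inclusive at both ends).
inside <- outer(SNPs$Position, Genes$Low.Interval, ">=") &
          outer(SNPs$Position, Genes$High.Interval, "<=")

# Mean Gprime of the SNPs in each gene; NA for a gene with no SNPs.
# AVG.Gprime is an assumed column name.
Genes$AVG.Gprime <- apply(inside, 2, function(hit) {
  if (any(hit)) mean(SNPs$Gprime[hit]) else NA_real_
})

# All gene names per SNP, joined with ";"; "NO" when no gene matches.
SNPs$Gene.Location <- apply(inside, 1, function(hit) {
  if (any(hit)) paste(Genes$NAME[hit], collapse = ";") else "NO"
})
```

With this toy data, Gene 1 would get the mean of both SNP scores and Gene 2 would get the score of SNP 2 alone, while SNP 2 would list both gene names.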
Phil
krista
    Check `fuzzyjoin` package - https://stackoverflow.com/questions/62912260/mutate-between-dates-from-external-lookup-table/ – Ronak Shah Jul 23 '20 at 08:01

0 Answers