0

I have two dataframes. The first dataframe is a list of genetic variants, their identifiers, and their position on a chromosome. The second is a list of genes where columns in each row specify the start and stop position of a gene on a chromosome.

I want to see which genetic variants fall within a gene's 'range' denoted by the start_20 and stop_20 cols. A genetic variant may fall into the range of more than 1 gene. For example, here snp "rs1" will map to gene A and gene B.

This is what I have tried so far:

df of genes ranges

chromosome<-c("1", "1", "2")
start_20<-c("1", "1", "5")  
stop_20<-c("4", "4", "6")  
gene<-c("A", "B", "C") 

genelist=data.frame(chromosome, start_20,  stop_20, gene,stringsAsFactors=F  )

df of snps and their positions

chromosome<-c("1", "2")
snp<-c("rs1", "rs2")
position<-c("3", "5") 

snplist=data.frame(chromosome,snp,position,stringsAsFactors=F)

Aim is to match snps to genes via base pair positions (i.e snp 1 has a position of '3' meaning it maps to gene A and gene B).

genelist.bychrome <- vector("list", 2)

List of genes by chromosome.

for(i in 1:2) genelist.bychrome[[i]] <- genelist[genelist[,"chromosome"]==i,]  

Empty container of length nrow(snplist) Put matched genes in here if you find one

gene.matched <- rep("",nrow(snplist))

gene.matched<-as.list(gene.matched)

#looping across each observation in snplist

    for(i in 1:nrow(snplist)){

# snplist[i,"chromosome"] is the chromosome of interest
# Because of consecutive ordering genelist.bychrome[[3]] gives the genelist       for chromosome 3

 Therefore, genelist.bychrome[[ snplist[i,"chromosome"] ]] gives the genelist for the chromosome of interest

 VERY IMPORTANT: get.gene gives the index in genelist.bychrome[[     snplist[i,"chromosome"] ]], NOT genelist

        if(snplist[i,"chromosome"] <= 1){ 

                get.gene<- which((genelist.bychrome[[ snplist[i,"chromosome"] ]][,"stop_20"] >= snplist[i,"position"])  &    

            # get matching list element of genelist.bychrome 
            # in this element collect indices for rows where stop position is greater than the postion of the snp and
            # start position is less than the position of the snp
            # this should collect multiple rows for some snps   
            # dump the gene for this index in the matching element of gene.matched 
            # i.e get.gene<- which(genelist.bychrome[[1]]  [,"stop_20"] >= snplist[1,3])  & (genelist.bychrome[[1]]  [,"start_20"] <= snplist[1,3])
            # gene.matched <- genelist.bychrome[[1]][get.gene,"gene"]


                    ( genelist.bychrome[[ snplist[i,"chromosome"] ]][,"start_20"] <= snplist[i,"position"])) # correct                  
                        if(length(get.gene)!=0) gene.matched[i]<- genelist.bychrome[[ snplist[i,"chromosome"] ]][get.gene,"gene"]
                             } 

                                    } # end for()

#bind the matched genes to the snplist
    snplist.new <- cbind(snplist,gene.matched)

Any tips would be much appreciated! Thank you.

zx8754
  • 52,746
  • 12
  • 114
  • 209
avari
  • 105
  • 5
  • You say, "snp 1 has a position of '3' meaning it maps to gene 7 and 8". How do you know position 3 maps to 7 & 8? – User7598 Apr 26 '16 at 17:16
  • Hello, sorry I just realized that was a mistake in my question (I wrote a new version to make it more intuitive). Gene 7 and 8 should now read gene A and gene B. – avari Apr 26 '16 at 21:08

2 Answers2

1

I believe your problem is in the which statement inside the For loop. As it is this line would work as intended if you add as.numeric(). Try this, genelist.bychrome[[as.numeric(snplist[i,"chromosome"]) ]]

But in general, I'd suggest defining numeric vectors as numeric data types unless you have a reason to do otherwise, such as your chromosome vector can be defined as c(1,1,2) instead of c("1","1","2"). And also use as.numeric() whenever you're working with numeric data defined as strings, such as when referring to list index, comparison operations, etc.

let me know if this works.

Dhiraj Shekar
  • 361
  • 3
  • 5
  • Thanks that indeed helped! I will make sure I remember to do that in the future. However now the vectors in the loop don't match up as I get the 'number of items to replace is not a multiple of replacement length' message. Given that a snp may match to more than one gene I want gene.matched to be list of character vectors of length 'snplist' like so: gene.matched <- list(c("geneA", "geneB"), c("geneC")) which I can append to the snplist. Having some trouble trying to do this ... – avari Apr 27 '16 at 11:54
0

Update: issue resolved by removing the first if statement, casting the vectors 'as.numeric' as suggested, and making a empty list of no predetermined length at the start (which may be not always be a good idea I would guess).

Thanks!

#make data frames

     genelist = data.frame(chromosome=c(1,1,2),start_20=c(1, 1, 5), stop_20=c(4, 4, 6), gene=c("A", "B", "C"), stringsAsFactors=F)

     snplist=data.frame(chromosome=c(1,2),snp=c("rs1", "rs2"),position=c(3,5),stringsAsFactors=F)

 #objective is to get genes per snp

    genelist.bychrome <- vector("list", 2)
    for(i in 1:2) genelist.bychrome[[i]] <-   genelist[genelist[,"chromosome"]==i,]  

    gene.matched <- list()

 #looping across each observation in snplist

    for(i in 1:nrow(snplist)){

                get.gene<-  which((genelist.bychrome[[as.numeric(snplist[i,"chromosome"]) ]] [,"stop_20"] >=  snplist[i,"position"])  &                         
                    (genelist.bychrome[[as.numeric(snplist[i,"chromosome"]) ]] [,"start_20"] <= snplist[i,"position"]))

                        if(length(get.gene)!=0) gene.matched[[i]]<- genelist.bychrome[[ snplist[i,"chromosome"] ]][get.gene,"gene"]

                            }                   

    names(gene.matched)=snplist$snp
avari
  • 105
  • 5