1

I have a large datasets (d1)like this below:

SNP Position    Chromosome
rs1 10010   1
rs2 10020   1
rs3 10030   1
rs4 10040   1
rs5 10010   2
rs6 10020   2
rs7 10030   2
rs8 10040   2
rs9 10010   3
rs10 10020  3
rs11 10030  3
rs12 10040  3

I also have a dataset(d2) like below:

SNP Position    Chromosome
rsA 10015   1    
rsB 10035   3

Now, i want to select a range of SNPs in d1 based on the d2(Position+-5 and the same chromosome), and write the results to txt file, the results should be like this:

SNP(d2) SNP(d1) Position(d1)    Chromosome
rsA rs2 10020   1
rsA rs3 10030   1
rsB rs11    10030   3
rsB rs12    10040   3 

I am new in R, can anyone please tell me how to do that in R? You kind of reply are highly appreciated.

Brandon Bertelsen
  • 43,807
  • 34
  • 160
  • 255
user2669497
  • 157
  • 2
  • 11
  • People are likely to downvote your question until you show us what you've tried, or at the very least provide a reproducible example (see here: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example ) – Brandon Bertelsen Aug 10 '13 at 00:03

2 Answers2

3

Doing a merge by "Chromosome" column (like joining 2 tables in a database on that column):

mrg <- merge(x = d1, y = d2, by = c("Chromosome"), all.y = TRUE)

Then filtering rows where positions diff <= 5:

result <- mrg[abs(mrg$Position.x - mrg$Position.y) <= 5,]

Will give you the desired result.

manji
  • 47,442
  • 5
  • 96
  • 103
1
d2$low <- d2$Position-5 ; d2$high<- d2$Position+5

You might think that you could do somehting like :

d2$matched <- which(d1$Position >=d2$low & d2$high >= d1$Position)

.... but not really, so you need something a bit more involved.:

 d1$matched <- apply(d1, 1, function(p) 
                            which(p['Position'] >=d2[,'low'] & 
                                  d2[,'high'] >= p['Position'] & 
                                  p['Chromosome']==d2[,"Chromosome"]) )

Basically this is checking on each row of d1 whether there is a potential match in d2 in range and on hte same chromosome:

 d1        # take a look
 # Then bind matching cases together
 cbind( d1[ which(d1$matched > 0), ], 
        d2[ unlist(d1$matched[which(d1$matched>0)]), ] )
#--------------------
    SNP Position Chromosome matched SNP Position Chromosome   low  high
1   rs1    10010          1       1 rsA    10015          1 10010 10020
2   rs2    10020          1       1 rsA    10015          1 10010 10020
11 rs11    10030          3       2 rsB    10035          3 10030 10040
12 rs12    10040          3       2 rsB    10035          3 10030 10040
IRTFM
  • 258,963
  • 21
  • 364
  • 487