Using R to select data based on another dataset

Question

I have a large datasets (d1)like this below:

SNP Position    Chromosome
rs1 10010   1
rs2 10020   1
rs3 10030   1
rs4 10040   1
rs5 10010   2
rs6 10020   2
rs7 10030   2
rs8 10040   2
rs9 10010   3
rs10 10020  3
rs11 10030  3
rs12 10040  3

I also have a dataset(d2) like below:

SNP Position    Chromosome
rsA 10015   1    
rsB 10035   3

Now, i want to select a range of SNPs in d1 based on the d2(Position+-5 and the same chromosome), and write the results to txt file, the results should be like this:

SNP(d2) SNP(d1) Position(d1)    Chromosome
rsA rs2 10020   1
rsA rs3 10030   1
rsB rs11    10030   3
rsB rs12    10040   3

I am new in R, can anyone please tell me how to do that in R? You kind of reply are highly appreciated.

People are likely to downvote your question until you show us what you've tried, or at the very least provide a reproducible example (see here: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example ) — Brandon Bertelsen, Aug 10 '13 at 00:03

manji · Answer 1 · 2013-08-10T10:32:36.413

3

Doing a merge by "Chromosome" column (like joining 2 tables in a database on that column):

mrg <- merge(x = d1, y = d2, by = c("Chromosome"), all.y = TRUE)

Then filtering rows where positions diff <= 5:

result <- mrg[abs(mrg$Position.x - mrg$Position.y) <= 5,]

Will give you the desired result.

edited Aug 10 '13 at 10:32

answered Aug 10 '13 at 00:35

manji

47,442
5
96
103

IRTFM · Accepted Answer · 2013-09-01T17:48:23.187

d2$low <- d2$Position-5 ; d2$high<- d2$Position+5

You might think that you could do somehting like :

d2$matched <- which(d1$Position >=d2$low & d2$high >= d1$Position)

.... but not really, so you need something a bit more involved.:

 d1$matched <- apply(d1, 1, function(p) 
                            which(p['Position'] >=d2[,'low'] & 
                                  d2[,'high'] >= p['Position'] & 
                                  p['Chromosome']==d2[,"Chromosome"]) )

Basically this is checking on each row of d1 whether there is a potential match in d2 in range and on hte same chromosome:

 d1        # take a look
 # Then bind matching cases together
 cbind( d1[ which(d1$matched > 0), ], 
        d2[ unlist(d1$matched[which(d1$matched>0)]), ] )
#--------------------
    SNP Position Chromosome matched SNP Position Chromosome   low  high
1   rs1    10010          1       1 rsA    10015          1 10010 10020
2   rs2    10020          1       1 rsA    10015          1 10010 10020
11 rs11    10030          3       2 rsB    10035          3 10030 10040
12 rs12    10040          3       2 rsB    10035          3 10030 10040

Using R to select data based on another dataset

2 Answers2

Linked