How to get the rows that one column is same and another column is similar between dataframe?

Question

I have many large dataframes. I want to get the first column is same and second column's difference is less than 5000 between two dataframes. such as:

>a
chr  pos
chr2 10000
chr2 20000
chr2 45000
chr2 60000
chr2 80000
chr2 100000

>b
chr  pos
chr2 10000
chr2 30000
chr2 40000
chr2 55000
chr2 80000

my expected result:

>c
chr  pos
chr2 10000
chr2 45000
chr2 60000
chr2 80000

I tried by this:

c<-data.frame(chr=0, pos=0)
for (i in 1:nrow(b)) {
    c1<-a[(a$chr %in% b[i, 1]) & abs(a$pos-b[i, 2])<=5000, ]
    c<-rbind(c, c1)
  }
c<-c[-1, ]

But it's too slow and bad effiency. I hope to get a better way. Thanks advance!

IceCreamToucan · Accepted Answer · 2018-06-11T15:36:47.117

2

library(data.table)

Create a and b

a <- fread("
chr  pos
chr2 10000
chr2 20000
chr2 45000
chr2 60000
chr2 80000
chr2 100000
", data.table = F)

b <- fread("
chr  pos
chr2 10000
chr2 30000
chr2 40000
chr2 55000
chr2 80000
", data.table = F)

setDT(a)
setDT(b)

@Jaap method, no extra columns needed

a[pos %inrange% b[, .(low = pos - 5000, high = pos + 5000)] & match(chr, b$chr)]
#     chr   pos
# 1: chr2 10000
# 2: chr2 45000
# 3: chr2 60000
# 4: chr2 80000

Join method, requiring extra columns to be added (same output)

Add bounds for considering pos "similar"

a[, `:=`(low = pos - 5000, high = pos + 5000)]

Join b with a according to these bounds.

b[a, .(chr, pos), on = .(pos >= low, pos <= high, chr = chr), nomatch = 0]

SQL method, no extra columns needed (same output)

library(sqldf)

sqldf("
select  a.*
from    a
        inner join b 
          on  a.chr = b.chr
              and b.pos between a.pos - 5000 and a.pos + 5000
")

edited Jun 11 '18 at 15:36

answered Jun 11 '18 at 14:57

IceCreamToucan

28,083
2
22
38

Another possibility with `data.table`: `a[pos %inrange% b[, .(low = pos - 5000, high = pos + 5000)] & match(chr, b$chr)]` – Jaap Jun 11 '18 at 15:15
@Jaap You're right about the `sqldf` output, I've fixed that now. And yeah, that looks like a better `data.table` option. Feel free to add as an answer. – IceCreamToucan Jun 11 '18 at 15:25
Not sure whether it is actually better. Feel free to include it, it is heavily inspired on a [previous answer](https://stackoverflow.com/a/44473835/2204410) I posted. – Jaap Jun 11 '18 at 15:34

How to get the rows that one column is same and another column is similar between dataframe?

1 Answers1