R merge based on the range in columns and add values

Question

I have two data frames like this.

 data1=data.frame(begin=c(1,100,50,1000), end=c(100,289,400,2000), type=c(1,1,2,3),Name=c("A","B","C","D"),ID=c("EN1","EN7","EN98","EN63"))

 data2=data.frame(type=c(1,2,2,4), pos=c(98,201,256,230474), Count=c(12,15,2,30))

I need to merge the data frame like this.

 begin end  type  Name ID  Count
   1  100    1     A   EN1  12
 100  289    1     B   EN7  0
  50  400    2     C   EN98 17
1000 2000    3     D   EN63 0

The count is based on the range's added values in each position.

with sqldf I have tried like this, but it does not consider multiple columns for merge

sqldf("select * from data2 left join data1 
            on (data2 == data2.begin and data2.end)

langtang · Answer 1 · 2022-09-11T16:04:56.910

You can use a non-equi join in data.table to join on type, and where pos is between begin and end

Load library and set your frames to data.table

library(data.table)
setDT(data1)
setDT(data2)

Do the non-equi join, leverage by = .EACHI, to do the sum of count.

data2[
  data1,
  on=.(type, pos>=begin, pos<=end),
  .(begin, end,type, Name, ID, Count = sum(Count)),
  by=.EACHI
][, !c("type", "pos", "pos")]

Output:

   begin  end Name   ID Count
1:     1  100    A  EN1    12
2:   100  289    B  EN7    NA
3:    50  400    C EN98    17
4:  1000 2000    D EN63    NA

You don't have to do a non-equi join; For example joins in dplyr dont' allow this, but you can still do a join on type, and then filter the condition of pos being between begin and end:

library(dplyr)
left_join(
  data1, 
  left_join(data1, data2, by="type") %>% 
    rowwise() %>% 
    filter(between(pos,begin,end)) %>% 
    group_by(ID) %>% 
    summarize(sum(Count)), 
  by="ID"
)

Output is the same as above.

In both cases, you can replace the NA in Count, with 0, if you like.

Darren Tsai · Accepted Answer · 2022-09-11T16:35:28.860

You could use fuzzyjoin:

library(dplyr)
library(fuzzyjoin)

fuzzy_left_join(data1, data2,
                by = c('type', 'begin' = 'pos', 'end' = 'pos'),
                match_fun = list(`==`, `<`, `>`)) %>%
  count(across(-c(pos, Count, type.y)), wt = Count)

#   begin  end type.x Name   ID  n
# 1     1  100      1    A  EN1 12
# 2    50  400      2    C EN98 17
# 3   100  289      1    B  EN7  0
# 4  1000 2000      3    D EN63  0

R merge based on the range in columns and add values

2 Answers2