1

I have two data frames like this.

 data1=data.frame(begin=c(1,100,50,1000), end=c(100,289,400,2000), type=c(1,1,2,3),Name=c("A","B","C","D"),ID=c("EN1","EN7","EN98","EN63"))

 data2=data.frame(type=c(1,2,2,4), pos=c(98,201,256,230474), Count=c(12,15,2,30))

I need to merge the data frame like this.

 begin end  type  Name ID  Count
   1  100    1     A   EN1  12
 100  289    1     B   EN7  0
  50  400    2     C   EN98 17
1000 2000    3     D   EN63 0

The count is based on the range's added values in each position.

with sqldf I have tried like this, but it does not consider multiple columns for merge

sqldf("select * from data2 left join data1 
            on (data2 == data2.begin and data2.end) 
ZenMac
  • 249
  • 1
  • 7

2 Answers2

1

You can use a non-equi join in data.table to join on type, and where pos is between begin and end

  1. Load library and set your frames to data.table
library(data.table)
setDT(data1)
setDT(data2)
  1. Do the non-equi join, leverage by = .EACHI, to do the sum of count.
data2[
  data1,
  on=.(type, pos>=begin, pos<=end),
  .(begin, end,type, Name, ID, Count = sum(Count)),
  by=.EACHI
][, !c("type", "pos", "pos")]

Output:

   begin  end Name   ID Count
1:     1  100    A  EN1    12
2:   100  289    B  EN7    NA
3:    50  400    C EN98    17
4:  1000 2000    D EN63    NA

You don't have to do a non-equi join; For example joins in dplyr dont' allow this, but you can still do a join on type, and then filter the condition of pos being between begin and end:

library(dplyr)
left_join(
  data1, 
  left_join(data1, data2, by="type") %>% 
    rowwise() %>% 
    filter(between(pos,begin,end)) %>% 
    group_by(ID) %>% 
    summarize(sum(Count)), 
  by="ID"
)

Output is the same as above.

In both cases, you can replace the NA in Count, with 0, if you like.

langtang
  • 22,248
  • 1
  • 12
  • 27
1

You could use fuzzyjoin:

library(dplyr)
library(fuzzyjoin)

fuzzy_left_join(data1, data2,
                by = c('type', 'begin' = 'pos', 'end' = 'pos'),
                match_fun = list(`==`, `<`, `>`)) %>%
  count(across(-c(pos, Count, type.y)), wt = Count)

#   begin  end type.x Name   ID  n
# 1     1  100      1    A  EN1 12
# 2    50  400      2    C EN98 17
# 3   100  289      1    B  EN7  0
# 4  1000 2000      3    D EN63  0
Darren Tsai
  • 32,117
  • 5
  • 21
  • 51