Finding corresponding row in a dataframe based on the value falling within a range columnA:columnB

Question

I have a data.frame and a vector like:

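df = data.frame(id = 1:3,
                start = c(1, 1000, 16000),
                end = c(100, 1100, 16100),
                info = c("a", "b", "c"))

# vec is a data.frame so that vec$pos and vec$id work; each id matches a row of df
vec = data.frame(id = rep(1:3, each = 50),
                 pos = c(sample(1:100, 50),
                         sample(1000:1100, 50),
                         sample(16000:16100, 50)))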

For every value of vec I want to find the corresponding row in df where:

vec$pos >= df$start
vec$pos <= df$end
vec$id == df$id

So I can extract the info column.

The problem is that df is 1000 rows long and vec is 2 million values long. Therefore looping over vec using sapply is slow. Can anyone do it by looping over df instead?

pogibas · Accepted Answer · 2019-02-20T14:02:03.677

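You can turn each value of vec into a zero-width interval (start = end = value) and use data.table::foverlaps to find the df rows whose [start, end] range contains it.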

library(data.table)

# Make df a data.table and set key
setDT(df)
setkey(df, start, end)

# Turn vector into a data.table with start and end
vec <- data.table(start = vec, end = vec)
setkey(vec, start, end)

# Apply overlaps for each vec entry
# This will get only those vec entries that overlap with df
foverlaps(vec, df, nomatch = NULL)

# Or if you want only info and vec column use:
foverlaps(vec, df, mult = "first", nomatch = NULL)[, .(info, vec = i.start)]

I tested it on dummy data (same dimensions as the OP's: 1,000 intervals and 2 million values) and it takes about 4 seconds per run.

df <- data.table(start = sample(1:1e7, 1e3),
                 info  = sample(letters, 1e3, replace = TRUE))
df$end <- df$start + 10
setkey(df, start, end)

vec <- sample(2e6)
vec <- data.table(start = vec, end = vec)
setkey(vec, start, end)

microbenchmark::microbenchmark(
    foverlaps(vec, df, mult = "first", nomatch = NULL)
)

# Unit: seconds
#                                               expr      min       lq     mean   median       uq     max neval
# foverlaps(vec, df, mult = "first", nomatch = NULL) 4.255962 4.274029 4.304148 4.294534 4.329679 4.45406   100
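As a follow-up to the calls above, here is a minimal sketch of how the result can be kept aligned with the input, so that every value gets an answer, including values that fall in no interval. It relies on `nomatch = NA` (the `foverlaps` default) instead of `nomatch = NULL`. The `pos` values and the names `pos` and `out` are made up for illustration; the intervals are the ones from the question.

```r
library(data.table)

# Intervals from the question, keyed on the interval columns
df <- data.table(start = c(1, 1000, 16000),
                 end   = c(100, 1100, 16100),
                 info  = c("a", "b", "c"))
setkey(df, start, end)

# Hypothetical positions; 500 falls in no interval
pos <- c(50, 500, 1050, 16100)
vec <- data.table(start = pos, end = pos)

# mult = "first" gives one result row per position;
# nomatch = NA keeps unmatched positions with info = NA
res <- foverlaps(vec, df, mult = "first", nomatch = NA)
out <- res[, .(pos = i.start, info)]
print(out)
```

With `mult = "first"` the result should have one row per input position, so `out$info` can be used directly as a lookup column alongside the original values.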

Thanks for the answer, it works great, but I am having trouble generalizing it. I now want to match not just on overlaps between start and end but also on an ID column. I have updated my question to show you the structure. Do you know how to do this? – Adam Waring, Feb 22 '19 at 11:02
@AdamWaring, just add `id` to `setkey`: `setkey(df, id, start, end)`, `setkey(vec, id, start, end)` and it will work. Cheers and good luck – pogibas, Feb 22 '19 at 11:23
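The suggestion in the comment above can be sketched roughly as follows. This is only an illustration under assumptions: the intervals come from the question, while the contents of `vec` (and the column name `pos`) are made up here so that one position lies in another id's range.

```r
library(data.table)

# Intervals, one per id (taken from the question)
df <- data.table(id    = 1:3,
                 start = c(1, 1000, 16000),
                 end   = c(100, 1100, 16100),
                 info  = c("a", "b", "c"))

# Hypothetical positions; each one carries the id it must match.
# The last row lies inside id 2's range but is tagged with id 1.
vec <- data.table(id  = c(1L, 2L, 3L, 1L),
                  pos = c(50, 1050, 16050, 1050))

# Represent each position as a zero-width interval [pos, pos]
vec[, `:=`(start = pos, end = pos)]

# Grouping column first, interval columns last in the key
setkey(df, id, start, end)
setkey(vec, id, start, end)

# Overlaps are only considered between rows with equal id
res <- foverlaps(vec, df, nomatch = NULL)[, .(id, pos, info)]
print(res)
```

Because `id` is part of both keys, the position 1050 tagged with id 1 should be dropped rather than matched to the id 2 interval.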

score 1 · Answer 2 · answered Feb 20 '19 at 13:44

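# vec is a plain numeric vector of positions.
# Collect results in a separate vector so vec stays numeric.
info <- rep(NA_character_, length(vec))
for (x in seq_len(nrow(df))) {
  # Positions inside the x-th interval (bounds inclusive)
  i <- which(vec >= df$start[x] & vec <= df$end[x])
  info[i] <- as.character(df$info[x])
}

This loops over the rows of df rather than over vec and fills info with the matching df$info value for each position in vec (NA where no interval matches). Writing the result back into vec itself would coerce vec to character after the first match and break the later comparisons.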

boski

Thanks for your answer, works fine, however slower than accepted answer. – Adam Waring Feb 20 '19 at 14:05