4

So the situation is this: I basically have one data frame where it contains about 100,000 rows of data. I am interested in a particular column of data, POS, and I wanted to check if the value of POS is between two values of another data frame, Start and End, and keep track of how many instances of those are there.

E.g., in my first data frame, I have something like

ID POS  
A   20  
B   533  
C   600 

And in my other data frame, I have stuff like

START      END  
123        150  
489        552  
590        600  

I want to know how many items in POS are in any of the START-END ranges. So in this case, there's be 2 items. Also, if possible, can I get the IDs of the ones with POS between Start and End, too?

How can I go about doing that without having to use a nested for loop?

  • [Efficient way to filter one data frame by ranges in another](https://stackoverflow.com/questions/36454565/efficient-way-to-filter-one-data-frame-by-ranges-in-another); [roll join with start/end window](https://stackoverflow.com/questions/24480031/roll-join-with-start-end-window) – Henrik Mar 29 '18 at 07:58

4 Answers4

6

This is a fairly common problem which might happen in the context of a database. Here is a solution using sqldf:

library(sqldf)

query <- "SELECT POS, ID FROM df1 INNER JOIN df2 "
query <- paste0(query, "ON df1.POS BETWEEN df2.START AND df2.END")
sqldf(query)

If the ranges in your second data frame might overlap, then the above query could return more than one result for a given POS value. In this case, replace SELECT POS with SELECT DISTINCT POS.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • This is an interesting package, never seen this before! Is it possible to use this method and get the IDs whose POS values are in the range between Start and End? – Alex Johanssen Mar 29 '18 at 06:29
  • @AlexJohanssen Yes, just add `ID` to the select list. – Tim Biegeleisen Mar 29 '18 at 06:30
  • Sorry, just one more question. Let's say in the second df there's an ID field I want to include too. How can I go about doing that? – Alex Johanssen Mar 29 '18 at 06:41
  • Just add whatever you need to the `SELECT` list, e.g. `START` or `END`. If the two tables have column names which happen to be the same, then you'll have to update my query to use aliases. Not hard to do, but maybe you don't need it. – Tim Biegeleisen Mar 29 '18 at 06:43
6

We can use a non-equi join with data.table

library(data.table)
setDT(df1)[df2, on = .(POS > START, POS <= END)][, sum(!is.na(ID))]
#[1] 2
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks! Is it possible to use this method and get the IDs whose POS values are in the range between Start and End? – Alex Johanssen Mar 29 '18 at 06:26
  • @AlexJohanssen You just need `setDT(df1)[df2, on = .(POS > START, POS <= END)][!is.na(ID)]$ID#[1] "B" "C"` – akrun Mar 29 '18 at 06:27
1

We can achieve the same using mapply in base-R as:

df1[mapply(function(x)any(x >= df2$START & x <= df2$END),df1$POS),]
#  ID POS
#2  B 533
#3  C 600

Data

df1 <- read.table(text = 
"ID POS  
A   20  
B   533  
C   600", header = T)


df2 <- read.table(text = 
"START      END  
123        150  
489        552  
590        600", header = TRUE)
MKR
  • 19,739
  • 4
  • 23
  • 33
1

Data frame: main

ID POS  
A   20  
B   533  
C   600 

Data frame: ran

START   END  
123     150  
489     552  
590     600

A simple sapply should suffice your use case:

sapply(main$POS, function(x) { sum(x>=ran$START & x<=ran$END) })

will return:

[1] 0 1 1

You could bind this back to a new column in your main data frame:

main$Count <- sapply(main$POS, function(x) { sum(x>=ran$START & x<=ran$END) }))

  ID POS count
1  A  20     0
2  B 533     1
3  C 600     1

This should also work with overlapping ranges.

Deepak Rajendran
  • 358
  • 1
  • 11