I am trying to subset a data.table by two conditions stored in the same data.table, that are found by a single key.
In practice, I am trying to merge overlapping ranges.
I know how to do this:
dt[, max := max(localRange), by=someGroup]
However, I want to use a range as selectors in i
. So something like:
dt[range > min(localRange) & range < max(localRange),
max := max(localRange),
by = someGroup]
where range and finalRange are the same column, just range
is outside of the scope of .SD
.
Or something like:
dt[col2 > dt[,min(col2),by = col1] & col2 < dt[,max(col2),by = col1],
col2 := max(col2)]
where the two by=
's synchronise/share the same col1 value
I have tried it with a for loop using set()
, iterating over a list of the min and max range as conditions to the data.table. The list I made using split()
on a data.table table:
for (range in split(
dt[,
list(min = min(rightBound),max = max(rightBound)),
by = leftBound
],
f = 1:nrow(dt[,.GRP,by = leftBound])
)
){
set(
x = dt,
i = dt[rightBound >= range$min & rightBound <= range$max]
j = range$max
)
}
It all became a mess (even errors), though I assume that this could be a (syntacticly) fairly straightforward operation. Moreover, this is only a case in which there is a single step, getting the conditions associated with the by=
group.
What if I would want to adjust values based on a series of transformations on a value in by=
based on data in the data.table outside of .SD
? For example: "by each start, select the range of ends, and based on that range find a range of starts", etc.
Here it does not matter at all that we are talking about ranges, as I think this is generally useful functionality.
In case anybody is wondering about the practical case, user971102 provides fine sample data for a simple case:
my.df<- data.frame(name=c("a","b","c","d","e","f","g"), leftBound=as.numeric(c(0,70001,70203,70060, 40004, 50000872, 50000872)), rightBound=as.numeric(c(71200,71200,80001,71051, 42004, 50000890, 51000952)))
dt = as.data.table(my.df)
name leftBound rightBound
a 0 71200
b 70001 71200
c 70203 80001
d 70060 71051
e 40004 42004
f 50000872 50000890
g 50000872 51000952
Edit: The IRanges package is going to solve my practical problem. However, I am still very curious to learn a possible solution to the more abstract case of 'chaining' selectors in data.tables
Thanks a bunch Jeremycg and AGstudy. Though it's not the findOverlaps()
function, but the reduce()
and disjoin()
functions.