0

I am trying to subset a data.table by two conditions stored in the same data.table, that are found by a single key.

In practice, I am trying to merge overlapping ranges.

I know how to do this:

 dt[, max := max(localRange), by=someGroup]

However, I want to use a range as selectors in i. So something like:

dt[range > min(localRange) & range < max(localRange), 
   max := max(localRange), 
   by = someGroup]

where range and finalRange are the same column, just range is outside of the scope of .SD.

Or something like:

dt[col2 > dt[,min(col2),by = col1] & col2 < dt[,max(col2),by = col1],
   col2 := max(col2)]

where the two by='s synchronise/share the same col1 value

I have tried it with a for loop using set(), iterating over a list of the min and max range as conditions to the data.table. The list I made using split() on a data.table table:

for (range in split(
                    dt[, 
                         list(min = min(rightBound),max = max(rightBound)),
                         by = leftBound
                        ], 
                    f = 1:nrow(dt[,.GRP,by = leftBound])
                    )
     ){
      set(
            x = dt,
            i = dt[rightBound >= range$min & rightBound <= range$max]
            j = range$max
        )
      }

It all became a mess (even errors), though I assume that this could be a (syntacticly) fairly straightforward operation. Moreover, this is only a case in which there is a single step, getting the conditions associated with the by= group. What if I would want to adjust values based on a series of transformations on a value in by= based on data in the data.table outside of .SD? For example: "by each start, select the range of ends, and based on that range find a range of starts", etc.

Here it does not matter at all that we are talking about ranges, as I think this is generally useful functionality.

In case anybody is wondering about the practical case, user971102 provides fine sample data for a simple case:

my.df<- data.frame(name=c("a","b","c","d","e","f","g"), leftBound=as.numeric(c(0,70001,70203,70060, 40004, 50000872, 50000872)), rightBound=as.numeric(c(71200,71200,80001,71051, 42004, 50000890, 51000952)))
dt = as.data.table(my.df)


name  leftBound   rightBound
a             0        71200
b         70001        71200
c         70203        80001
d         70060        71051
e         40004        42004
f      50000872     50000890
g      50000872     51000952

Edit: The IRanges package is going to solve my practical problem. However, I am still very curious to learn a possible solution to the more abstract case of 'chaining' selectors in data.tables

Thanks a bunch Jeremycg and AGstudy. Though it's not the findOverlaps() function, but the reduce() and disjoin() functions.

Stendert
  • 97
  • 7
  • 2
    you want `data.table::foverlaps` or `IRanges::findOverlaps`. share some data for an exact implementation – jeremycg Nov 18 '15 at 14:12
  • Possible duplicate of [collapse intersecting regions in R](http://stackoverflow.com/questions/16957293/collapse-intersecting-regions-in-r) – jeremycg Nov 18 '15 at 14:18
  • Keep your post minimal, show your data, briefly what you've done, and the expected out. There's just too much to read ATM. – Arun Nov 18 '15 at 14:38
  • Okay, thanks. I will edit it. First I will check the functionality recommended above. – Stendert Nov 18 '15 at 14:38
  • rewritten, please reread :) thanks – Stendert Nov 18 '15 at 16:59

0 Answers0