
I have a big time series in one dataframe, `full`, and a list of timestamps in a different dataframe, `test`. I need to subset `full` to the data points surrounding the timestamps in `test`. My first instinct (as an R noob) was to write the below, which was wrong:

subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))

Looking at the result, I realized that R recycles the shorter vector and compares the two element-wise, which gives the wrong result.
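
A minimal sketch of the recycling, with toy values just to illustrate:

full_dt <- 1:6
test_dt <- c(2, 5)
# test_dt is silently recycled to c(2, 5, 2, 5, 2, 5), so each point in
# full_dt is compared against only one of the two test points
full_dt > (test_dt - 1.5) & full_dt < (test_dt + 1.5)
# TRUE FALSE TRUE TRUE FALSE TRUE
# (every point here is actually within 1.5 of a test point, so all six
#  should be TRUE; note the test points 2 and 5 themselves come out FALSE)

My fallback is to write a loop like the one below: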

subs<-data.frame()
for (j in test$dt) 
  subs <- rbind(subs,subset(full,full$dt>(j-i) & full$dt<(j+i)))

I feel that there might be a better way to do this; this article implores us to avoid R loops as much as possible. The other reason is that I might be hitting performance issues, as this would be at the heart of an optimization algorithm. Any suggestions from gurus would be greatly appreciated.

EDIT:

Here is some reproducible code that shows the wrong approach as well as the approach that works but could be better.

# create a time series
full <- data.frame(seq(1:200),rnorm(200,0,1))
colnames(full)<-c("dt","val")

#my smaller array of points of interest
test <- data.frame(seq(5,200,by=23))
colnames(test)<-c("dt")

# my range around the points of interest
i<-3 

#the wrong approach
subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))

#this works, but not sure this is the best way to go about it
subs<-data.frame()
for (j in test$dt) 
  subs <- rbind(subs,subset(full,full$dt>(j-i) & full$dt<(j+i)))

EDIT: I updated the values to better reflect my use case, and I see @mrdwab's solution pulling ahead unexpectedly, and by a wide margin.

I am using the benchmark code from @mrdwab, and the initialization is as follows:

set.seed(1)

full <- data.frame(
  dt  = 1:15000000,
  val = floor(rnorm(15000000, 0, 1))
)

test <- data.frame(dt = floor(runif(24, 1, 15000000)))

i <- 500

The benchmarks are:

       test replications elapsed relative
2    mrdwab            2    1.31  1.00000
3 spacedman            2   69.06 52.71756
1    andrie            2   93.68 71.51145
4  original            2  114.24 87.20611

Totally unexpected. Mind = blown. Can someone please shed some light on this dark corner and explain what is happening?

Important: As @mrdwab notes below, his solution works only if the vectors are integers. If not, @Spacedman has the right solution.
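
A minimal sketch, with made-up non-integer timestamps, of why the exact-match approach needs integers:

full_dbl <- data.frame(dt = c(4.5, 5.2, 6.7), val = 0)
# @mrdwab's approach expands each test point into a unit-step grid...
grid <- (5 - 2):(5 + 2)            # 3 4 5 6 7
# ...and non-integer timestamps never equal those grid values exactly
full_dbl[full_dbl$dt %in% grid, ]  # zero rows, even though 5.2 is in range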

Generalenthu
  • Welcome to Stack Overflow! If you made a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that demonstrates your question / problem, we would find it easier to answer. – Andrie Aug 27 '12 at 06:39
  • You need to clarify what sort of test needs to be done to determine "surrounding". Will there be expected overlaps in the rows selected, as might happen with a moving average? Or a fixed distance in time with variable numbers identified in each group? Or ... what? – IRTFM Aug 27 '12 at 06:42
  • I am creating the reproducible example. Thanks for the comments; editing it shortly. – Generalenthu Aug 27 '12 at 06:52
  • @Andrie added the reproducible code. – Generalenthu Aug 27 '12 at 07:09
  • @Dwin I just need to filter out the observations that are before and after the point of interest, within a certain timeframe. – Generalenthu Aug 27 '12 at 07:11
  • +1 for the (somewhat tortuous) reproducible example. See my answer for tips on how to make it **minimal** as well as reproducible. – Andrie Aug 27 '12 at 07:28
  • You should *unaccept* my answer and reaccept @Spacedman's. The reason my solution is so fast is that there's nothing to match to! My solution (as it is) matches exact numbers, which is what I thought you were looking for from your original post. But this update shows that since you're using `runif`, my solution won't work *as it is*. – A5C1D2H2I1M1N2O1R2T1 Aug 29 '12 at 04:47
  • My routine was running in a pinch and was giving the right results. Turns out, it was sloppy initialization on my part for the benchmark. With integers, your solution is much more efficient. I also added a note about @spacedman's solution for the alternate case. – Generalenthu Aug 29 '12 at 05:31
  • @mrdwab, just to be sure I am not doing something really dumb with my benchmark, can you quickly check mine for accuracy, please? The speed is just way off the charts. – Generalenthu Aug 29 '12 at 05:33
  • @Generalenthu, no, you're not doing anything wrong. The results are similar to what I get (though your computer is a bit faster than mine ;-) ). I can't explain *why* at the moment--I'll think about it and see what comes to mind. What *quickly* comes to mind is that the other solutions are sort of looping through your "test", whereas my solution creates a sort of modified vectorized test, and R likes vectors. – A5C1D2H2I1M1N2O1R2T1 Aug 29 '12 at 06:07
  • @Generalenthu, also, be sure to test the output of each using at least something like `all.equal(original, mrdwab)`. You will need to make sure that the sorting is the same in both sets, though. Spacedman's solution and mine sort the result by `dt`, so when comparing, you need to reorder: `all.equal(original[order(original$dt), ], mrdwab)` should give you `TRUE`. – A5C1D2H2I1M1N2O1R2T1 Aug 29 '12 at 06:12

4 Answers


Here's a real R way to do it. Functionally. No loops...

Starting with Andrie's example data.

First, an interval comparison function:

> cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
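
The `force` calls matter: without them, R's lazy evaluation would look up `l` and `u` only when the returned function is first called, by which time the variables used to build them may have changed. A minimal sketch, with a hypothetical `leaky` factory:

> leaky = function(l){function(x){x>l}}  # like cf, but without force()
> b = 5; f = leaky(b); b = 100
> f(10)  # the promise for l is only evaluated now, and b is 100
[1] FALSE
> b = 5; g = cf(b-1, b+1); b = 100
> g(5)   # cf forced its bounds (4 and 6) at creation time
[1] TRUE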

An OR composition function:

> OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}

Now there's sort of a loop here, to construct a list of those comparison functions:

> funs = mapply(cf,test$dt-i,test$dt+i)

Now combine all those into one function:

> anyF = Reduce(OR,funs)
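
`Reduce` folds the list pairwise from the left, so for three functions this builds `OR(OR(funs[[1]], funs[[2]]), funs[[3]])`. The same left fold with `+`, just to illustrate:

> Reduce(`+`, 1:4)  # ((1 + 2) + 3) + 4
[1] 10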

And now we apply the OR composition to our interval testing functions:

> head(full[anyF(full$dt),])
   dt         val
3   3 -0.83562861
4   4  1.59528080
5   5  0.32950777
6   6 -0.82046838
7   7  0.48742905
26 26 -0.05612874

What you've got now is a function of a single variable that tests if the value is in the ranges you defined.

> anyF(1:10)
 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

I don't know if this is faster, or better, or what. Someone do some benchmarks!

Spacedman
  • Looks like this really is more efficient. I am having trouble getting to the right answer in my algo, though; something I am missing for sure on my side. Will follow up tomorrow. Thanks! – Generalenthu Aug 27 '12 at 09:22
  • Update: This has been unbelievably fast: 22 seconds runtime on my dataset vs 90 seconds with my prior version. I am realizing that this is what R being a functional language is all about: an extreme amount of flexibility in crafting your functions and having them efficiently chew through your data. Thank you! Wish I could upvote multiple times. – Generalenthu Aug 28 '12 at 03:25
  • @Spacedman, [benchmarks done](http://stackoverflow.com/a/12137768/1270695) and yes, this was the fastest approach. +1 – A5C1D2H2I1M1N2O1R2T1 Aug 28 '12 at 06:36
  • Wow I'm surprised! There's another inefficiency still there though. When executing a chain of logical-OR conditions you can stop if any of them are TRUE - there's no need to keep computing the rest of the conditions. Because R works with vectors though, it will do all the OR ops on all the elements of the vectors. There may be a way of coding it so that each OR only works on the elements that haven't been TRUE yet, but then you get into a trade off that depends on how much of your data you expect to end up TRUE... Anyway, hand-coded C could do this and be even faster (I predict ~0.2s) – Spacedman Aug 28 '12 at 07:28
  • At least in my case it's not a big problem. My `full` is about 15e6 rows, and I am pulling about 20-30 subsets of about 500 elements into `test`. I am benchmarking with a single replication, and it's taking a while, probably because the algo from @mrdwab wouldn't work well for my case with really big subsets. Thanks! – Generalenthu Aug 28 '12 at 08:40

I don't know if it's any more efficient, but I would think you could also do something like this to get what you want:

subs <- apply(test, 1, function(x) c((x-2):(x+2)))
full[which(full$dt %in% subs), ]

I had to adjust your "3" to "2", since your strict inequalities `> (j-i)` and `< (j+i)` only capture values within ±2 of each point, and `x` itself is included either way. A version parameterized on `i` is sketched below.
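
A version parameterized on `i` (assuming integer `dt`, so the expanded grid matches exactly; this is the form used in the benchmark below):

subs <- apply(test, 1, function(x) (x - (i - 1)):(x + (i - 1)))
full[which(full$dt %in% subs), ]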

Benchmarking (just for fun)

@Spacedman leads the way!

First, the required data and functions.

## Data
set.seed(1)

full <- data.frame(
  dt  = 1:200,
  val = rnorm(200,0,1)
)

test <- data.frame(dt = seq(5,200,by=23))

i <- 3 

## Spacedman's functions
cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
funs = mapply(cf,test$dt-i,test$dt+i)
anyF = Reduce(OR,funs)

Second, the benchmarking.

## Benchmarking
require(rbenchmark)
benchmark(andrie = do.call(rbind, 
                           lapply(test$dt, 
                                  function(j) full[full$dt > (j-i) & 
                                    full$dt < (j+i), ])),
          mrdwab = {subs <- apply(test, 1, 
                                  function(x) c((x-(i-1)):(x+(i-1))))
                    full[which(full$dt %in% subs), ]},
          spacedman = full[anyF(full$dt),],
          original = {subs <- data.frame()
                      for (j in test$dt) 
                        subs <- rbind(subs, 
                                      subset(full, full$dt > (j-i) & 
                                        full$dt < (j+i)))},
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative")
#        test replications elapsed  relative
# 3 spacedman          100   0.064  1.000000
# 2    mrdwab          100   0.105  1.640625
# 1    andrie          100   0.520  8.125000
# 4  original          100   1.080 16.875000
A5C1D2H2I1M1N2O1R2T1

There is nothing inherently wrong with your code. To achieve your aim, you need a loop of some sort around a vectorised subset operation.

But here is a more R-ish way to do it, which might well be faster:

do.call(rbind, 
  lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)

PS: You can significantly simplify your reproducible example:

set.seed(1)

full <- data.frame(
  dt  = 1:200,
  val = rnorm(200,0,1)
)

test <- data.frame(dt = seq(5,200,by=23))

i <- 3 

xx <- do.call(rbind, 
  lapply(test$dt, function(j)full[full$dt > (j-i) & full$dt < (j+i), ])
)

head(xx)
   dt         val
3   3 -0.83562861
4   4  1.59528080
5   5  0.32950777
6   6 -0.82046838
7   7  0.48742905
26 26 -0.05612874
Andrie
  • Thanks @Andrie. This is my 3rd day with R as well as SO. I am not a programmer by day, but SO has been great. – Generalenthu Aug 27 '12 at 07:37
  • Depending on what you've programmed in before R, you might find my answer better. But I doubt it, unless you've come here from LISP or Haskell or INTERCAL... – Spacedman Aug 27 '12 at 08:16

One more way, using data.table: turn the interval lookups into keyed joins on a sorted table, then expand each start/end position into the run of rows between them:

library(data.table)

# one keyed (sorted) table of all candidate points plus the interval boundaries
temp <- data.table(x = unique(c(full$dt, test$dt - i, test$dt + i)), key = "x")
temp[, index := 1:nrow(temp)]
# row positions of each interval's boundaries, via keyed joins
startpoints <- temp[J(test$dt - i), index]$index
endpoints   <- temp[J(test$dt + i), index]$index
# expand each start/end pair into the full run of row positions
allpoints <- as.vector(mapply(FUN = function(x, y) x:y, x = startpoints, y = endpoints))
setkey(x = temp, index)
ans <- temp[J(allpoints)]$x

Benchmarks, with 9 rows in `test` and 10,000 rows in `full`:

       test replications elapsed relative
1 spacedman          100   0.406    1.000
2       new          100   1.179    2.904

With 100,000 rows in `full`:

       test replications elapsed relative
2       new          100   2.374    1.000
1 spacedman          100   3.753    1.581
Sainath Adapa