As this is my first time asking a question on SO, I apologize in advance for any improper formatting.

I am very new to R and am trying to create a function that will return the row value of a data frame column once a running total in another column has met or exceeded a given value (the row that the running sum begins in is also an argument).

For example, given the following data frame, if given a starting parameter of x=3 and stop parameter of y=17, the function should return 5 (the x value of the row where the sum of y >= 17).

X   Y
1   5
2   10
3   5
4   10
5   5
6   10
7   5
8   10
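
For anyone who wants to reproduce the example, here is a minimal sketch that builds the table above and shows the running total from the starting row. The name `df` is an assumption chosen for illustration (the question itself does not name the data frame).

```r
# Example data from the question; the name `df` is an assumption
df <- data.frame(X = 1:8, Y = rep(c(5, 10), times = 4))

# Running total of Y over the rows with X >= 3
running <- cumsum(df$Y[df$X >= 3])
running
# [1]  5 15 20 30 35 45

# The first running total that reaches 17 is 20, which sits on the row with X = 5
df$X[df$X >= 3][which(running >= 17)[1]]
# [1] 5
```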

The function as I've currently written it returns the correct answer, but I have to believe there is a much more 'R-ish' way to accomplish this, instead of using loops and incrementing temporary variables, and would like to learn the right way, rather than form bad habits that I will have to correct later.

A very simplified version of the function:

myFunction<-function(DataFrame,StartRow,Total){
    df<-DataFrame[DataFrame[[1]] >= StartRow,]
    i<-0
    j<-0

    while (j < Total) {
        i<-i+1
        j<-sum(df[[2]][1:i])
    }

    x<-df[[1]][i]
    return(x)
}
Mickäel A.
user3351605
  • I might say that using `while` or `break`ing a loop might indeed be helpful here, since you want the first occurrence of an event (especially with large vectors and early occurrences). You could also avoid computing `j` again and again and, instead, increment it in the loop. – alexis_laz Mar 07 '14 at 19:52
  • My solution below uses @alexis_laz's solution of breaking the loop, and the benchmarking does show it helps with large vectors and early occurrences. Since looping in R is inefficient, I used Rcpp for this computation. – josliber Mar 08 '14 at 01:42
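
The suggestion in the comments above (keep a running total and stop at the first hit) could look roughly like the following in plain R. This is only a sketch: the function name `first_reaching` and its argument names are hypothetical, and returning `NA` when the total is never reached is an assumption about the desired behavior.

```r
# Hypothetical helper: walk the rows once, adding to a running total,
# and stop as soon as the total is reached.
first_reaching <- function(x, y, start, total) {
  running <- 0
  for (k in seq_along(y)) {
    if (x[k] < start) next            # skip rows before the starting value
    running <- running + y[k]         # increment instead of re-summing
    if (running >= total) return(x[k])
  }
  NA                                  # assumption: NA when the total is never reached
}

df <- data.frame(X = 1:8, Y = rep(c(5, 10), times = 4))
first_reaching(df$X, df$Y, start = 3, total = 17)
# [1] 5
```

Unlike the original `while` loop, this version cannot run past the last row when the total is never reached.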

5 Answers


All the solutions posted so far compute the cumulative sum of the entire Y variable, which can be inefficient in cases where the data frame is really large but the index is near the beginning. In this case, a solution with Rcpp could be more efficient:

library(Rcpp)
get_min_cum2 = cppFunction("
int gmc2(NumericVector X, NumericVector Y, int start, int total) {
    double running = 0.0;
    for (int idx=0; idx < Y.size(); ++idx) {
        if (X[idx] >= start) {
            running += Y[idx];
            if (running >= total) {
                return X[idx];
            }
        }
    }
    return -1;  // Running total never exceeds limit
}")

Comparison with microbenchmark:

library(data.table)

get_min_cum <- 
 function(start,total) 
   with(dat[dat$X>=start,],X[min(which(cumsum(Y)>=total))])
get_min_dt <- function(start, total)
   dt[X >= start, X[cumsum(Y) >= total][1]]

set.seed(144)
dat = data.frame(X=1:1000000, Y=abs(rnorm(1000000)))
dt = data.table(dat)
get_min_cum(3, 17)
# [1] 29
get_min_dt(3, 17)
# [1] 29
get_min_cum2(dat$X, dat$Y, 3, 17)
# [1] 29

library(microbenchmark)
microbenchmark(get_min_cum(3, 17), get_min_dt(3, 17),
               get_min_cum2(dat$X, dat$Y, 3, 17))
# Unit: milliseconds
#                               expr        min         lq    median         uq      max neval
#                 get_min_cum(3, 17) 125.324976 170.052885 180.72279 193.986953 418.9554   100
#                  get_min_dt(3, 17) 100.990098 149.593250 162.24523 176.661079 399.7531   100
#  get_min_cum2(dat$X, dat$Y, 3, 17)   1.157059   1.646184   2.30323   4.628371 256.2487   100

In this case, the Rcpp solution is roughly 70 to 80 times faster at the median than the other two approaches (about 2.3 ms versus 162 to 181 ms).

josliber
  • +1! I guess this should be efficient nonetheless, since it "cumsum"s and "which"s at the same time – alexis_laz Mar 08 '14 at 01:55
  • @josilber After installing and loading the Rcpp package, I get the following error when trying to declare your function: Error in sourceCpp(code = code, env = env, rebuild = rebuild, showOutput = showOutput, : Error 1 occurred building shared library. WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding: Am I missing a step when working with Rcpp? I don't want to build a package just declare and use the function in an instance. – user3351605 Mar 11 '14 at 11:55
  • It looks like you might need to reboot to get the changes: http://stackoverflow.com/questions/17619185/rcpp-cant-find-rtools-error-1-occurred-building-shared-library – josliber Mar 11 '14 at 14:15
  • @josilber The link you provided led to the answer: I did not realize that Rtools was required for a function written with Rcpp to compile. – user3351605 Mar 11 '14 at 16:37

Try this, for example. It uses cumsum and vectorized logical subsetting, and assumes the data frame is named dat:

 get_min_cum <- 
 function(start,total) 
   with(dat[dat$X>=start,],X[min(which(cumsum(Y)>=total))])

 get_min_cum(3,17) 
 # [1] 5
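
A self-contained variant of the same cumsum idea is sketched below. It takes the data frame as an argument instead of reading a global, and it returns `NA` when the total is never reached; the function name `first_reaching_vec` and that `NA` convention are assumptions added for illustration, not part of the answer above.

```r
# Hypothetical vectorized helper based on cumsum() and which()
first_reaching_vec <- function(data, start, total) {
  keep <- data$X >= start                       # rows at or after the start value
  hit  <- which(cumsum(data$Y[keep]) >= total)  # positions where the total is reached
  if (length(hit) == 0) return(NA)              # assumption: NA when never reached
  data$X[keep][hit[1]]
}

df <- data.frame(X = 1:8, Y = rep(c(5, 10), times = 4))
first_reaching_vec(df, 3, 17)
# [1] 5
first_reaching_vec(df, 3, 15)   # running total hits 15 exactly on the row with X = 4
# [1] 4
```

The second call is where `>=` and `>` differ: with a strict `>` the exact hit at X = 4 would be skipped and 5 returned instead.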
agstudy

Here you go (using data.table because of ease of syntax):

library(data.table)
dt = data.table(df)

dt[X >= 3, X[cumsum(Y) >= 17][1]]
#[1] 5
eddi

Well, here's one way:

i <- 3
j <- 17
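min(df[i:nrow(df),]$X[cumsum(df$Y[i:nrow(df)])>=j])
# [1] 5

This takes df$X for rows i:nrow(df) and indexes it by cumsum(df$Y) >= j, with the cumulative sum also starting at row i. The result is every df$X for which the running total has reached j; min(...) then returns the smallest of those values. Note that i is used as a row index here, which matches the X value only because X equals the row number in this example.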

jlhoward
with(df, X[which( cumsum( (X>=3)*Y) >= 17)[1]] )
IRTFM