
I have a data frame (data) that contains a date/time variable and some other variables. What I want is a new data frame where each row of the "old" df is replaced by a summary (e.g., the mean) of all instances that fall into the past 15 minutes.

I tackled this with the following code (I shortened the example to a single variable; in reality I have about 26):

#### SEE EDIT ! ###

library(lubridate)

# Window width in minutes
xmin <- 15

# Make a reference df to start rbind later
chunks <- data.frame("unix_timestamp" = as.POSIXct("2018-12-01 08:47:00 CET"),
                     "Var1" = NA)

# Loop over each row in data
for (i in 1:nrow(data)) {

  # Help data frame: all observations within the past xmin minutes of row i
  help <- data[as.POSIXct(data[,1]) > (as.POSIXct(data[i,1]) - minutes(xmin)) &
                 as.POSIXct(data[,1]) <= as.POSIXct(data[i,1]), ]

  chunk <- data.frame("unix_timestamp" = as.POSIXct(data[i,1]),
                      "Var1" = mean(help$Var1))
  chunks <- rbind(chunks, chunk)
}

# Delete initial row
chunks <- chunks[-1,]

I'm satisfied with the output, and with a data frame of ~500 observations the speed is okay. However, some of my data sets have 60,000 rows, and there this runs almost forever.

I know others have had a similar problem, such as here, but unfortunately I was not able to implement their solution!

I appreciate any help!

Best!

EDIT:


library(lubridate)

data <- data.frame("unix_timestamp" = c("2015-05-01 14:12:57", 
                                        "2015-05-01 14:14:57", 
                                        "2015-05-01 14:15:57", 
                                        "2015-05-01 14:42:57", 
                                        "2015-05-01 14:52:57"), 
                   "Var1" = c(2,3,4,2,1),
                   "Var2" = c(0.53,0.3,0.34,0.12,0.91),
                   "Var3" = c(1,1,1,1,1))
pre <- vector("list", nrow(data))
data

for (i in 1:length(pre)) {
  #to see progress
  print(paste(i, "of", nrow(data), sep = " "))

  help <- data[as.POSIXct(data[,1]) > (as.POSIXct(data[i,1]) - minutes(15)) & 
                 as.POSIXct(data[,1]) <= as.POSIXct(data[i,1]),] # Help data frame with time frame selection


  chunk <- data.frame("unix_timestamp" = as.POSIXct(data[i,1]), 
                      "Var1" = mean(help$Var1),
                      "Var2" = mean(help$Var2),
                      "Var3" = sum(help$Var3))
  pre[[i]] <- chunk
}

output <- do.call(rbind, pre)
output

       unix_timestamp Var1  Var2 Var3
1 2015-05-01 14:12:57  2.0 0.530    1
2 2015-05-01 14:14:57  2.5 0.415    2
3 2015-05-01 14:15:57  3.0 0.390    3
4 2015-05-01 14:42:57  2.0 0.120    1
5 2015-05-01 14:52:57  1.5 0.515    2
  • Never ever ever `rbind` in a loop: pre-allocate a `list` as long as you need (`pre <- vector("list", nrow(data))`), assign to each index on each pass (`pre[[i]] <- chunk`), then combine *once* at the end (`chunks <- do.call(rbind, pre)`). Or use `eachchunk <- lapply(seq_len(nrow(data)), function(i) { ... })` and combine with `do.call(rbind, eachchunk)`. – r2evans Aug 10 '19 at 21:41
  • Why? On each and every call to `rbind`, it is copying all of the data. Every time. When it's only a few rows, this might be relatively negligible, but as they grow, more and more data needs to be copied. – r2evans Aug 10 '19 at 21:41
  • please give some example data, and the expected output, see [mcve] – denis Aug 10 '19 at 21:54
  • Thank you! I guess with the list approach it is faster now. However, I'm wondering whether the approach with the help data frame is the best solution or if I can tune the function even more? – Marl Aug 10 '19 at 22:08
  • Okay, well. I tested with a data frame of 3,000 rows and it was okay. When I run this code on a set with 20,000 rows, it needs about one (!) second per row! Does anyone have a tip for how to increase the performance? – Marl Aug 10 '19 at 22:18
  • Look into the tsibble package it solves this problem very elegantly – Bruno Aug 10 '19 at 22:19
  • `minutes` is not part of base R. Please include all `library` lines. – Parfait Aug 10 '19 at 22:19
  • Thank you, and sorry for my formal mistakes. I updated my code (EDIT) again. Unfortunately, the list approach only works for data with < 5,000 rows, but I have frames with > 20,000 or even 60,000 rows. I also looked at the tsibble package, but I cannot see how it improves my solution - in fact, I didn't even manage to implement a function from that package. Does anyone have any other advice? – Marl Aug 11 '19 at 07:36
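For the larger data sets, the expensive part is the per-row subsetting of the whole data frame inside the loop. As a minimal sketch of an alternative, assuming the slider package (not mentioned in the thread) and timestamps sorted in ascending order, the same 15-minute window can be computed with an index-based rolling function:

library(slider)     # assumption: slider is available; it is not part of the original thread
library(lubridate)

ts <- as.POSIXct(data$unix_timestamp)   # slide_index_*() requires an ascending index

output <- data.frame(
  unix_timestamp = ts,
  Var1 = slide_index_dbl(data$Var1, ts, mean, .before = minutes(15)),
  Var2 = slide_index_dbl(data$Var2, ts, mean, .before = minutes(15)),
  Var3 = slide_index_dbl(data$Var3, ts, sum,  .before = minutes(15))
)
# Note: .before = minutes(15) includes rows that are exactly 15 minutes old,
# while the loop above uses a strict ">" comparison, so boundary rows can differ.

The window logic then runs in compiled code instead of repeated data-frame subsetting in R, which is likely what dominates at 20,000+ rows.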

1 Answer


Since you appear to need a rolling mean, consider sapply:

new_df <- data.frame(
            unix_timestamp = as.POSIXct(data[,1]),
            rolling_15_mean = sapply(1:nrow(data),
                                     function(i) mean(data$Var1[as.POSIXct(data[,1]) > (as.POSIXct(data[i,1]) - 60*15) &
                                                                as.POSIXct(data[,1]) <= as.POSIXct(data[i,1])]))
          )

Rextester demo

Alternatively, use vapply for slightly faster processing, since you define the type and length of the output vector.

new_df <- data.frame(
            unix_timestamp = as.POSIXct(data[,1]),
            rolling_15_mean = vapply(1:nrow(data),
                                     function(i) mean(data$Var1[as.POSIXct(data[,1]) > (as.POSIXct(data[i,1]) - 60*15) &
                                                                as.POSIXct(data[,1]) <= as.POSIXct(data[i,1])]),
                                     numeric(1))
          )
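A smaller, independent tweak, sketched here under the assumption that the repeated parsing is a noticeable cost: as.POSIXct(data[,1]) is re-evaluated on every iteration of the anonymous function, so converting the timestamp column once outside the vapply call avoids that repeated work (Var1 from the question's example data is used for illustration).

ts <- as.POSIXct(data[, 1])   # parse the timestamps once, not inside every iteration

new_df <- data.frame(
  unix_timestamp = ts,
  rolling_15_mean = vapply(seq_len(nrow(data)),
                           function(i) mean(data$Var1[ts > ts[i] - 60*15 & ts <= ts[i]]),
                           numeric(1))
)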
  • Thank you. The output is correct, but it takes about as much time as the earlier list approach (and I cannot see the progress for my df with more than 20,000 rows). I assume this is because I now have 26 individual sapply calls inside the data.frame() call, so the "help" selection is computed separately for each column. Or am I wrong, and is there a different, more efficient way where the "help" df is calculated just once? – Marl Aug 11 '19 at 07:18
  • At the end of the day, `sapply` is a loop. This solution just streamlines your process and avoids the bookkeeping of an intermediate list and data-frame binding. I added a `vapply` alternative which may help slightly. But 20,000 look-backs with 15-minute incremental filters is a computational challenge. Adding `print` will add to processing time. – Parfait Aug 11 '19 at 13:43
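Regarding the question in the comments about computing the "help" selection only once per row: a rough sketch (using the three columns from the example data; the same pattern extends to 26 variables) that builds each window a single time and aggregates all columns from it:

ts  <- as.POSIXct(data[, 1])

agg <- t(sapply(seq_len(nrow(data)), function(i) {
  win <- data[ts > ts[i] - 60*15 & ts <= ts[i], , drop = FALSE]   # one window lookup per row
  c(Var1 = mean(win$Var1), Var2 = mean(win$Var2), Var3 = sum(win$Var3))
}))

new_df <- data.frame(unix_timestamp = ts, agg)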