
I have a data set with values every minute and I want to calculate the average value for every hour. I have tried using group_by(), filter() and summarise() from the dplyr package to reduce the data to one value per hour. With these functions alone I am able to get the mean value for every hour, but only per month, and I want it for each day.

> head(DF)
         datetime        pw        cu          year m  d hr min
1 2017-08-18 14:56:00 0.0630341 1.94065        2017 8 18 14  53
2 2017-08-18 14:57:00 0.0604653 1.86771        2017 8 18 14  57
3 2017-08-18 14:58:00 0.0601318 1.86596        2017 8 18 14  58
4 2017-08-18 14:59:00 0.0599276 1.83761        2017 8 18 14  59
5 2017-08-18 15:00:00 0.0598998 1.84177        2017 8 18 15   0

I had to use a for loop to reduce my table; I wrote the following to do it:

# initialise the hour boundary; the output starts at 2018-01-01 00:00:00
start <- as.POSIXct("2018-01-01 00:00:00")

datetime <- c()
eg_bf <- c()

for(i in 1:8760){                # 8760 hours in a year
    hour = start + 3600          # end of the current one-hour window

    # mean of pw over the rows that fall inside this hour
    hourly = DF %>% 
      filter(datetime >= start & datetime < hour) %>% 
      summarise(eg = mean(pw))

    datetime = append(datetime, start)
    eg_bf = append(eg_bf, hourly$eg)

    start = hour
    }
new_DF = data.frame(datetime, eg_bf)

So I was able to get my new data set with the mean value for every hour of the year:

  datetime             eg_bf
1 2018-01-01 00:00:00  0.025
2 2018-01-01 01:00:00  0.003
3 2018-01-01 02:00:00  0.002
4 2018-01-01 03:00:00  0.010
5 2018-01-01 04:00:00  0.015

The problem I'm facing is that it takes a lot of time to run. The idea is to add this calculation to a Shiny UI, so every time an input changes the calculation has to rerun quickly. Any idea how to speed it up?

HuHu

4 Answers


You can try this: use make_datetime() from the lubridate package to build a new date_time column from the year, month, day and hour columns of your dataset, then group and summarise on the new column.

library(dplyr)
library(lubridate)

df %>% 
  mutate(date_time = make_datetime(year, m, d, hr)) %>%  # one timestamp per hour
  group_by(date_time) %>% 
  summarise(eg_bf = mean(pw))
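
If your datetime column is already a proper POSIXct, a variant of the same idea (a sketch, assuming lubridate is loaded) is to truncate it to the hour with floor_date(), so the helper columns are not needed at all:

df %>% 
  mutate(date_time = floor_date(datetime, unit = "hour")) %>%  # 14:56:00 -> 14:00:00
  group_by(date_time) %>% 
  summarise(eg_bf = mean(pw))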
Adam Gruer

@Adam Gruer's answer provides a nice solution for the date variable that should solve your question. The calculation of the mean per hour does work with just dplyr, though:

df %>%
  group_by(year, m, d, hr) %>%
  summarise(test = mean(pw))

# A tibble: 2 x 5
# Groups:   year, m, d [?]
   year     m     d    hr   test
  <int> <int> <int> <int>  <dbl>
1  2017     8    18    14 0.0609
2  2017     8    18    15 0.0599
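
If you also want a single timestamp column in the result, like the one your loop produced, here is a sketch combining this with @Adam Gruer's make_datetime() (assuming lubridate is loaded and the column names from your data):

df %>%
  group_by(year, m, d, hr) %>%
  summarise(eg_bf = mean(pw)) %>%
  ungroup() %>%                                    # drop the grouping before mutating
  mutate(datetime = make_datetime(year, m, d, hr))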

You said in your question:

When I use only these functions I am able to get the mean value for every hour but only every month and I want it for each day.

What did you do differently?

LAP
  • Yes, the problem was that I used filter() for the year and then grouped by month or hr; I was not including m, d and hr in group_by() – HuHu Feb 11 '19 at 12:14

Even if you've found your answer, I believe this is worth mentioning:

If you're working with a lot of data and speed is an issue, you might want to see if you can use data.table instead of dplyr.

You can see with a simple benchmark how much faster data.table is:

library(dplyr)
library(lubridate)
library(data.table)
library(microbenchmark)
set.seed(123)

# dummy data, one year, one entry per minute
# first as data frame
DF <- data.frame(datetime = seq(as.POSIXct("2018-01-01 00:00:00"), 
                                as.POSIXct("2019-01-02 00:00:00"), 60),
                 pw = runif(527041)) %>% 
  mutate(year = year(datetime), m=month(datetime), 
         d=day(datetime), hour = hour(datetime))

# save it as a data.table
dt <- as.data.table(DF)

# transformation with dplyr
f_dplyr <- function(){
  DF %>% 
    group_by(year, m, d, hour) %>% 
    summarize(eg_bf = mean(pw))
}


# transformation with data.table (the mean column is named V1 by default)
f_datatable <- function() {
  dt[, mean(pw), by=.(year, m, d, hour)]
}

# benchmarking
microbenchmark(f_dplyr(), f_datatable())

# 
# Unit: milliseconds
#          expr       min        lq     mean   median       uq      max neval cld
#     f_dplyr() 41.240235 44.075019 46.85497 45.64998 47.95968 76.73714   100   b
# f_datatable()  9.081295  9.712694 12.53998 10.55697 11.33933 41.85217   100  a
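
As an aside, the same grouping can be done directly on the truncated timestamp, so the year/m/d/hour helper columns are not needed at all. A sketch, assuming lubridate is loaded for floor_date():

dt[, .(eg_bf = mean(pw)), by = .(hour_start = floor_date(datetime, "hour"))]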

Check out this post, it tells a lot: data.table vs dplyr: can one do something well the other can't or does poorly?

DS_UNI

As I understand it, you have a data frame of 365 * 24 * 60 rows. The code below returns the result almost instantly. The outcome is mean(pw) grouped by every hour of the year.

remove(list = ls())

library(dplyr)
library(lubridate)
library(tibble)

date_time <- seq.POSIXt(
    as.POSIXct("2018-01-01"),
    as.POSIXct("2019-01-01"),
    by = "1 min"
)

n <- length(date_time)

data <- tibble(
    date_time = date_time,
    pw = runif(n),
    cu = runif(n),
    ye = year(date_time),
    mo = month(date_time),
    da = day(date_time),
    hr = hour(date_time)
)

grouped <- data %>% 
    group_by(
        ye, mo, da, hr
    ) %>% 
    summarise(
        mean_pw = mean(pw)
    )
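
To check how fast this runs on your own machine, you can wrap the pipeline in base R's system.time():

system.time({
    grouped <- data %>% 
        group_by(ye, mo, da, hr) %>% 
        summarise(mean_pw = mean(pw))
})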