
I have a dataset with the following structure: Month | Day | Hour | Minute | Value1 | Value2 | Value3

The dataset has a length of 525,600 rows. What I need is the mean over fifteen minutes for each value (value1, value2, value3). The output should have the following structure:

Month | Begin | End  | MeanValues1 | MeanValues2 | MeanValues3 
01    | 0:00  | 0:15 | 1.23        | 2.34        | 3.23 
01    | 0:15  | 0:30 | 1.76        | 3.02        | 3.24

Hence, the output dataset should have a length of 35,040 rows.

Can anybody help me with a lightweight way to do this in R? I don't know how to implement it efficiently. Moreover, it is not clear to me how to build the Begin and End columns in the output dataset.

I thank you in advance for any input.

Best

JohnDong
    Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Mar 20 '18 at 19:48
  • Are the delimiters in your input truly `|` or are they tabs/spaces? – jsta Mar 20 '18 at 20:01
  • The input comes from a csv file. I have only chosen | for my post. In R it is originally a dataframe. – JohnDong Mar 20 '18 at 20:06
    In the future, JohnDong, it is far more useful to include data in as consumable a format as possible. Popular formats are shown [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), and include `dput(head(x))`, `read.table(text='...')`, even `read.csv(text='a,b,c\n1,2,3')`. Heck, even raw-csv is easier than asking us to manually or programmatically parse and fix data. I often prefer `read.table(text='...')` because it allows the table to be aesthetically arranged while being easily used by us, but that is certainly not a requirement. – r2evans Mar 20 '18 at 20:09
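For reference, a small copy-pasteable sample in the read.table(text = '...') style mentioned above (all values invented here for illustration) could look like this:

# Hypothetical reproducible sample; values are made up, not the asker's data
sample_df <- read.table(text = "
Month Day Hour Minute Value1 Value2 Value3
1     1   0    0      1.23   2.34   3.23
1     1   0    1      1.30   2.40   3.10
1     1   0    2      1.28   2.31   3.05
", header = TRUE)

# dput() prints a structure() call that others can paste straight back into R
dput(head(sample_df))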

3 Answers

# Create some sample data
set.seed(8675309)
Month <- rep(1, 30)
Day <- rep(1, 30)
Hour <- rep(1, 30)
Minute <- seq(1, 30, 1)
value1 <- runif(30, 0, 100)
value2 <- runif(30, 10, 20)
df <- data.frame(Month, Day, Hour, Minute, value1, value2)

# Break into bins
df$bin <- cut(df$Minute, breaks = c(0,15,30,45,60))

# Aggregate
df$Minute<- NULL
newdf <- aggregate(.~Month+Day+Hour+bin, df, mean)
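A rough follow-on sketch (not part of the original answer): the Begin/End columns from the question could be derived from the bin index of the newdf produced by aggregate() above. Note that cut() with these breaks drops minute 0 and puts minute 15 into the first bin, so the breaks may need adjusting for the real data.

# Sketch: derive Begin/End clock times from the bin index (1..4 -> 0,15,30,45)
quarter_start <- (as.integer(newdf$bin) - 1) * 15
start_min <- newdf$Hour * 60 + quarter_start
end_min   <- start_min + 15
newdf$Begin <- sprintf("%d:%02d", start_min %/% 60, start_min %% 60)
newdf$End   <- sprintf("%d:%02d", end_min %/% 60, end_min %% 60)
head(newdf)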
pyll
  • If you prefer `dplyr` there is a very similar answer and question at: https://stackoverflow.com/questions/32078578/how-to-group-by-every-7-rows-and-aggregate-those-7-values-by-median – jsta Mar 20 '18 at 20:08
  • Thank you pyll. The break into bins is a very nice solution, but the aggregate command does not do what I want; the output is muddled. – JohnDong Mar 21 '18 at 07:41
  • Not sure what you mean by muddled...can you be more specific? – pyll Mar 21 '18 at 09:51

Using dplyr and lubridate you can round times down and up to quarter-hour boundaries, group by those, and compute means.

I use these libraries:

library(tibble)
library(lubridate)
library(dplyr)

The first is just there to make this test data:

test_tbl <- tribble(
    ~Month, ~Day, ~Hour, ~Minute, ~Value1, ~Value2, ~Value3,
         2,   15,    14,      11,       1,       1,       1,
         2,   15,    14,      12,       1,       1,       1,
         2,   15,    14,      16,       2,       2,       2,
         2,   15,    14,      19,       2,       2,       2,
)

The pipeline first generates a string that I can parse with lubridate to get a date-time (I don't know if there are smarter ways to do this; I haven't used lubridate that much). Then I compute the beginning and end of the quarter for each time, group by those, and compute the means.

test_tbl %>% 
    mutate(time_str = paste0(2018, ":", Month, ":", Day, " ", Hour, ":", Minute),
           time = ymd_hm(time_str),
           quarter_start = floor_date(time, "15 min"),
           quarter_end = ceiling_date(time, "15 min")) %>%
    select(-(Day:Minute), -time_str, -time) %>% # don't need these any more
    group_by(Month, quarter_start, quarter_end) %>%
    summarise_all(mean)

My output looks like this:

# A tibble: 2 x 6
# Groups:   Month, quarter_start [?]
  Month quarter_start       quarter_end         Value1 Value2
  <dbl> <dttm>              <dttm>               <dbl>  <dbl>
1    2. 2018-02-15 14:00:00 2018-02-15 14:15:00     1.     1.
2    2. 2018-02-15 14:15:00 2018-02-15 14:30:00     2.     2.
# ... with 1 more variable: Value3 <dbl>

You want to change the column names, I suppose, and maybe the format of the quarter_start and quarter_end columns, but otherwise, I think it is close to what you want.
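If it helps, a minimal sketch of that last step (assuming the pipeline result above is stored in a variable called res, and dplyr is loaded):

# Sketch: rename columns and keep only the clock time of the quarter boundaries
res %>%
    ungroup() %>%
    mutate(Begin = format(quarter_start, "%H:%M"),
           End   = format(quarter_end,   "%H:%M")) %>%
    select(Month, Begin, End,
           MeanValues1 = Value1, MeanValues2 = Value2, MeanValues3 = Value3)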

Thomas Mailund
  • I like this solution very much, but if he already has the data broken up into non time elements, I wonder if it makes sense to combine them and format. Plus, there is a base R solution that is pretty simple. – pyll Mar 20 '18 at 20:16
  • Thank you so far. I‘m overwhelmed how fast you give me some solution ideas. I will test it and give feedback – JohnDong Mar 20 '18 at 20:20
  • @pyll which base R solution do you mean? – JohnDong Mar 20 '18 at 20:22
  • Yeah, considering that the data is already split, it is probably simpler to map the minutes into quarters and use group_by(Month, Day, Hour, Quarter) and then calculate means... – Thomas Mailund Mar 20 '18 at 20:31
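A minimal sketch of that comment's idea, reusing the test_tbl and dplyr from the answer above (the Quarter column here is the starting minute of each quarter hour):

# Sketch: map minutes to quarter-of-hour bins and group on the existing columns
test_tbl %>%
    mutate(Quarter = (Minute %/% 15) * 15) %>%
    select(-Minute) %>%
    group_by(Month, Day, Hour, Quarter) %>%
    summarise_all(mean)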

Here's a data.table and lubridate approach (using data from the answer by @pyll):

DATA

set.seed(8675309)
Month <- rep(1, 30)
Day <- rep(1, 30)
Hour <- rep(1, 30)
Minute <- seq(1, 30, 1)
value1 <- runif(30, 0, 100)
value2 <- runif(30, 10, 20)
df <- data.frame(Month, Day, Hour, Minute, value1, value2)

CODE

library(data.table)
library(lubridate)
dt <- as.data.table(df)

# Convert to a POSIXct date-time (no year in the data, so the current year is assumed; here 2018)
dt <- dt[, date := as.POSIXct(paste(Month, Day, Hour, Minute, sep = " "), 
                          format = "%m %d %H %M", tz = "UTC")]

> head(dt)
   Month Day Hour Minute    value1   value2                date
1:     1   1    1      1 22.260566 16.86117 2018-01-01 01:01:00
2:     1   1    1      2  2.412274 18.53744 2018-01-01 01:02:00
3:     1   1    1      3 87.083359 15.65351 2018-01-01 01:03:00
4:     1   1    1      4 44.914115 15.00117 2018-01-01 01:04:00
5:     1   1    1      5  5.319505 14.06182 2018-01-01 01:05:00
6:     1   1    1      6  2.320090 19.11555 2018-01-01 01:06:00

> str(dt)
Classes ‘data.table’ and 'data.frame':  30 obs. of  7 variables:
 $ Month : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Day   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Hour  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Minute: num  1 2 3 4 5 6 7 8 9 10 ...
 $ value1: num  22.26 2.41 87.08 44.91 5.32 ...
 $ value2: num  16.9 18.5 15.7 15 14.1 ...
 $ date  : POSIXct, format: "2018-01-01 01:01:00" "2018-01-01 01:02:00" "2018-01-01 01:03:00" ...
 - attr(*, ".internal.selfref")=<externalptr>

OUTPUT

> dt[, .(mean1 = mean(value1), mean2 = mean(value2)), 
   by = .(round_date(date, unit = "15 mins"))]

            round_date    mean1    mean2
1: 2018-01-01 01:00:00 27.42124 16.47126
2: 2018-01-01 01:15:00 38.20346 13.91443
3: 2018-01-01 01:30:00 44.70040 16.36477

I think this is a very clean approach, especially if you have to perform custom operations (other than mean) over irregular intervals (say 3 min). Also, since data.table performs operations by reference, it is quite fast. Hope this helps!
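For example, a small sketch of such a custom statistic over 3-minute bins, reusing the dt with the date column built above (median and max chosen arbitrarily here):

# Sketch: custom summaries over 3-minute bins instead of 15-minute means
dt[, .(med1 = median(value1), max2 = max(value2)),
   by = .(interval = floor_date(date, unit = "3 mins"))]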

Gautam