Averaging the value with respect to time

Question

I have the dataset below with date-time stamps and the corresponding values. The readings are 10 minutes apart, and I need to generate new rows at 15-minute intervals.

For example, at 15:40 the value is 599 and at 15:50 it is 594, so a new row needs to be generated between the two, i.e. 15:45, with the average of 599 and 594, which is 596.5.

That is, I need to average the values at :10 and :20 to get the value for, say, 16:15, and the values at :40 and :50 to get the value for 16:45. The values at :00 and :30 remain the same.

Date...Time       RA.CO2
6/15/2017 15:40   599
6/15/2017 15:50   594
6/15/2017 16:00   606
6/15/2017 16:10   594
6/15/2017 16:20   594
6/15/2017 16:30   594
6/15/2017 16:40   594
6/15/2017 16:50   594
6/16/2017 0:00    496.25
6/16/2017 0:10    500
6/16/2017 0:20    496.25
6/16/2017 0:30    496.25
6/16/2017 0:40    600
6/16/2017 0:50    650
6/16/2017 1:00    700 

str(df)
'data.frame':   6092 obs. of  2 variables:
 $ Date...Time: chr  "6/15/2017 15:40" "6/15/2017 15:50" "6/15/2017 16:00" "6/15/2017 16:10" ...
 $ RA.CO2     : num  599 594 606 594 594 594 594 594 594 594 ...

Output

 Date...Time          RA.CO2
 6/15/2017 15:45      596.5
 6/15/2017 16:00      606
 6/15/2017 16:15      594
 6/15/2017 16:30      594
 6/15/2017 16:45      594
 6/16/2017 0:00       496.25
 6/16/2017 0:15       498.125
 6/16/2017 0:30       496.25
 6/16/2017 0:45       625
 6/16/2017 1:00       700

I guess you should start by merging your input with a sequence of time every 5 minutes (using `seq`) — Emmanuel-Lin, Sep 08 '17 at 13:15
Can you clarify how your output works? Why has RA.CO2 at `6/16/2017 0:10` changed from 496.25 to 606? — Emmanuel-Lin, Sep 08 '17 at 13:24
I need the sequence 00, 10, 15, 20, 30, 40, 45, 50. The values for 15 mins and 45 mins are missing, so I need to get them by averaging the values at 10 and 20, and at 40 and 50, respectively — Anagha, Sep 11 '17 at 06:02
Your comment is not consistent with the output shown in the question. — G. Grothendieck, Sep 11 '17 at 16:57

www · Accepted Answer · 2017-09-08T13:45:22.750

We can use tidyr to expand the data frame and imputeTS to impute the missing values by linear interpolation.

library(dplyr)
library(tidyr)
library(lubridate)
library(imputeTS)

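dt2 <- dt %>%
  # parse the "m/d/Y H:M" strings into date-times
  mutate(Date...Time = mdy_hm(Date...Time)) %>%
  mutate(Date = as.Date(Date...Time)) %>%
  # expand each day separately so the overnight gap is not filled
  group_by(Date) %>%
  complete(Date...Time = seq(min(Date...Time), max(Date...Time), by = "5 min")) %>%
  # linear interpolation of the new NA rows
  # (imputeTS 3.0 and later name this function na_interpolation)
  mutate(RA.CO2 = na.interpolation(RA.CO2)) %>%
  ungroup() %>%
  select(Date...Time, RA.CO2)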
dt2
# A tibble: 22 x 2
           Date...Time RA.CO2
                <dttm>  <dbl>
 1 2017-06-15 15:40:00  599.0
 2 2017-06-15 15:45:00  596.5
 3 2017-06-15 15:50:00  594.0
 4 2017-06-15 15:55:00  600.0
 5 2017-06-15 16:00:00  606.0
 6 2017-06-15 16:05:00  600.0
 7 2017-06-15 16:10:00  594.0
 8 2017-06-15 16:15:00  594.0
 9 2017-06-15 16:20:00  594.0
10 2017-06-15 16:25:00  594.0
# ... with 12 more rows

My output is not entirely the same as your desired output, because some of the rules are unclear:

How do you get the value at 6/16/2017 0:10?
Why is the interval sometimes 5 minutes and sometimes 10 minutes?
Why do you include the last three rows? The rules for filling their values are also unclear.

Nevertheless, I think my solution shows a possible way to achieve this task. You may need to adjust the code yourself to fit those unclear rules.
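
One way to get from a 5-minute grid to the quarter-hour rows in the desired output is to interpolate and then keep only the timestamps whose minute is a multiple of 15. The following is a minimal base-R sketch under that assumption, using the 6/16 sample values from the question. The names `grid`, `interp`, `keep` and `quarter_hourly`, and the UTC time zone, are illustrative choices and not part of the original answer.

```r
# Sample rows from the question (6/16 only), kept as character on input
dt <- data.frame(
  Date...Time = c("6/16/2017 0:00", "6/16/2017 0:10", "6/16/2017 0:20",
                  "6/16/2017 0:30", "6/16/2017 0:40", "6/16/2017 0:50",
                  "6/16/2017 1:00"),
  RA.CO2 = c(496.25, 500, 496.25, 496.25, 600, 650, 700),
  stringsAsFactors = FALSE
)

# Parse the strings; the time zone is an assumption made for illustration
dt$Date...Time <- as.POSIXct(dt$Date...Time,
                             format = "%m/%d/%Y %H:%M", tz = "UTC")

# Build a 5-minute grid and interpolate linearly onto it
grid <- seq(min(dt$Date...Time), max(dt$Date...Time), by = "5 min")
interp <- approx(dt$Date...Time, dt$RA.CO2, xout = grid)

# Keep only :00, :15, :30 and :45
keep <- as.POSIXlt(grid)$min %% 15 == 0
quarter_hourly <- data.frame(Date...Time = grid[keep],
                             RA.CO2 = interp$y[keep])
quarter_hourly
```

On data spanning several days, the interpolation would presumably need to be run per date, as in the pipeline above, so that overnight gaps are not filled in.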

Data

dt <- read.table(text = "Date...Time       RA.CO2
'6/15/2017 15:40'   599
                 '6/15/2017 15:50'   594
                 '6/15/2017 16:00'   606
                 '6/15/2017 16:10'   594
                 '6/15/2017 16:20'   594
                 '6/15/2017 16:30'   594
                 '6/15/2017 16:40'   594
                 '6/15/2017 16:50'   594
                 '6/16/2017 0:00'    496.25
                 '6/16/2017 0:10'    496.25
                 '6/16/2017 0:20'    496.25
                 '6/16/2017 0:30'    496.25",
                 header = TRUE, stringsAsFactors = FALSE)

For the first date, i.e. 6/15, I'm getting an interval of 20 mins — Anagha, Sep 11 '17 at 06:57

G. Grothendieck · Answer 2 · 2017-09-10T12:15:48.910

Here are some solutions. I have re-read the question and am assuming that new intermediate times should only be inserted before times that are 20 or 50 minutes past the hour, and only when the immediately preceding time (before inserting the intermediate time) is 10 minutes earlier. If that is not the intention of the question, then `it`, the vector of intermediate times, will need to be changed from what is shown.
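
The rule described above can be spelled out step by step. This is only a sketch of that assumption on a small made-up vector of times; `gap_mins`, `minute` and `needs_midpoint` are hypothetical helper names, and the difftime units are requested explicitly rather than relying on the automatic choice.

```r
# Illustrative times: three 10-minute steps, then a jump to the next day
DT <- as.POSIXct(c("2017-06-15 16:00", "2017-06-15 16:10",
                   "2017-06-15 16:20", "2017-06-15 16:30",
                   "2017-06-16 00:20"), tz = "UTC")

# Gap to the previous reading in minutes (NA for the first row)
gap_mins <- c(NA, as.numeric(diff(DT), units = "mins"))

# Minute past the hour for each reading
minute <- as.POSIXlt(DT)$min

# A midpoint is wanted before a :20 or :50 reading that follows a 10-minute gap
needs_midpoint <- !is.na(gap_mins) & gap_mins == 10 & minute %in% c(20, 50)

# Intermediate times sit 5 minutes (300 seconds) before those readings
it <- DT[needs_midpoint] - 5 * 60
it
```

Here only 16:20 qualifies, so `it` holds the single time 16:15; the 00:20 reading is skipped because the reading before it is not 10 minutes earlier.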

1) zoo Merge df with a data frame holding the intermediate times, `it`, and then run na.approx from the zoo package on the RA column to fill in the NA values:

library(zoo)

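# intermediate times: 5 minutes before each :20 or :50 reading that follows a 10-minute gap
it <- with(df, DT[c(FALSE, diff(DT) == 10) & as.POSIXlt(DT)$min %in% c(20, 50)] - 5 * 60)
# add the intermediate times as NA rows, then interpolate linearly
M <- merge(df, data.frame(DT = it), all = TRUE)
transform(M, RA = na.approx(RA))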

giving:

                    DT     RA
1  2017-06-15 15:40:00 599.00
2  2017-06-15 15:45:00 596.50
3  2017-06-15 15:50:00 594.00
4  2017-06-15 16:00:00 606.00
5  2017-06-15 16:10:00 594.00
6  2017-06-15 16:15:00 594.00
7  2017-06-15 16:20:00 594.00
8  2017-06-15 16:30:00 594.00
9  2017-06-15 16:40:00 594.00
10 2017-06-15 16:45:00 594.00
11 2017-06-15 16:50:00 594.00
12 2017-06-16 00:00:00 496.25
13 2017-06-16 00:10:00 496.25
14 2017-06-16 00:15:00 496.25
15 2017-06-16 00:20:00 496.25
16 2017-06-16 00:30:00 496.25

1a) Note that if df were converted to zoo, i.e. z <- read.zoo(df, tz = ""), then this could be written more compactly as follows, giving a zoo object as the result:

na.approx(merge(z, zoo(, it)))

2) approx This one uses no packages. `it` is from above.

with(df, data.frame(approx(DT, RA, xout = sort(c(DT, it)))))

giving:

                     x      y
1  2017-06-15 15:40:00 599.00
2  2017-06-15 15:45:00 596.50
3  2017-06-15 15:50:00 594.00
4  2017-06-15 16:00:00 606.00
5  2017-06-15 16:10:00 594.00
6  2017-06-15 16:15:00 594.00
7  2017-06-15 16:20:00 594.00
8  2017-06-15 16:30:00 594.00
9  2017-06-15 16:40:00 594.00
10 2017-06-15 16:45:00 594.00
11 2017-06-15 16:50:00 594.00
12 2017-06-16 00:00:00 496.25
13 2017-06-16 00:10:00 496.25
14 2017-06-16 00:15:00 496.25
15 2017-06-16 00:20:00 496.25
16 2017-06-16 00:30:00 496.25

Note: The input used for the above is:

df <- structure(list(DT = structure(c(1497555600, 1497556200, 1497556800, 
1497557400, 1497558000, 1497558600, 1497559200, 1497559800, 1497585600, 
1497586200, 1497586800, 1497587400), class = c("POSIXct", "POSIXt"
)), RA = c(599, 594, 606, 594, 594, 594, 594, 594, 496.25, 496.25, 
496.25, 496.25)), .Names = c("DT", "RA"), row.names = c(NA, -12L
), class = "data.frame")

Update: Have revised assumption of which intermediate times to include.

What should the data types be for performing the above operation? In my case the date-time is of type character and RA is numeric — Anagha, Sep 11 '17 at 06:09
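
The solutions above operate on a POSIXct column `DT` and a numeric column `RA`. A minimal sketch of getting there from the character column shown in the question's `str(df)` output, assuming the strings follow the month/day/year hour:minute layout of the sample; the `raw` data frame and the UTC time zone are assumptions made for illustration.

```r
# Character input in the layout shown in the question
raw <- data.frame(
  Date...Time = c("6/15/2017 15:40", "6/15/2017 15:50", "6/16/2017 0:00"),
  RA.CO2 = c(599, 594, 496.25),
  stringsAsFactors = FALSE
)

# Parse date and time together; the format string must cover both parts
df <- data.frame(
  DT = as.POSIXct(raw$Date...Time, format = "%m/%d/%Y %H:%M", tz = "UTC"),
  RA = raw$RA.CO2
)
str(df)
```

If the column is a factor rather than character, wrapping it in `as.character()` before parsing should avoid the factor-related errors.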

score 1 · Answer 3 · answered Sep 08 '17 at 13:32

Here's a solution using dplyr:

library(dplyr)

df %>%
  # calculate interpolated value between each row & next row
  mutate(DT.next = lead(DT),
         RA.next = lead(RA)) %>%
  mutate(diff = difftime(DT.next, DT)) %>%
  filter(as.numeric(diff) == 10) %>% #keep only 10 min intervals
  mutate(DT.interpolate = DT + diff/2,
         RA.interpolate = (RA + RA.next) / 2) %>%
  # bind to original dataframe & sort by date
  select(DT.interpolate, RA.interpolate) %>%
  rename(DT = DT.interpolate, RA = RA.interpolate) %>%
  rbind(df) %>%
  arrange(DT)

                    DT     RA
1  2017-06-15 15:40:00 599.00
2  2017-06-15 15:45:00 596.50
3  2017-06-15 15:50:00 594.00
4  2017-06-15 15:55:00 600.00
5  2017-06-15 16:00:00 606.00
6  2017-06-15 16:05:00 600.00
7  2017-06-15 16:10:00 594.00
8  2017-06-15 16:15:00 594.00
9  2017-06-15 16:20:00 594.00
10 2017-06-15 16:25:00 594.00
11 2017-06-15 16:30:00 594.00
12 2017-06-15 16:35:00 594.00
13 2017-06-15 16:40:00 594.00
14 2017-06-15 16:45:00 594.00
15 2017-06-15 16:50:00 594.00
16 2017-06-16 00:00:00 496.25
17 2017-06-16 00:05:00 496.25
18 2017-06-16 00:10:00 496.25
19 2017-06-16 00:15:00 496.25
20 2017-06-16 00:20:00 496.25
21 2017-06-16 00:25:00 496.25
22 2017-06-16 00:30:00 496.25

Dataset:

df <- data.frame(
  DT = c(seq(from = as.POSIXct("2017-06-15 15:40"),
             to = as.POSIXct("2017-06-15 16:50"),
             by = "10 min"),
         seq(from = as.POSIXct("2017-06-16 00:00"),
             to = as.POSIXct("2017-06-16 00:30"),
             by = "10 min")),
  RA = c(599, 594, 606, rep(594, 5), rep(496.25, 4))
)

Z.Lin

Nice solution. This is probably better than my solution since only `dplyr` is used. If you want to stay in `dplyr`, you can use `bind_rows` instead of `rbind`. – www Sep 08 '17 at 13:39
And I'm getting the error below: Error in mutate_impl(.data, dots) : Evaluation error: character string is not in a standard unambiguous format. – Anagha Sep 11 '17 at 07:12
If your date variable is in character format, you should probably convert it into date/time format first. Try `as.POSIXct(DT, format = "%m/%d/%Y %H:%M")`. If it's a factor, convert it to character before trying the above step. Also, your input ends at 0:30. How does the expected output extend to 1:00? – Z.Lin Sep 11 '17 at 07:24
Thanks. The above input and output are just a sample of the data from my dataset. The actual dataset consists of 6000 data points – Anagha Sep 11 '17 at 07:34
@Anagha Then provide a sample of your actual data structure. Look at the answers here; we all have to make assumptions regarding the format of your data, *because you haven't provided that information*. Read [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) if you aren't sure how to do that. – Z.Lin Sep 11 '17 at 07:40
Above is the format of my actual data. The input is every 10 mins. I need it reformatted to every 15 mins. I have also provided an example in my question – Anagha Sep 11 '17 at 07:42
You still don't get it. You are facing errors regarding the format of your variables. Are they characters? Factors with weirdly ordered levels? Wrongly formatted datetime objects? When you paste the console output of your dataset, we don't see this information. Use `dput()` on your dataset & paste the output. That's the kind of detail required. – Z.Lin Sep 11 '17 at 07:44
I have mentioned it in one of the comments. I will update the question anyway – Anagha Sep 11 '17 at 08:22

score 1 · Answer 4 · answered Sep 08 '17 at 13:41

Here is a different idea using zoo library,

library(zoo)

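# duplicate every row, then turn each second copy into the inserted 5-minute row
df1 <- df[rep(rownames(df), each = 2),]
df1$DateTime[c(FALSE, TRUE)] <- df1$DateTime[c(FALSE, TRUE)]+5*60
# note: by = 2 yields one mean per non-overlapping pair (6 values for 12 rows),
# which R recycles across the 12 inserted rows
df1$RA.CO2[c(FALSE, TRUE)] <- rollapply(df$RA.CO2, 2, by = 2, mean)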

which gives,

               DateTime RA.CO2
1    2017-06-15 15:40:00 599.00
1.1  2017-06-15 15:45:00 596.50
2    2017-06-15 15:50:00 594.00
2.1  2017-06-15 15:55:00 600.00
3    2017-06-15 16:00:00 606.00
3.1  2017-06-15 16:05:00 594.00
4    2017-06-15 16:10:00 594.00
4.1  2017-06-15 16:15:00 594.00
5    2017-06-15 16:20:00 594.00
5.1  2017-06-15 16:25:00 496.25
6    2017-06-15 16:30:00 594.00
6.1  2017-06-15 16:35:00 496.25
7    2017-06-15 16:40:00 594.00
7.1  2017-06-15 16:45:00 596.50
8    2017-06-15 16:50:00 594.00
8.1  2017-06-15 16:55:00 600.00
9    2017-06-16 00:00:00 496.25
9.1  2017-06-16 00:05:00 594.00
10   2017-06-16 00:10:00 496.25
10.1 2017-06-16 00:15:00 594.00
11   2017-06-16 00:20:00 496.25
11.1 2017-06-16 00:25:00 496.25
12   2017-06-16 00:30:00 496.25
12.1 2017-06-16 00:35:00 496.25
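
In the listing above the inserted values repeat in a cycle of six, because `by = 2` averages non-overlapping pairs only. If the intent is to average each reading with the one that follows it, a base-R sketch along these lines may be closer; it is a variant under that assumption rather than the answer's own method, and the four-row `df` plus the names `n`, `mid` and `out` are made up for illustration.

```r
# Small illustrative input with the same column names as the answer
df <- data.frame(
  DateTime = seq(as.POSIXct("2017-06-15 15:40", tz = "UTC"),
                 by = "10 min", length.out = 4),
  RA.CO2 = c(599, 594, 606, 594)
)

n <- nrow(df)

# Midpoint rows: 5 minutes after each reading except the last,
# valued at the mean of that reading and the next one
mid <- data.frame(
  DateTime = df$DateTime[-n] + 5 * 60,
  RA.CO2 = (df$RA.CO2[-n] + df$RA.CO2[-1]) / 2
)

# Combine with the original rows and sort by time
out <- rbind(df, mid)
out <- out[order(out$DateTime), ]
rownames(out) <- NULL
out
```

This sketch inserts a midpoint between every pair of consecutive rows, so on the full data it would also place a row across the overnight gap unless rows with a gap other than 10 minutes are filtered out first.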

I'm getting a warning: Warning message: In Ops.factor(JCI_CO2_Com_Final$Date...Time[c(FALSE, TRUE)], 5 * : ‘+’ not meaningful for factors, and the result contains NAs — Anagha, Sep 11 '17 at 05:57
Error in df1$DT[c(FALSE, TRUE)] + 5 * 60 : non-numeric argument to binary operator — Anagha, Sep 11 '17 at 07:31
