Calculate average daily value from large data set with R standard format date/times?

Question

I have a dataframe of approximately 10 million rows spanning about 570 days. After using striptime to convert the dates and times, the data looks like this:

          date          X1   
1 2004-01-01 07:43:00 1.2587 
2 2004-01-01 07:47:52 1.2585
3 2004-01-01 17:46:14 1.2586 
4 2004-01-01 17:56:08 1.2585
5 2004-01-01 17:56:15 1.2585

I would like to compute the average value on each day (as in days of the year, not days of the week) and then plot them. Eg. Get all rows which have day "2004-01-01", compute average price, then do the same for "2004-01-2" and so on.

Similarly I would be interested in finding the average monthly value, or hourly price, but I imagine I can work these out once I know how to get average daily price.

My biggest difficulty here is extracting the day of the year from the date variable automatically. How can I cycle through all 365 days and compute the average value for each day, storing it in a list?

I was able to find the average value for day of the week using the weekdays() function, but I couldn't find anything similar for this.

Have you tried `df %>% dplyr::group_by(date) %>% summarise(mean(X1))` for the first part? That should give you the average by day. You can use the lubridate package to get the months to do something similar. — RobertMyles, Jun 29 '17 at 00:38

George Wood · Accepted Answer · 2017-06-29T01:16:41.220

9

Here's a solution using dplyr and lubridate. First, simplify the date by rounding it down to the nearest day-unit using floor_date (see below comment by thelatemail), then group_by date and calculate the mean value using summarize:

library(dplyr)
library(lubridate)

df %>%
  mutate(date = floor_date(date)) %>%
  group_by(date) %>%
  summarize(mean_X1 = mean(X1))

Using the lubridate package, you can use a similar method to get the average by month, week, or hour. For example, to calculate the average by month:

df %>%
  mutate(date = month(date)) %>%
  group_by(date) %>%
  summarize(mean_X1 = mean(X1))

And by hour:

df %>%
  mutate(date = hour(date)) %>%
  group_by(date) %>%
  summarize(mean_X1 = mean(X1))

edited Jun 29 '17 at 01:16

answered Jun 29 '17 at 00:41

George Wood

1,914
17
18

1

Also see [this answer](https://stackoverflow.com/a/23179369/4300478) for calculating the average by date using `aggregate`, `data.table`, or `xts` – George Wood Jun 29 '17 at 00:46
Thanks a lot, this is exactly what I was after (both your answer and the one you linked). Cheers! – Patty Jun 29 '17 at 00:50
1

Be careful suggesting `as.Date(datetime)` - `as.Date` uses the "UTC" timezone for conversions from POSIXct, while the datetime is likely stored in the user's timezone. This can lead to dates at the start or end of periods being attributed the wrong characteristics. E.g. - `as.Date(as.POSIXct("2011-01-01",tz="Australia/Sydney"))` – thelatemail Jun 29 '17 at 00:50
1

Thanks for pointing this out, [thelatemail](https://stackoverflow.com/users/496803/thelatemail). I edited the answer to use `floor_date` instead, which I believe maintains the existing timezone. – George Wood Jun 29 '17 at 01:14
@Thelatemail Ahh this makes sense, I was wondering why I was getting 1 day back from what it was. MUst have been timezone related. – Patty Jun 29 '17 at 01:16
Can the summarize be expanded to multiple columns here? – Lmm Mar 21 '19 at 23:24

score 0 · Answer 2 · answered Jun 29 '17 at 03:38

day of year in lubridate is

yday, as in

lubridate::yday(Sys.time())

because the size of data is big I recommend a data.table approach

library(lubridate)
library(data.table)
df$ydate=yday(df$date)
df=data.table(df)
df[,mean(X1),ydate]

if you want different days for different years as in 1Jan2004 and 1Jan2005

library(lubridate)
library(data.table)
df$ydate=ymd(df$date)
df=data.table(df)
df[,mean(X1),ydate]

Note -instead of using striptime to convert dates you could just use ymd_hms function from lubridate

score 0 · Answer 3 · answered Jan 26 '22 at 14:53

Just to contribute, here is the solution to do it for multiple columns in your data frame. It consists of the same method as George, so a little more is added an using summarise:

new_df <- df %>% mutate(date = hour(date)) %>%
group_by(date) %>%
summarise(across(.cols = where(is.numeric), .fns = ~mean(.x, na.rm = TRUE))

In this case, in ".cols" it is specified that the operation be applied to all columns with numeric format (you can modify it for specific columns). In the ".fns" section you can put the operation you want to perform (mean, sd, etc.) and you can apply na.rm.

Greetings!

Calculate average daily value from large data set with R standard format date/times?

3 Answers3

Linked