How to subset data by specific hours of interest?

Question

I have a dataset of temperature values taken at specific datetimes across five locations. For whatever reason, sometimes the readings are every hour, and some every four hours. Another issue is that when the time changed as a result of daylight savings, the readings are off by one hour. I am interested in the readings taken every four hours and would like to subset these by day and night to ultimately get daily and nightly mean temperatures.

To summarise, the readings I am interested in are either:

0800, 1200, 1600 =day
2000, 0000, 0400 =night

Recordings between 0800-1600 and 2000-0400 each day should be averaged.

During daylight savings, the equivalent times are:

0900, 1300, 1700 =day
2100, 0100, 0500 =night

Recordings between 0900-1700 and 2100-0500 each day should be averaged.

In the process, I am hoping to subset by site. There are also some NA values or blank cells which should be ignored.

So far, I tried to subset by one hour of interest just to see if it worked, but haven't got any further than that. Any tips on how to subset by a series of times of interest? Thanks!

temperature <- read.csv("SeaTemperatureData.csv", 
                      stringsAsFactors = FALSE) 
temperature <- subset(temperature, select=-c(X)) #remove last column that contains comments, not needed
temperature$Date.Time < -as.POSIXct(temperature$Date.Time,
                                    format="%d/%m/%Y %H:%M",
                                    tz="Pacific/Auckland")
#subset data by time, we only want to include temperatures recorded at certain times
temperature.goat <- subset(temperature, Date.Time==c('01:00:00'), select=c("Goat.Island"))

            Date.Time Goat.Island Tawharanui  Kawau Tiritiri Noises
1 2019-06-10 16:00:00      16.820     16.892 16.749   16.677 15.819
2 2019-06-10 20:00:00      16.773     16.844 16.582   16.654 15.796
3 2019-06-11 00:00:00      16.749     16.820 16.749   16.606 15.819
4 2019-06-11 04:00:00      16.487     16.796 16.654   16.558 15.796
5 2019-06-11 08:00:00      16.582     16.749 16.487   16.463 15.867
6 2019-06-11 12:00:00      16.630     16.773 16.725   16.654 15.867

For your DST change, know that `seq(as.POSIXct("2020-03-01 08:00:00"), as.POSIXct("2020-03-31 08:00:00"), by = "day")` correctly compensates for DST. You can use that to determine which days require which (0800 or 0900) time bins. — r2evans, Apr 22 '20 at 01:11
Thanks @r2evans. What do you mean by saying it compensates for DST? It looks to me like this code lists the times between 01/03/20 and 31/03/20 which occur at 08:00? I'm new to this, sorry for my ignorance! — Louise, Apr 28 '20 at 00:42
When I run that `seq(...)` call, I get a `POSIXt` vector with 31 elements. The first 7 show `08:00:00` and the last 24 show `09:00:00`. So for me, I believe it is compensating for DST. More the point, since you stated you wanted different times during DST, I think it would do the same for you. It is based on the current TZ, though (I'm `"US/Pacific"` or `"America/Los_Angeles"`, currently `-07:00`). — r2evans, Apr 28 '20 at 00:54
Okay got you. I don't really want to do this though because I already manually changed the times to be correct following daylight savings. Do you agree? — Louise, Apr 29 '20 at 01:28
Frankly, no, mainly because (1) I prefer programmatic accommodation for things like DST, and (2) I don't know the rest of your setup and needs. — r2evans, Apr 29 '20 at 03:35

dc37 · Answer 1 · 2020-04-30T17:21:01.700

One possible solution is to extract hours from your DateTime variable, then filter for particular hours of interest.

Here a fake example over 4 days:

library(lubridate)

df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
                 Value = sample(1:100,97, replace = TRUE))

             DateTime Value
1 2020-02-01 00:00:00    99
2 2020-02-01 01:00:00    51
3 2020-02-01 02:00:00    44
4 2020-02-01 03:00:00    49
5 2020-02-01 04:00:00    60
6 2020-02-01 05:00:00    56

Now, you can extract hours with hour function of lubridate and subset for the desired hour:

library(lubridate)

subset(df, hour(DateTime) == 5)

              DateTime Value
6  2020-02-01 05:00:00    56
30 2020-02-02 05:00:00    31
54 2020-02-03 05:00:00    65
78 2020-02-04 05:00:00    80

EDIT: Getting mean of each sites per subset of hours

Per OP's request in comments, the question is to calcualte the mean of values for various sites for different period of times.

Basically, you want to have two period per days, one from 8:00 to 17:00 and the other one from 18:00 to 7:00.

Here, a more elaborated example based on the previous one:

df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
                 Site1 = sample(1:100,97, replace = TRUE),
                 Site2 = sample(1:100,97, replace = TRUE))

             DateTime Site1 Site2
1 2020-02-01 00:00:00   100     6
2 2020-02-01 01:00:00     9    49
3 2020-02-01 02:00:00    86    12
4 2020-02-01 03:00:00    34    55
5 2020-02-01 04:00:00    76    29
6 2020-02-01 05:00:00    41     1
....

So, now you can do the following to label each time point as daily or night, then group by this category for each day and calculate the mean of each individual sites using summarise_at:

library(lubridate)
library(dplyr)

df %>% mutate(Date = date(DateTime),
              Hour= hour(DateTime),
              Category = ifelse(between(hour(DateTime),8,17),"Daily","Night")) %>%
  group_by(Date, Category) %>%
  summarise_at(vars(c(Site1,Site2)), ~ mean(., na.rm = TRUE))

# A tibble: 9 x 4
# Groups:   Date [5]
  Date       Category Site1 Site2
  <date>     <chr>    <dbl> <dbl>
1 2020-02-01 Daily     56.9  63.1
2 2020-02-01 Night     58.9  46.6
3 2020-02-02 Daily     54.5  47.6
4 2020-02-02 Night     36.9  41.7
5 2020-02-03 Daily     42.3  56.9
6 2020-02-03 Night     44.1  55.9
7 2020-02-04 Daily     54.3  50.4
8 2020-02-04 Night     54.8  34.3
9 2020-02-05 Night     75    16

Does it answer your question ?

Yes, this looks to be along the right lines! Thank you. How would I subset for a list of times e.g. not just 5 but 1,5 and 9? Also, at the same time, is it possible to subset by two variables at once? E.g. apply these times but also subset by site, since I have five sets of values for five different sites across these times. — Louise, Apr 28 '20 at 00:45
For several times points, you can `subset(df, hour(DateTime) %in% c(1.5,5,9))` Please provide a reproducible example of your dataset including different sites (see this link: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — dc37, Apr 28 '20 at 00:50
` Date.Time Goat.Island Tawharanui Kawau Tiritiri Noises 1 2019-05-29 12:00:00 NA 18.105 18.224 NA NA 2 2019-05-29 13:00:00 NA 18.105 18.176 NA NA 3 2019-05-29 14:00:00 NA 18.105 18.176 NA NA 4 2019-05-29 15:00:00 NA 18.129 18.176 17.867 NA 5 2019-05-29 16:00:00 NA 18.152 18.176 17.867 18.152 6 2019-05-29 17:00:00 NA 18.152 18.176 17.819 17.819` — Louise, Apr 29 '20 at 01:38
How do I get the above comment to show as code @dc37? I have more to add but the backticks aren't doing anything. — Louise, Apr 29 '20 at 01:39
Please edit your question to copy/paste your code with `backticks`. — dc37, Apr 29 '20 at 01:51
So, it will be better to provide your code as the output of `dput(df)` — dc37, Apr 29 '20 at 01:53
Does your five sites are `Goat.Island`,`Tawharanui`, ... ? What are you trying to filter ? — dc37, Apr 29 '20 at 02:13
Hello, I have now added in the output of my code in the question. I've subsetted the data following your advice before so now everything is looking good in terms of the data I want to work with. You can see I have temperature readings every four hours at each site. I want to take daily and nightly means at each site. So, in the hours between 08:00:00-17:00:00 each day take a mean at each site, and in the hours between 18:00-07:00 each day take a mean at each site. — Louise, Apr 29 '20 at 23:51

How to subset data by specific hours of interest?

1 Answers1