
The data frame df1 summarizes water temperature at different depths (T5m, T15m, T25m, T35m) for every hour (Datetime). As an example of the data frame:

df1<- data.frame(Datetime=c("2016-08-12 12:00:00","2016-08-12 13:00:00","2016-08-12 14:00:00","2016-08-12 15:00:00","2016-08-13 12:00:00","2016-08-13 13:00:00","2016-08-13 14:00:00","2016-08-13 15:00:00"),
                    T5m= c(10,20,20,10,10,20,20,10),
                    T15m=c(10,20,10,20,10,20,10,20),
                    T25m=c(20,20,20,30,20,20,20,30),
                    T35m=c(20,20,10,10,20,20,10,10))
df1$Datetime <- as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S")
df1

             Datetime T5m T15m T25m T35m
1 2016-08-12 12:00:00  10   10   20   20
2 2016-08-12 13:00:00  20   20   20   20
3 2016-08-12 14:00:00  20   10   20   10
4 2016-08-12 15:00:00  10   20   30   10
5 2016-08-13 12:00:00  10   10   20   20
6 2016-08-13 13:00:00  20   20   20   20
7 2016-08-13 14:00:00  20   10   20   10
8 2016-08-13 15:00:00  10   20   30   10

I would like to create a new data frame df2 that contains, for each day, the average water temperature at each depth as well as for the whole water column, together with the standard error of each mean. I would expect something like this (I did the calculations by hand, so there might be some mistakes):

> df2
        Date meanT5m meanT15m meanT25m meanT35m meanTotal seT5m seT15m seT25m seT35m seTotal
1 2016-08-12      15       15     22.5       15    16.875  2.88   2.88    2.5   2.88    1.29
2 2016-08-13      15       15     22.5       15    16.875  2.88   2.88    2.5   2.88    1.29

I am especially interested in knowing how to do this with data.table, since I will be working with huge data frames and I understand data.table is quite efficient.

For calculating the standard error I know of the std.error() function from the plotrix package.
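As far as I understand, std.error() is essentially the standard deviation divided by the square root of the sample size; a minimal base-R sketch of what I mean by standard error (assuming that is what plotrix::std.error() computes, with NAs dropped):

    # Standard error of the mean: sd(x) / sqrt(n), with NAs dropped
    # (assumed to be equivalent to plotrix::std.error())
    sem <- function(x) {
      x <- x[!is.na(x)]
      sd(x) / sqrt(length(x))
    }

    sem(c(10, 20, 20, 10))  # 2.886751, i.e. the 2.88 expected for seT5m above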

Dekike
  • Check this answer: https://stackoverflow.com/a/36527357/10580543. In your case you would summarise by Date and depth interval instead of location. – tom Aug 06 '19 at 11:32
  • Thanks @tom. I don't quite get it... Do you know exactly which code I should write? Thanks! – Dekike Aug 06 '19 at 11:48
  • I am not familiar enough with data.table to do it at the moment, but your question comes down to grouping and summarising the data, which the post I shared above does with data.table. – tom Aug 06 '19 at 11:57

2 Answers


Update based on @chinsoon's comment


  1. First transform your data frame into a data table:

    library(data.table)
    setDT(df1)
    
  2. Create a total column:

    df1[, total := rowSums(.SD), .SDcols = grep("T[0-9]+m", names(df1))][]
    #               Datetime T5m T15m T25m T35m total
    # 1: 2016-08-12 12:00:00  10   10   20   20    60
    # 2: 2016-08-12 13:00:00  20   20   20   20    80
    # 3: 2016-08-12 14:00:00  20   10   20   10    60
    # 4: 2016-08-12 15:00:00  10   20   30   10    70
    # 5: 2016-08-13 12:00:00  10   10   20   20    60
    # 6: 2016-08-13 13:00:00  20   20   20   20    80
    # 7: 2016-08-13 14:00:00  20   10   20   10    60
    # 8: 2016-08-13 15:00:00  10   20   30   10    70
    
  3. Apply the functions per day:

    library(lubridate)
    (df3 <- df1[, as.list(unlist(lapply(.SD, function (x)
                 c(mean = mean(x), sem = sd(x) / sqrt(length(x)))))), 
           day(Datetime)])
    #    day T5m.mean  T5m.sem T15m.mean T15m.sem T25m.mean T25m.sem T35m.mean
    # 1:  12       15 2.886751        15 2.886751      22.5      2.5        15
    # 2:  13       15 2.886751        15 2.886751      22.5      2.5        15
    #    T35m.sem total.mean total.sem
    # 1: 2.886751       67.5  4.787136
    # 2: 2.886751       67.5  4.787136
    
thothal
  • Hi @thothal, do you know why the `total` mean is wrong? It should be `16.875` instead of `67.5`. The mean represents the overall mean for the day considering the different depths and hours. – Dekike Aug 06 '19 at 12:13
  • 1
    So in my example `total.mean` is the mean of column `total` that is the sum of all depth per `hour`. So I guess you should replace `rowSums` by `rowMeans` in step 2. – thothal Aug 06 '19 at 12:16
  • 1
    you might want to see https://stackoverflow.com/questions/29620783/apply-multiple-functions-to-multiple-columns-in-data-table/29621821#29621821 and the comments, hence `f <- function(x) c(mean=mean(x), sem=sd(x) / sqrt(length(x))); df1[, as.list(unlist(lapply(.SD, f))), by=mday(Datetime), .SDcols=T5m:T35m]` – chinsoon12 Aug 07 '19 at 06:05
  • @chinsoon: thanks very enlightening, updated my answer +1 – thothal Aug 07 '19 at 06:55
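Putting the `rowMeans` fix and @chinsoon12's suggestion together, a minimal sketch of what the updated computation could look like (grouping by calendar date via `as.IDate()` rather than by day of the month, so that the same day number in different months is not merged; the `total` and `Date` names are illustrative):

    library(data.table)
    setDT(df1)

    # per-hour mean over the whole water column (rowMeans instead of rowSums)
    df1[, total := rowMeans(.SD), .SDcols = grep("T[0-9]+m", names(df1))]

    # daily mean and standard error of the mean for every column
    f <- function(x) c(mean = mean(x), sem = sd(x) / sqrt(length(x)))
    df1[, as.list(unlist(lapply(.SD, f))),
        by = .(Date = as.IDate(Datetime)),
        .SDcols = c("T5m", "T15m", "T25m", "T35m", "total")]
    # total.mean is now 16.875 per day, i.e. the mean over all depths and hours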

Here is one way using dplyr and tidyr, calculated in two parts:

library(dplyr)
library(tidyr)

df2 <- df1 %>%
         mutate(Datetime = as.Date(Datetime)) %>%
         gather(key, value, -Datetime) %>%
         group_by(Datetime, key) %>%
         summarise(se = plotrix::std.error(value), 
                  mean = mean(value)) %>%
         gather(total, value, -key,  -Datetime) 


bind_rows(df2, df2 %>%  
                 group_by(Datetime, total) %>%
                 summarise(value = sum(value)) %>%
                 mutate(key = paste("total", c("mean", "se"), sep = "_"))) %>%
           unite(key, key, total) %>%
           spread(key, value) 


# A tibble: 2 x 11
# Groups:   Datetime [2]
#  Datetime   T15m_mean T15m_se T25m_mean T25m_se T35m_mean
#  <date>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>
#1 2016-08-12        15    2.89      22.5     2.5        15
#2 2016-08-13        15    2.89      22.5     2.5        15
# … with 5 more variables: T35m_se <dbl>, T5m_mean <dbl>,
#   T5m_se <dbl>, total_mean_mean <dbl>, total_se_se <dbl>
Ronak Shah
  • Thanks @Ronak Shah, do you know how to also include the mean over all the depths? – Dekike Aug 06 '19 at 11:49
  • @Dekike I have already included that in the calculation, right? `summarise(se = plotrix::std.error(value), mean = mean(value))` – Ronak Shah Aug 06 '19 at 12:46
  • Sorry, I didn't realize that there were 5 more columns hidden. I thought that what you showed was the final display with all the parameters. You are right! – Dekike Aug 06 '19 at 12:50
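Note that `total_mean_mean` in the output above appears to be the sum of the four depth means (analogous to the `rowSums` discussion under the other answer), whereas `meanTotal` in the expected `df2` (16.875) pools all depths and hours. A sketch of one possible dplyr/tidyr variant that computes the pooled daily statistics directly from the long data (the `long`, `per_depth` and `whole_column` names are illustrative, not from the answer):

    library(dplyr)
    library(tidyr)

    # long format: one row per Datetime x depth
    long <- df1 %>%
      mutate(Date = as.Date(Datetime)) %>%
      gather(depth, value, -Datetime, -Date)

    # daily mean and standard error per depth
    per_depth <- long %>%
      group_by(Date, depth) %>%
      summarise(mean = mean(value), se = plotrix::std.error(value))

    # daily mean and standard error pooling all depths and hours
    whole_column <- long %>%
      group_by(Date) %>%
      summarise(depth = "Total", mean = mean(value), se = plotrix::std.error(value))

    # one wide row per day, with a mean_* and se_* column per depth plus Total
    bind_rows(per_depth, whole_column) %>%
      gather(stat, value, mean, se) %>%
      unite(key, stat, depth) %>%
      spread(key, value)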