
I have a data frame with several variables; its first columns look like this:

library(lubridate) # needed for the ymd() and week() calls below

Place <- c(rep("PlaceA",14),rep("PlaceB",15))
Group_Id <- c(rep("A1",5),rep("A1",6),rep("A2",3),rep("B1",6),rep("B2",4),rep("B2",5))
Time <- as.Date(c("2018-01-15","2018-02-03","2018-02-27","2018-03-10","2018-03-18","2019-02-02","2019-03-01","2019-03-15","2019-03-28","2019-04-05","2019-04-12","2018-02-01",
                  "2018-03-01","2018-04-07","2018-01-17","2018-01-27","2018-02-17","2018-03-03","2018-04-02","2018-04-25","2018-03-03","2018-03-18","2018-04-08","2018-04-20",
                  "2019-01-23","2019-02-09","2019-02-27","2019-03-12","2019-03-30"))
FollowUp <- c("start", paste("week", week(ymd(Time[2:5]))),
              "start", paste("week", week(ymd(Time[7:11]))),
              "start", paste("week", week(ymd(Time[13:14]))),
              "start", paste("week", week(ymd(Time[16:20]))),
              "start", paste("week", week(ymd(Time[22:24]))),
              "start", paste("week", week(ymd(Time[26:29]))))
exprmt <- c(rep(1,5),rep(2,6),rep(3,3),rep(4,6),rep(5,4),rep(6,5))
df1 <- data.frame(Place, Group_Id, Time, exprmt, FollowUp) # assemble the data frame shown below

> df1
    Place Group_Id       Time exprmt FollowUp
1  PlaceA       A1 2018-01-15      1    start
2  PlaceA       A1 2018-02-03      1   week 5
3  PlaceA       A1 2018-02-27      1   week 9
4  PlaceA       A1 2018-03-10      1  week 10
5  PlaceA       A1 2018-03-18      1  week 11
6  PlaceA       A1 2019-02-02      2    start
7  PlaceA       A1 2019-03-01      2   week 9
8  PlaceA       A1 2019-03-15      2  week 11
9  PlaceA       A1 2019-03-28      2  week 13
10 PlaceA       A1 2019-04-05      2  week 14
11 PlaceA       A1 2019-04-12      2  week 15
12 PlaceA       A2 2018-02-01      3    start
13 PlaceA       A2 2018-03-01      3   week 9
14 PlaceA       A2 2018-04-07      3  week 14
15 PlaceB       B1 2018-01-17      4    start
16 PlaceB       B1 2018-01-27      4   week 4
17 PlaceB       B1 2018-02-17      4   week 7
18 PlaceB       B1 2018-03-03      4   week 9
19 PlaceB       B1 2018-04-02      4  week 14
20 PlaceB       B1 2018-04-25      4  week 17
21 PlaceB       B2 2018-03-03      5    start
22 PlaceB       B2 2018-03-18      5  week 11
23 PlaceB       B2 2018-04-08      5  week 14
24 PlaceB       B2 2018-04-20      5  week 16
25 PlaceB       B2 2019-01-23      6    start
26 PlaceB       B2 2019-02-09      6   week 6
27 PlaceB       B2 2019-02-27      6   week 9
28 PlaceB       B2 2019-03-12      6  week 11
29 PlaceB       B2 2019-03-30      6  week 13

For each Place (more than 2 in my actual data), I have a separate data frame with hourly temperature records. For example:

set.seed(1032)
t <- c(seq.POSIXt(from = ISOdate(2018,01,01),to = ISOdate(2018,06,01), by = "hour"),seq.POSIXt(from = ISOdate(2019,01,01),to = ISOdate(2019,06,01), by = "hour"))
temp_A <- runif(length(t),min = 5, max = 25)
temp_B <- runif(length(t),min = 3, max = 32)
data_A <- data.frame(t,temp_A)
data_B <- data.frame(t,temp_B)

> head(data_A)
                    t   temp_A
1 2018-01-01 12:00:00 14.24961
2 2018-01-01 13:00:00 21.64925
3 2018-01-01 14:00:00 21.77058
4 2018-01-01 15:00:00 13.31673
5 2018-01-01 16:00:00 16.10350
6 2018-01-01 17:00:00 17.64567

I need to add a column to df1 with the average temperature over each time interval, by Place, Group_Id and exprmt: the first row of each group should be NaN, and then I need the average over each successive time interval. Note that the temperature data for each Place are stored in a separate data frame.

I tried something like this, but it is not working:

df1 <- df1 %>% group_by(Place,Group_Id,exprmt) %>% mutate(
  temp = case_when(FollowUp == "start" & Place == "PlaceA" ~ NA,
                   FollowUp == FollowUp[c(2:n())] & Place == "PlaceA" ~ mean(temp_A[c(which(date(temp_A$t))==lag(Time,1):which(date(temp_A$t))==Time),2]),
                   )
)

I found information on how to calculate averages over multiple data frames (e.g. this or this), but that is not what I am looking for. I would like to do it without a loop. My expected result is below (etc stands for "and so on"):

> df1
    Place Group_Id       Time exprmt FollowUp                                      expected
1  PlaceA       A1 2018-01-15      1    start                                           NaN
2  PlaceA       A1 2018-02-03      1   week 5 mean temp_A between 2018-01-15 and 2018-02-03
3  PlaceA       A1 2018-02-27      1   week 9 mean temp_A between 2018-02-03 and 2018-02-27
4  PlaceA       A1 2018-03-10      1  week 10 mean temp_A between 2018-02-27 and 2018-03-10
5  PlaceA       A1 2018-03-18      1  week 11 mean temp_A between 2018-03-10 and 2018-03-18
6  PlaceA       A1 2019-02-02      2    start                                           NaN
7  PlaceA       A1 2019-03-01      2   week 9 mean temp_A between 2019-02-02 and 2019-03-01
8  PlaceA       A1 2019-03-15      2  week 11                                           etc
9  PlaceA       A1 2019-03-28      2  week 13                                           etc
10 PlaceA       A1 2019-04-05      2  week 14                                           etc
11 PlaceA       A1 2019-04-12      2  week 15                                           etc
12 PlaceA       A2 2018-02-01      3    start                                           etc
13 PlaceA       A2 2018-03-01      3   week 9                                           etc
14 PlaceA       A2 2018-04-07      3  week 14                                           etc
15 PlaceB       B1 2018-01-17      4    start                                           NaN
16 PlaceB       B1 2018-01-27      4   week 4 mean temp_B between 2018-01-17 and 2018-01-27
17 PlaceB       B1 2018-02-17      4   week 7                                           etc
18 PlaceB       B1 2018-03-03      4   week 9                                           etc
19 PlaceB       B1 2018-04-02      4  week 14                                           etc
20 PlaceB       B1 2018-04-25      4  week 17                                           etc
21 PlaceB       B2 2018-03-03      5    start                                           etc
22 PlaceB       B2 2018-03-18      5  week 11                                           etc
23 PlaceB       B2 2018-04-08      5  week 14                                           etc
24 PlaceB       B2 2018-04-20      5  week 16                                           etc
25 PlaceB       B2 2019-01-23      6    start                                           etc
26 PlaceB       B2 2019-02-09      6   week 6                                           etc
27 PlaceB       B2 2019-02-27      6   week 9                                           etc
28 PlaceB       B2 2019-03-12      6  week 11                                           etc
29 PlaceB       B2 2019-03-30      6  week 13                                           etc

Any help will be appreciated!

Mata
  • Hi @Mata. Please find below a detailed answer, just to show the detail of the approach/logic. I hope it will meet your needs. Cheers – lovalery Nov 19 '21 at 22:32
  • Thank you @lovalery! I get the logic of your solution. I am working on it, to see if I can adapt it to my real dataset. I will try to translate it into a dplyr syntax though, because I am more familiar with it. – Mata Nov 22 '21 at 17:12
  • Thanks for your feedback @Mata. Of course, it's up to you, you can translate the code to run your treatments with the dplyr library. I am not very familiar with dplyr joins but it seems that this library is not very easy to use for non-equal joins. I wish you the best in your work. Cheers. – lovalery Nov 22 '21 at 22:46

2 Answers


Sharing the results with the temperature data of the two places. You can always generalize this either by joining everything into a single data object (if the total number of places is small) or by using an ifelse statement; a sketch of one such generalization follows the code below.

library(data.table)
setDT(df1)
setDT(data_A) # converting to data.table
setDT(data_B) # converting to data.table

Merge the temperature data into a single object:

data_AB <- merge(data_A, data_B, by = 't')

Create a lagged copy of the Time variable within each Place, Group_Id and exprmt group:

df1[,':='(LAG_DATE = shift(Time, type = 'lag')), by = .(Place, Group_Id, exprmt)]

Now use apply() with a user-defined function to subset the temperature data between consecutive time points, and data.table's lapply(.SD, mean) to compute the mean of each subset.

Here I have assumed that the Place column can somehow be joined/mapped to the temperature data by some rule. In the shared example, the column names temp_A/temp_B can be formed by concatenating 'temp_' with the 6th character of the Place column:

df1[, ':='(EXPECTED = apply(cbind(LAG_DATE, Time, Place), 1, function(x) {
  x1 <- as.Date(as.numeric(x[1]), origin = '1970-01-01')
  x2 <- as.Date(as.numeric(x[2]), origin = '1970-01-01')
  Place <- as.character(x[3])
  Mean_Value <- ifelse(is.na(x1), NaN,
                       data_AB[as.Date(t) >= x1 & as.Date(t) <= x2,
                               lapply(.SD, mean),
                               .SDcols = paste('temp_', substr(Place, 6, 6), sep = '')])
  return(as.numeric(Mean_Value))
}))]
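
As an aside on the generalization mentioned above: if the real Place names follow no pattern (city names, say), one option is to keep each Place's temperature table in a named list and look it up by Place name instead of building the column name with substr. The following is only a sketch of that idea (temp_tables and EXPECTED_alt are hypothetical names of my own; I assume LAG_DATE has been created as above):

# Hypothetical lookup: map each Place name to its temperature table and column
temp_tables <- list(PlaceA = list(dt = data_A, col = 'temp_A'),
                    PlaceB = list(dt = data_B, col = 'temp_B'))

df1[, EXPECTED_alt := mapply(function(x1, x2, place) {
  x1 <- as.Date(x1, origin = '1970-01-01') # defensive: keep/restore Date class
  x2 <- as.Date(x2, origin = '1970-01-01')
  if (is.na(x1)) return(NaN)               # first row of each group has no interval
  m <- temp_tables[[place]]                # pick this Place's table by name
  # .SDcols selects the temperature column by name; lapply(.SD, mean) averages it
  as.numeric(m$dt[as.Date(t) >= x1 & as.Date(t) <= x2,
                  lapply(.SD, mean), .SDcols = m$col])
}, LAG_DATE, Time, Place)]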
Isa
  • Thanks @`Sayed Isa` for your help! Actually, my real Place names have no pattern (city names); would there be a workaround for names with no pattern in your function? Otherwise, it works great! – Mata Nov 23 '21 at 16:03
  • Could you please translate this part of the code for me? `lapply(.SD, mean), .SDcols = paste('temp_', substr(Place, 6, 6), sep = '')])` How do the .SD and .SDcols commands work? That would be very helpful for adapting it to my real dataset. Also, I will have several means to compute for different environmental data in my real dataset. Can I adapt the whole Mean_Value inside the same function to generate as many mean variables as needed, or should I do that in separate functions? Thank you very much for your help! – Mata Nov 23 '21 at 16:29
  • Regarding the Place names utilization within the function, I will need some sample values to suggest anything. `lapply(.SD, mean), .SDcols = paste('temp_', substr(Place, 6, 6), sep = '')])` This query actually subsets the columns and then applies the mentioned function to that data. So here .SDcols subsets the data based on the column name, which is generated through the paste command. `lapply(.SD, mean)` finds the mean of the column mentioned in .SDcols (note we can consider more than one column in .SDcols, but in your case only one was needed). – Isa Nov 23 '21 at 19:25
  • If you can generalize the function then you can use the same function assuming rest of the things remains same. Otherwise you will have to create a new function. – Isa Nov 23 '21 at 19:30

I suggest a detailed step-by-step solution (using the data.table, lubridate and gtools libraries) that tries not to lose the reader. So please find a reprex below.

Reprex

1. DATA PREPARATION

library(data.table)
library(lubridate)
library(gtools)

# Convert the dataframe 'df1' into a data.table and add the lagged helper variable 'StartTime'
setDT(df1)[, StartTime := shift(Time,1), by = .(Place, Group_Id, exprmt)][]
setcolorder(df1, c("Place", "Group_Id", "FollowUp", "exprmt", "StartTime", "Time"))

# Convert the 'StartTime' and 'Time' columns into class 'POSIXct' and into ymd_hms format
# with the function 'ymd_TO_ymd_hms'
ymd_TO_ymd_hms <- function(x,y) as_datetime(as.double(as.POSIXct(x)+3600), tz = y)

sel_cols <- c("StartTime", "Time")
# Shifting 'Time' back one hour makes each interval end at 23:00 of the day before the
# current measurement, so consecutive intervals do not overlap
df1[, (sel_cols) := lapply(.SD, ymd_TO_ymd_hms, "GMT"), .SDcols = sel_cols][, Time := Time - 3600]

# Here is what 'df1' looks like:
df1
#>      Place Group_Id FollowUp exprmt           StartTime                Time
#>  1: PlaceA       A1    start      1                <NA> 2018-01-14 23:00:00
#>  2: PlaceA       A1   week 5      1 2018-01-15 00:00:00 2018-02-02 23:00:00
#>  3: PlaceA       A1   week 9      1 2018-02-03 00:00:00 2018-02-26 23:00:00
#>  4: PlaceA       A1  week 10      1 2018-02-27 00:00:00 2018-03-09 23:00:00
#>  5: PlaceA       A1  week 11      1 2018-03-10 00:00:00 2018-03-17 23:00:00
#>  6: PlaceA       A1    start      2                <NA> 2019-02-01 23:00:00
#>  7: PlaceA       A1   week 9      2 2019-02-02 00:00:00 2019-02-28 23:00:00
#>  8: PlaceA       A1  week 11      2 2019-03-01 00:00:00 2019-03-14 23:00:00
#>  9: PlaceA       A1  week 13      2 2019-03-15 00:00:00 2019-03-27 23:00:00
#> 10: PlaceA       A1  ...

# Convert the dataframes 'data_A' and 'data_B' into data.tables
setDT(data_A)
setDT(data_B)

2. EXPAND ROWS OF 'df1' BY DATE RANGE USING 'StartTime' and 'Time'

df1_time_seq <- df1[!is.na(StartTime)  # remove rows where StartTime = NA
                    ][, .(Place = Place, Group_Id = Group_Id, FollowUp = FollowUp,
                          exprmt = exprmt,
                          Time_seq = seq(from = StartTime, to = Time, by = "hour")),
                      by = 1:nrow(df1[!is.na(StartTime)])]

df1_time_seq
#>       nrow  Place Group_Id FollowUp exprmt            Time_seq
#>    1:    1 PlaceA       A1   week 5      1 2018-01-15 00:00:00
#>    2:    1 PlaceA       A1   week 5      1 2018-01-15 01:00:00
#>    3:    1 PlaceA       A1   week 5      1 2018-01-15 02:00:00
#>    4:    1 PlaceA       A1   week 5      1 2018-01-15 03:00:00
#>    5:    1 PlaceA       A1   week 5      1 2018-01-15 04:00:00
#>   ---                                                         
#> 9784:   23 PlaceB       B2  week 13      6 2019-03-29 19:00:00
#> 9785:   23 PlaceB       B2  week 13      6 2019-03-29 20:00:00
#> 9786:   23 PlaceB       B2  week 13      6 2019-03-29 21:00:00
#> 9787:   23 PlaceB       B2  week 13      6 2019-03-29 22:00:00
#> 9788:   23 PlaceB       B2  week 13      6 2019-03-29 23:00:00

3. JOINS

# Merge 'data_A' and 'data_B' on 't'
data_merge <- merge(data_A, data_B, by = 't')

# Merge 'df1_time_seq' and 'data_merge' on 'Time_seq' = 't' and add a column 'temp' filled with 'temp_A' values when 'Place == PlaceA' and 'temp_B' values when 'Place == PlaceB'
df1_time_seq_merge <- merge(df1_time_seq, data_merge, by.x = "Time_seq", by.y = "t")[
  , temp := fcase(Place == "PlaceA", temp_A,
                  Place == "PlaceB", temp_B)
  ][, `:=` (temp_A = NULL, temp_B = NULL)][]
df1_time_seq_merge
#>                  Time_seq nrow  Place Group_Id FollowUp exprmt      temp
#>    1: 2018-01-15 00:00:00    1 PlaceA       A1   week 5      1 10.618465
#>    2: 2018-01-15 01:00:00    1 PlaceA       A1   week 5      1 16.156850
#>    3: 2018-01-15 02:00:00    1 PlaceA       A1   week 5      1  6.806842
#>    4: 2018-01-15 03:00:00    1 PlaceA       A1   week 5      1 21.036855
#>    5: 2018-01-15 04:00:00    1 PlaceA       A1   week 5      1 21.578569
#>   ---                                                                   
#> 9784: 2019-04-11 18:00:00    9 PlaceA       A1  week 15      2 16.646570
#> 9785: 2019-04-11 19:00:00    9 PlaceA       A1  week 15      2 12.362436
#> 9786: 2019-04-11 20:00:00    9 PlaceA       A1  week 15      2 24.853746
#> 9787: 2019-04-11 21:00:00    9 PlaceA       A1  week 15      2 22.553074
#> 9788: 2019-04-11 22:00:00    9 PlaceA       A1  week 15      2 21.020600
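
As an aside: the OP mentioned in the comments being more comfortable with dplyr. Steps 2 and 3 could be sketched roughly as follows with dplyr/tidyr; this translation is my own assumption, not part of the original answer, and it presumes the dplyr, tidyr and purrr packages are available:

library(dplyr)
library(tidyr)

df1_time_seq_merge_dplyr <- df1 %>%
  filter(!is.na(StartTime)) %>%
  # build a list column holding every hour between StartTime and Time, then unnest it
  mutate(Time_seq = purrr::map2(StartTime, Time, seq, by = "hour")) %>%
  unnest(Time_seq) %>%
  inner_join(data_merge, by = c("Time_seq" = "t")) %>%
  mutate(temp = case_when(Place == "PlaceA" ~ temp_A,
                          Place == "PlaceB" ~ temp_B)) %>%
  select(-temp_A, -temp_B)

Step 4 below then boils down to a plain group_by() + summarise().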

4. SUMMARIZE 'df1_time_seq_merge'

# Summarize df1_time_seq_merge to get the mean of 'temp' by group in the 'expected' variable
df1_mean <- df1_time_seq_merge[, .(expected = mean(temp)), by = .(Place, Group_Id, exprmt, FollowUp)]

df1_mean
#>      Place Group_Id exprmt FollowUp expected
#>  1: PlaceA       A1      1   week 5 15.17243
#>  2: PlaceB       B1      4   week 4 19.26662
#>  3: PlaceB       B1      4   week 7 17.32940
#>  4: PlaceA       A2      3   week 9 14.92409
#>  5: PlaceA       A1      1   week 9 14.86734
#>  6: PlaceB       B1      4   week 9 18.36255
#>  7: PlaceA       A1      1  week 10 14.75482
#>  8: PlaceA       A2      3  week 14 14.86063
#>  9: PlaceB       B1      4  week 14 17.35101
#> 10: PlaceB       B2      5  week 11 17.93565
#> 11: PlaceA       A1      1  week 11 14.86273
#> 12: PlaceB       B2      5  week 14 16.77532
#> 13: PlaceB       B1      4  week 17 18.00866
#> 14: PlaceB       B2      5  week 16 18.15545
#> 15: PlaceB       B2      6   week 6 17.95428
#> 16: PlaceA       A1      2   week 9 14.96347
#> 17: PlaceB       B2      6   week 9 16.85704
#> 18: PlaceB       B2      6  week 11 17.23744
#> 19: PlaceA       A1      2  week 11 15.22046
#> 20: PlaceB       B2      6  week 13 17.33922
#> 21: PlaceA       A1      2  week 13 14.58677
#> 22: PlaceA       A1      2  week 14 15.24341
#> 23: PlaceA       A1      2  week 15 15.87080
#>      Place Group_Id exprmt FollowUp expected
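
Another aside, echoing the discussion of non-equi joins in the comments: the same per-interval means could be obtained in a single step with a data.table non-equi join, skipping the row expansion of step 2 entirely. A minimal sketch (my own addition, not part of the original answer), which should reproduce the means computed just above:

df1_sub <- df1[!is.na(StartTime)]
df1_sub[, expected := data_merge[df1_sub,
                                 on = .(t >= StartTime, t <= Time),
                                 # i.Place refers to the Place of the current df1_sub row
                                 mean(if (i.Place == "PlaceA") temp_A else temp_B),
                                 by = .EACHI]$V1]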

5. FINAL JOIN BETWEEN 'df1' AND 'df1_MEAN'

DF_Results <- merge(df1, df1_mean, by = c("Place", "Group_Id", "exprmt", "FollowUp"), all.x = TRUE)[, Time := Time + 3600][]

6. CLEANING 'DF_Results' TO GET THE DESIRED OUTPUT

ymd_hms_TO_ymd <- function(x) as_date(as.POSIXct(x))

DF_Results[, `:=` (StartTime = NULL, Time = lapply(Time, ymd_hms_TO_ymd))]

setcolorder(DF_Results, c("Place", "Group_Id", "exprmt", "Time", "FollowUp", "expected"))

DF_Results <- DF_Results[gtools::mixedorder(FollowUp, decreasing = FALSE)]
setorder(DF_Results, Place, Group_Id, exprmt)

DF_Results
#>      Place Group_Id exprmt       Time FollowUp expected
#>  1: PlaceA       A1      1 2018-01-15    start       NA
#>  2: PlaceA       A1      1 2018-02-03   week 5 15.17243
#>  3: PlaceA       A1      1 2018-02-27   week 9 14.86734
#>  4: PlaceA       A1      1 2018-03-10  week 10 14.75482
#>  5: PlaceA       A1      1 2018-03-18  week 11 14.86273
#>  6: PlaceA       A1      2 2019-02-02    start       NA
#>  7: PlaceA       A1      2 2019-03-01   week 9 14.96347
#>  8: PlaceA       A1      2 2019-03-15  week 11 15.22046
#>  9: PlaceA       A1      2 2019-03-28  week 13 14.58677
#> 10: PlaceA       A1      2 2019-04-04  week 14 15.24341
#> 11: PlaceA       A1      2 2019-04-11  week 15 15.87080
#> 12: PlaceA       A2      3 2018-02-01    start       NA
#> 13: PlaceA       A2      3 2018-03-01   week 9 14.92409
#> 14: PlaceA       A2      3 2018-04-06  week 14 14.86063
#> 15: PlaceB       B1      4 2018-01-17    start       NA
#> 16: PlaceB       B1      4 2018-01-27   week 4 19.26662
#> 17: PlaceB       B1      4 2018-02-17   week 7 17.32940
#> 18: PlaceB       B1      4 2018-03-03   week 9 18.36255
#> 19: PlaceB       B1      4 2018-04-01  week 14 17.35101
#> 20: PlaceB       B1      4 2018-04-24  week 17 18.00866
#> 21: PlaceB       B2      5 2018-03-03    start       NA
#> 22: PlaceB       B2      5 2018-03-18  week 11 17.93565
#> 23: PlaceB       B2      5 2018-04-07  week 14 16.77532
#> 24: PlaceB       B2      5 2018-04-19  week 16 18.15545
#> 25: PlaceB       B2      6 2019-01-23    start       NA
#> 26: PlaceB       B2      6 2019-02-09   week 6 17.95428
#> 27: PlaceB       B2      6 2019-02-27   week 9 16.85704
#> 28: PlaceB       B2      6 2019-03-12  week 11 17.23744
#> 29: PlaceB       B2      6 2019-03-30  week 13 17.33922
#>      Place Group_Id exprmt       Time FollowUp expected

Created on 2021-11-24 by the reprex package (v2.0.1)

lovalery
  • Hi @lovalery! I am working with the solution you provided. I actually have a difference in my expected values. For example, `mean(data_A[325:804,2])` returns 15.24652 instead of 10.618465 in your solution. I feel it comes from `DF_join_1`: it has 12746 rows versus 7250 in `data_merge`. It probably comes from duplicates, as `length(unique(DF_join_1$StartTime))` returns 7250. I don't really understand the syntax for generating `DF_join_1`: is the `on` doing the mean? So I don't really know how to fix this. Thanks for your help! – Mata Nov 23 '21 at 15:54
  • I think I got it: 10.618465 is the temperature in Place A on 2018-01-15 00:00:00. It does not take the mean. Any help on how to fix it in your solution would be appreciated :) Thanks for your help! – Mata Nov 23 '21 at 16:21
  • Hi @Mata. Thanks for your feedback and sorry for this mistaken code. I will try to find out where the problem lies. Just a few clarifications to your comments: (i) the computation of the mean you perform does not seem right either because in the join `Time > t` . So you should only include the values up to the date `2018-02-02 23:00:00` and not up to `2018-02-03 23:00:00` based on your selection `data_A[325:804,]` . Does it make sense for you too? – lovalery Nov 23 '21 at 16:56
  • (ii) in the `DF_join_1` join there is no computation of mean : `on = ` simply lists (the `.` stands for `list()`) the columns to be taken into account for the join. That said, the code I proposed is incorrect and the game is to find out where the error has crept in! – lovalery Nov 23 '21 at 16:56
  • No problem, don't apologize, thanks for helping! For (i): well, actually both would be fine for me from a biological point of view (it won't make much of a difference and I can argue for both means in my downstream analysis). But maybe including values up to the date `2018-02-02 23:00:00` would be easier, because it will avoid overlapping the means over consecutive intervals. For info, if it helps: `mean(data_A[325:780,2])` returns 15.17243. Just as a way to check the output. – Mata Nov 23 '21 at 17:09
  • Thank you for your kind words. I will do my best to find the solution. And thank you for the reference value because it will help me to avoid proposing you a new wrong code ;-) – lovalery Nov 23 '21 at 17:19
  • Hi @Mata. That's it, I think I found the solution :-) I have completely modified my answer above... and I get the famous result you gave me (i.e. `15.17243`)!!!!. I made it still step by step with intermediate tables to make it easier to follow the logic of the reasoning. If something is not clear, don't hesitate to ask me. If my answer meets your needs, don't hesitate to mark it as "validated" ;-) If not, feel free to tell me what still doesn't work. Cheers. – lovalery Nov 24 '21 at 00:28
  • I went through your answer in parallel with my real dataset, so it took me more time. Thanks, that works well! Two notes. 1) In point 6, I could skip the function; it works well without. 2) Up to point 3, I could use dplyr syntax (easier for me) while following your logic. But I didn't find one to generate `df1_time_seq_merge`. Your data.table solution proved very efficient there! Thanks for your help! – Mata Nov 25 '21 at 16:44
  • Hi @Mata, thanks so much for your feedback and for validating the answer. Glad I could help you. Cheers. – lovalery Nov 25 '21 at 17:00