In R, is it possible to include the same row in multiple groups, or is there other workaround?

Question

I've measured N20 flux from soil at multiple timepoints in the day (not equally spaced). I'm trying to calculate the total N20 flux from soil for a subset of days by finding the area under the curve for the given day. I know how to do this when using only measures from the given day, however, I'd like to include the last measure of the previous day and the first measure of the following day to improve the estimation of the curve.

Here's an example to give a more concrete idea:

library(MESS)
library(lubridate)
library(dplyr)

Generate Reproducible Example

datetime <- seq(ymd_hm('2015-04-07 11:20'),ymd('2015-04-13'), by = 'hours')
dat <- data.frame(datetime, day = day(datetime), Flux = rnorm(n = length(datetime), mean = 400, sd = 20))

useDate <- data.frame(day = c(7:12), DateGood = c("No", "Yes", "Yes", "No", "Yes", "No"))
  dat <- left_join(dat, useDate)

Some days are "bad" (too many missing measures) and some are "Good" (usable). The goal is to filter all measurements (rows) that occurred on a "Good" day as well as the last measurement from the day before and the first measurement on the next day.

  out <- dat %>%
      mutate(lagDateGood = lag(DateGood),
             leadDateGood = lead(DateGood)) %>%
      filter(lagDateGood != "No" | leadDateGood != "No")

Now I need to calculate the area under the curve - this is not correct

out2 <- out %>%
    group_by(day) %>%
    mutate(hourOfday = hour(datetime) + minute(datetime)/60) %>%
    summarize(auc = auc(x = hourOfday, y = Flux, from = 0, to = 24, type = "spline"))

The trouble is that I don't include the measurements on end of previous day and start of following day when calculating AUC. Also, I get an estimate of flux for day 10, which is a "bad" day.

I think the crux of my question has to do with groups. Some measurements need to be in multiple groups (for example the last measurement on day 8 would be used in estimating AUC for day 8 and day 9). Do you have suggestions for how I could form new groups? Or might there be a completely different way to achieve the goal?

If you want a value to be in multiple groups, then you need to duplicate the value in the input data set. I suggest you create a second table containing the 'last' value from each day, and push the date forward so I matches the following day. It would make sense to add a new column to indicate this special case, then rbind this data set to your original data set. — Alex Brown, Nov 18 '15 at 19:32
Thanks for the comment, Alex. I'm not sure I follow you. I think I could do this, but I'm not sure it makes the most sense. The time at which the measurement happened on the previous day is important to calculate the spline correctly. I'm afraid just moving it to the next day would lose this information. — Tedward, Nov 18 '15 at 19:56
So add a new column that just contains the group identifier ( rounded day for which the calculation is performed). Push the group is of the last reading of the previous day forwards so it can be collected with the raw readings of the following day. — Alex Brown, Nov 18 '15 at 20:01
I think I see what you mean. I can use two columns to describe the groups, then summarize for each, and the rbind results. Doing this might be a bit trickier than your comment suggests, I think. — Tedward, Nov 18 '15 at 20:20
Here's a related question I posted: http://stackoverflow.com/questions/33789627/in-r-split-a-dataframe-so-subset-dataframes-contain-last-row-of-previous-datafr — Tedward, Nov 18 '15 at 20:29
I could also duplicate rows that are in multiple groups (used for multiple days) and reassign their group. — Tedward, Nov 18 '15 at 21:39
Yes, that's what I mean; duplication A1A2A3B1B2C1 grouped as AAABBC and then add duplicates for A3B2 grouped as BC. — Alex Brown, Nov 18 '15 at 21:49

score 0 · Answer 1 · answered Nov 18 '15 at 21:51

For what it's worth, this is what I did. The answer really lies in the question I linked to in the comments. Starting with the dataframe "out" from the question:

#Now I need to calculate the area under the curve for each day
n <- nrow(out)
extract <- function(ix) out[seq(max(1, min(ix)-1), min(n, max(ix) + 1)), ]
res <- lapply(split(1:n, out$day), extract)

calcTotalFlux <- function(df) {
    if (nrow(df) < 10) {              # make sure the day has at least 10 measures
        NA
    } else {
    day_midnight <- floor_date(df$datetime[2], "day")
    df %>%
    mutate(time = datetime - day_midnight) %>%
    summarize(TotalFlux = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))}
}

do.call("rbind",lapply(res, calcTotalFlux))

    TotalFlux
7         NA
8   585230.2
9   579017.3
10        NA
11  563689.7
12        NA

score 0 · Answer 2 · answered Nov 18 '15 at 22:35

Here's another way. More in line with the suggestions of @Alex Brown.

 # Another way
last <- out %>%
    group_by(day) %>%
    filter(datetime == max(datetime)) %>%
    ungroup() %>%
    mutate(day = day + 1)

first <- out %>%
    group_by(day) %>%
    filter(datetime == min(datetime)) %>%
    ungroup() %>%
    mutate(day = day - 1)

d <- rbind(out, last, first) %>%
    group_by(day) %>%
    arrange(datetime)

n_measures_per_day <- d %>%
    summarize(n = n())

d <- left_join(d, n_measures_per_day) %>%
    filter(n > 4)

TotalFluxDF <- d %>%
    mutate(timeAtMidnight = floor_date(datetime[3], "day"),
           time = datetime - timeAtMidnight) %>%
    summarize(auc = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))

TotalFluxDF

Source: local data frame [3 x 2]

    day      auc
  (dbl)    (dbl)
1     8 585230.2
2     9 579017.3
3    11 563689.7

In R, is it possible to include the same row in multiple groups, or is there other workaround?

2 Answers2

Linked