4

This is a part of my dataset, which has similar data for 12 years and I want to calculate the cumulative sum of when the outflow changes from 0 to a value and store it as one event. I then want to find the maximum, minimum, mean and standard deviation of the event and group it by each year.

For example, in the dataset below, there are the two events. What would be the sum of each of those events and what would be the max, min, mean and SD of the two events?

Date           Outflow
2/2/2014    0
2/2/2014    0
2/2/2014    0.073
2/2/2014    0.096
2/2/2014    0.096
2/2/2014    0.096
2/2/2014    0.015
2/2/2014    0.007
2/2/2014    0.003
2/2/2014    0.002
2/2/2014    0.001
2/2/2014    0.001
2/3/2014    0
2/3/2014    0
2/3/2014    0
2/3/2014    0
2/3/2014    0.022
2/3/2014    0.031
2/3/2014    0.034
2/3/2014    0.036
2/3/2014    0.037
2/3/2014    0.038
2/3/2014    0.038
2/3/2014    0.038
2/3/2014    0.038
2/3/2014    0.104

I have been able to count the number of events for each year but I haven't been able to calculate the sum of each of those events.

I have included code I used to calculate the number of event for each year. LID_yearly is when the dataset is split by year and the code is run.

    m=rep(0,12) ## intialize a counter
    x="no_value"

    for(k in 1:length(LID_yearly)){
    for(i in 1:length(LID_yearly[[k]]$`Surface Runoff`)){
      if(LID_yearly[[k]]$`Surface Runoff`[i]==0){
        m[k]=m[k]
        x="no_value"
      } 

      else if((LID_yearly[[k]]$`Surface Runoff`[i]!=0)&&(x=="no_value")){
        m[k]=m[k]+1
        x="value"
      } else if((LID_yearly[[k]]$`Surface Runoff`[i]!=0)&&(x=="value")){
        m[k]=m[k]
        x="value"
      }
    }
  }

I want to get a column for each statistic I am looking for (eg. max, min etc.) for the events is each year.

This is my current output that I have.

"Year   Sum Max     Min    Mean   SD    No.of inflow  No. of Outflow

"2007   0   0       NA     NA     NA    122           0"
"2008   0   0       NA     NA     NA    106           0"
"2009   0   0       NA     NA     NA    149           0"
"2010   0   0       NA     NA     NA    122           0"
"2011   0   0       NA     NA     NA    125           0"
"2012   0   0       NA     NA     NA    82            0"
"2013   10  0.226   0.001  0.092  0.077 125           5"
"2014   46  0.325   0.001  0.078  0.073 149           14"
"2015   56  0.26    0.001  0.087  0.069 153           15"
"2016   21  0.135   0.001  0.036  0.029 136           15"
"2017   11  0.22    0.002  0.086  0.068 148           4"
"2018   0   0       NA     NA     NA    138           0"

Here the last two columns are counting the number of events but the statistics are not what I am looking for as it calculates the mean or max of every single value rather than as an event.

Here is the code I used using dplyr:

  LIDTDFNEW <- LIDnew %>% group_by(year) %>% summarise("Sum of Bypass 
  Volume"=sum(`Surface Runoff`), 
  "Maximum Bypass Volume"=max(`Surface Runoff`), 
  "Minimum Byass Volume"=(if (!all(`Surface Runoff`==0)) min(`Surface 
  Runoff`[`Surface Runoff`!=0])else NA),
  "Average Bypass Volume"= (if(!all(`Surface Runoff`==0)) mean(`Surface 
  Runoff`[`Surface Runoff`!=0])else NA),
  "SD of Byass Volume"=(if (!all(`Surface Runoff`==0)) sd(`Surface 
  Runoff`[`Surface Runoff`!=0])else NA))

2 Answers2

1

First we generate an event column using cut() applied to == 0-binarized Outflow.

df2$event <- 
  cut(1:nrow(df2), breaks=c(1, which(c(0, diff(df2$Outflow == 0)) != 0), nrow(df2)) - 1)

(Note: This looks more complicated than it is; we just need to add the first row 1 and the last row nrow(df2) to the breaks.)

0-values we can set to NA:

df2$event[df2$Outflow == 0] <- NA

Now we re-level a little to get a nice event column later.

df2$event <- as.numeric(droplevels(df2$event))

Finally — to get event-wise summaries — we use aggregate() to apply all the summary functions.

res1 <- do.call(data.frame, 
        aggregate(Outflow ~ event + Date, df2, function(x) 
  c(sum=sum(x), max=max(x), min=min(x), mean=mean(x), sd=sd(x)))
  )
res1
#   event     Date Outflow.sum Outflow.max Outflow.min Outflow.mean  Outflow.sd
# 1     1 2/2/2014       0.390       0.096       0.001   0.03900000 0.044790872
# 2     2 2/3/2014       0.312       0.038       0.022   0.03466667 0.005315073

Edit

If you rather want a yearly summary, a year column could be convenient which we can obtain with strptime().

df2$year <- substr(strptime(df2$Date, format="%d/%m/%Y"), 1, 4)

Now, first calculate sums, then bind the min and max to an aggregate() over the years.

sums <- aggregate(Outflow ~ event + Date + year, df2, sum)$Outflow
res2 <- cbind(max=max(sums), min=min(sums),
              do.call(data.frame, aggregate(Outflow ~ year, df2, function(x) 
                c(sum=sum(x), mean=mean(x), sd=sd(x)))))

res2
#    max   min year Outflow.sum Outflow.mean Outflow.sd
# 1 0.39 0.312 2014       0.806        0.031  0.0347206

This also should scale up to several years.

Note: I wrapped a do.call(data.frame, .) around the aggregates to get a "data.frame" object with clean columns (i.e. vectors rather than matrices).

Data

df2 <- structure(list(Date = c("2/2/2014", "2/2/2014", "2/2/2014", "2/2/2014", 
"2/2/2014", "2/2/2014", "2/2/2014", "2/2/2014", "2/2/2014", "2/2/2014", 
"2/2/2014", "2/2/2014", "2/3/2014", "2/3/2014", "2/3/2014", "2/3/2014", 
"2/3/2014", "2/3/2014", "2/3/2014", "2/3/2014", "2/3/2014", "2/3/2014", 
"2/3/2014", "2/3/2014", "2/3/2014", "2/3/2014"), Outflow = c(0, 
0, 0.073, 0.096, 0.096, 0.096, 0.015, 0.007, 0.003, 0.002, 0.001, 
0.001, 0, 0, 0, 0, 0.022, 0.031, 0.034, 0.036, 0.037, 0.038, 
0.038, 0.038, 0.038, 0.104)), row.names = c(NA, -26L), class = "data.frame")
Community
  • 1
  • 1
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • 1
    It will work only if there is only one event (increase from 0) per date. – tmfmnk Sep 18 '19 at 19:00
  • @tmfmnk Thanks for the wake-up call, finally more challenging than expected :) See update. – jay.sf Sep 18 '19 at 19:51
  • In your aggregated data frame, what I am expecting is the maximum between two events; so in the case above the maximum would be 0.390 and the minimum would be 0.312 for the year 2014. – Alfy Joseph Sep 18 '19 at 20:54
  • @AlfyJoseph Ok, see update. It would be easier to get what you want if you add a detailed expected output to your questions in future. – jay.sf Sep 18 '19 at 21:40
  • Sure Thanks. Your answer worked out well. Instead of cbind I used group_by to summarize from the calculated sums. Thank you. – Alfy Joseph Sep 19 '19 at 00:31
  • @AlfyJoseph Glad I could help you. – jay.sf Sep 19 '19 at 04:27
1

Here is a dplyr solution:

df %>%
  mutate(events = case_when(outflow != 0 ~ 1, TRUE ~ NA_real_)) %>% # identify events
  mutate(events = rank(events, ties.method = "first")) %>% # add sequence
  mutate(events = case_when(outflow == 0 ~ NA_integer_, lag(outflow) == 0 ~ as.integer(events), TRUE ~ as.integer(lag(events))) # same event, same number
group_by(events) %>% # Group by event
summarise(sum_outflow = sum(outflow, na.rm = TRUE), min_outflow = min(outflow, na.rm = TRUE), #etc.)

This is a bit of a hacky version and you could certainly tidy it up to have more logical names, sequencing for the events.

The group_by afterwards should be fine as is, however.

Edit:

This does not work for if the event is longer than 3 rows, I have to come up with a better solution.

Fnguyen
  • 1,159
  • 10
  • 23