Introduction
Summary:
- Trying to average data by season (when necessary) when certain conditions are met.
Hello everyone.
I am currently working with numerous large data sets (>200 sets with >5000 rows each) of long-term time series data collection for multiple variables across different locations. So far, I've extracted data into separate CSV files per site and per station.
For the most part, the data reported per parameter is one instance per season.
Season here is defined ecologically as DJF, MAM, JJA, SON for months corresponding to Winter, Spring, Summer, and Fall respectively.
However, there are some cases where there were multiple readings during a seasonal event. Here, the parameter values and dates have to be averaged; this is before further analysis can take place on these data sets.
To complicate things even further, some of the data is marked by a Greater Than or Less Than (GTLT) symbol). In these cases, values and dates are not averaged unless the recorded value is the same.
Data Example
Summary:
- Code and Tables show requested changes in data-set
So, for a data-driven example...
Here's a few rows from a data set.
Data.Example<-structure(list(
Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402),
End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"),
Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"),
Value = c(10, 10, 20, 30, 15, 10)),
.Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")
This is ideally what I would like as output
Data.Example.New<-structure(list(
Station.ID.new = c(13402, 13402, 13402, 13402),
End.Date.new = structure(c(2L, 3L, 2L, 1L), .Label = c("10/13/2016", "7/28/2016", "8/15/2016"), class = "factor"),
Parameter.Name.new = structure(c(2L, 2L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
GTLT.new = structure(c(2L, 2L, 1L, 1L), .Label = c("", "<"), class = "factor"),
Value.new = c(10, 20, 22.5, 10)),
.Names = c("Station.ID.new", "End.Date.new", "Parameter.Name.new", "GTLT.new", "Value.new"), row.names = c(NA, -4L), class = "data.frame")
Here, the following things are occurring:
- For Enterococci measured in July and Aug 13, there is a GTLT symbol, but Value for both == 10. So average dates. New row is 7/28/2016 and Value 10.
- While Enterococci on Aug 15 is within same season as other values, since GTLT value is different, it would only be averaged in same season of same year with other values of 20. In this case, since it is only one where Value==20, that row does not change and is repeated in final data frame.
- Alkalinity in July and August are same season, so average dates (7/28/16) and Value (22.5) in new row.
- Alkalinity in October is different season, so keep row.
- All other data (such as Station.ID and Parameter.Name) should just be copied since they shouldn't differ here.
If for some reason you have a GTLT and non-GTLT for same parameter:
End.Date GTLT Value Parameter
7/13/2015 < 10 Alk
7/13/2016 < 10 Alk
8/13/2016 10 Alk
8/15/2016 20 Alk
Then final result would be
End.Date GTLT Value Parameter
7/13/2015 < 10 Alk
7/13/2016 < 10 Alk
8/14/2016 15 Alk
Approach
Summary:
- Define seasons and then aggregate using package like
dplyr
? - Create loop function to read row by row (after sort by Parameter.Name then Date?)
As one might expect, this is where I'm stuck.
I know seasons can be defined in R from prior Stack Q's:
New vector of seasons based on dates
And I know that average/aggregation packages such as dplyr
(and possibly zoo
?) can do chaining commands.
My issue is putting this thought process into code that can be repeated for each data set.
I'm not sure if that's the best approach (define seasons and then set conditions for averaging data), or if some sort of loop function would work here by going through row by row of the data set post-sort by Parameter.Name then End.Date.
I quickly sketched my thoughts on what some sort of loop function would have to include:
Note, you can't just average starting row [i] and [i+1] because [i+2], etc. might need averaged as well. Hence finding row [i+n] that breaks loop before last step, averaging all prior rows [i+n-1], and moving on to next new row [i+n].
Further, as clarification, the season would have to be within season of that annual cycle. So 7/13/2016 == 8/13/2016 for same season. 12/12/2015 == 01/01/2016 for same season. But 4/13/2016! == 4/13/2015 in regards to averaging.
Conclusion and Summary
In short, I need help designing code to average individual parameter time-series values by annual season with specific exceptions for multiple large data sets.
I'm not sure of the best approach in designing code to do this, whether it's a large loop function or a combination of code and specialized chaining-enabled packages.
Thank you for your time in advance.
Cheers,
soccernamlak