0

Introduction

Summary:

  • Trying to average data by season (when necessary) when certain conditions are met.

Hello everyone.

I am currently working with numerous large data sets (>200 sets with >5000 rows each) of long-term time series data collection for multiple variables across different locations. So far, I've extracted data into separate CSV files per site and per station.

For the most part, the data reported per parameter is one instance per season.

Season here is defined ecologically as DJF, MAM, JJA, SON for months corresponding to Winter, Spring, Summer, and Fall respectively.

However, there are some cases where there were multiple readings during a seasonal event. Here, the parameter values and dates have to be averaged; this is before further analysis can take place on these data sets.

To complicate things even further, some of the data is marked by a Greater Than or Less Than (GTLT) symbol). In these cases, values and dates are not averaged unless the recorded value is the same.


Data Example

Summary:

  • Code and Tables show requested changes in data-set

So, for a data-driven example...

Here's a few rows from a data set.

Data.Example<-structure(list(
    Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402), 
    End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"), 
    Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
    GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"), 
    Value = c(10, 10, 20, 30, 15, 10)), 
    .Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")

This is ideally what I would like as output

Data.Example.New<-structure(list(
    Station.ID.new = c(13402, 13402, 13402, 13402), 
    End.Date.new = structure(c(2L, 3L, 2L, 1L), .Label = c("10/13/2016", "7/28/2016", "8/15/2016"), class = "factor"), 
    Parameter.Name.new = structure(c(2L, 2L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"), 
    GTLT.new = structure(c(2L, 2L, 1L, 1L), .Label = c("", "<"), class = "factor"), 
    Value.new = c(10, 20, 22.5, 10)), 
    .Names = c("Station.ID.new", "End.Date.new", "Parameter.Name.new", "GTLT.new", "Value.new"), row.names = c(NA, -4L), class = "data.frame")

Here, the following things are occurring:

  • For Enterococci measured in July and Aug 13, there is a GTLT symbol, but Value for both == 10. So average dates. New row is 7/28/2016 and Value 10.
  • While Enterococci on Aug 15 is within same season as other values, since GTLT value is different, it would only be averaged in same season of same year with other values of 20. In this case, since it is only one where Value==20, that row does not change and is repeated in final data frame.
  • Alkalinity in July and August are same season, so average dates (7/28/16) and Value (22.5) in new row.
  • Alkalinity in October is different season, so keep row.
  • All other data (such as Station.ID and Parameter.Name) should just be copied since they shouldn't differ here.

If for some reason you have a GTLT and non-GTLT for same parameter:

End.Date    GTLT    Value    Parameter
7/13/2015     <      10         Alk
7/13/2016     <      10         Alk
8/13/2016            10         Alk
8/15/2016            20         Alk

Then final result would be

End.Date    GTLT    Value    Parameter
7/13/2015     <      10         Alk
7/13/2016     <      10         Alk
8/14/2016            15         Alk

Approach

Summary:

  • Define seasons and then aggregate using package like dplyr?
  • Create loop function to read row by row (after sort by Parameter.Name then Date?)

As one might expect, this is where I'm stuck.

I know seasons can be defined in R from prior Stack Q's:

New vector of seasons based on dates

And I know that average/aggregation packages such as dplyr (and possibly zoo?) can do chaining commands.

My issue is putting this thought process into code that can be repeated for each data set.

I'm not sure if that's the best approach (define seasons and then set conditions for averaging data), or if some sort of loop function would work here by going through row by row of the data set post-sort by Parameter.Name then End.Date.

I quickly sketched my thoughts on what some sort of loop function would have to include:

Rough idea of flow diagram

Note, you can't just average starting row [i] and [i+1] because [i+2], etc. might need averaged as well. Hence finding row [i+n] that breaks loop before last step, averaging all prior rows [i+n-1], and moving on to next new row [i+n].

Further, as clarification, the season would have to be within season of that annual cycle. So 7/13/2016 == 8/13/2016 for same season. 12/12/2015 == 01/01/2016 for same season. But 4/13/2016! == 4/13/2015 in regards to averaging.


Conclusion and Summary

In short, I need help designing code to average individual parameter time-series values by annual season with specific exceptions for multiple large data sets.

I'm not sure of the best approach in designing code to do this, whether it's a large loop function or a combination of code and specialized chaining-enabled packages.

Thank you for your time in advance.

Cheers,

soccernamlak

Community
  • 1
  • 1
  • I think you got a downvote because the question is very long. You could probably have asked this question in a shorter, simpler way. That said, you were clear and provided sample data, so I think you're okay. I wanted to make sure I understand a certain point you make. You say `12/12/2015 == 01/01/2013`, but `4/13/2016 != 4/13/2015`. Are you sure you didn't mean `12/12/2015 == 01/01/*2015*`? – jmartindill Oct 17 '16 at 20:39
  • I do apologize for the length. In regards to the date, I've edited the primary post, but I actually meant `12/12/2015 == 01/01/2016` In short, December 2015 and January 2016 are considered the same season ecologically. – soccernamlak Oct 17 '16 at 21:29

1 Answers1

0

Using dplyr and lubridate I was able to come up with a solution. My output matches your example output, except I did not keep the exact dates, which I felt were misleading in the final result.

Data.Example<-structure(list(
  Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402), 
  End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"), 
  Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
  GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"), 
  Value = c(10, 10, 20, 30, 15, 10)), 
  .Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")

# Create season key
seasons <- data.frame(month = 1:12, season = c(rep("DJF",2), rep("MAM", 3), rep("JJA", 3), rep("SON",3), "DJF"))

# Isolate Month and Year, create Season column
Data.Example$Month <- lubridate::month(as.Date((Data.Example$End.Date), "%m/%d/%Y"))
Data.Example$Year <- lubridate::year(as.Date((Data.Example$End.Date), "%m/%d/%Y"))
Data.Example$Season <- seasons$season[Data.Example$Month]

# Update 'year' where month = December so that it is grouped with Jan and Feb of following year
Data.Example$Year[Data.Example$Month == 12] <- Data.Example$Year[Data.Example$Month == 12]+1

# Find out which station/year/season/paramaters have at least one record with a GTLT
GTLT.Test<- Data.Example %>% 
  group_by(Station.ID, Year, Season, Parameter.Name) %>%
  summarize(has_GTLT = max(nchar(as.character(GTLT))))

# First only calculate averages for groups without any GTLT
Data.Example.New1 <- Data.Example %>% 
  anti_join(GTLT.Test[GTLT_test$has_GTLT == 1,], 
            by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
  group_by(Station.ID, Year, Season, Parameter.Name, GTLT) %>%
  summarize(Value.new = mean(Value))

# Now do the same for groups with GTLT, only combining when values and GTLT symbols match.
Data.Example.New2 <- Data.Example %>% 
  anti_join(GTLT.Test[GTLT_test$has_GTLT == 0,], 
            by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
  group_by(Station.ID, Year, Season, Parameter.Name, GTLT, Value) %>%
  summarize(Value.new = mean(Value)) %>%
  select(-Value)

# Combine both
Data.Example.New <- rbind(Data.Example.New1, Data.Example.New2)

EDIT: I just noticed you linked to another SO question for converting dates to seasons. Mine simply converts by month, not date, and does not use actual seasons. I did this because in your example, Dec. 12 matches with Jan. 1. December 12 is technically fall, so I assumed you weren't using actual seasons, but were instead using four three-month groupings.

jmartindill
  • 260
  • 1
  • 8
  • First, I sincerely appreciate your time and help. In regards to converting dates to seasons, your approach was the one I was looking for: three month groupings and not the technical seasonality dates. – soccernamlak Oct 17 '16 at 21:30
  • 1
    I have two followup questions in regards to your code. First, in regards to defining seasons by groupings, would that not make Jan. 2010 and Dec. 2010 considered the same season? If so, then my guess is the easiest way would be to just add a few lines in where if the month==12, then add +1 to the year (unless you can think of easier way) – soccernamlak Oct 17 '16 at 21:34
  • Second, In regards to using exact dates vs. average date vs. three month season code, I think (at least) part of the reason for doing so is to make it easier on sort + plotting. One way I think I can get around that is making the grouping codes start with a number (01DJF, 02MAM, etc.). I need to check with the Principal Investigators on this project to see if that's fine; I do agree with you in regards to why excluding the date would make the most sense. – soccernamlak Oct 17 '16 at 21:35
  • I added a line that fixes the December year issue. – jmartindill Oct 17 '16 at 22:26
  • Regarding the grouping codes, would you want all records in a given season to have the same codes, and for each season to have its own code? If so, having the season and year might make sense, e.g. ("2015-1", "2015-2", "2015-3", "2015-4", "2016-1", "2016-2", "2016-3", ...) – jmartindill Oct 17 '16 at 22:29
  • Most likely; seems to make the most sense in regards to data analysis. Especially when plotting. I figure easiest way was using `paste` for Year and Season column in that order. That way sort would work in regards to time. – soccernamlak Oct 18 '16 at 06:02