Remove all values in a month where any NA values occur within that month

Question

I have a time series of daily precipitation values starting in 1838 and finishing in 2001. There are some missing days in the record that are coded with an NA. My data takes the form:

Year Month Day PPT

1838   1    1   0.0
1838   1    2   NA
1838   1    3   1.3
1838   1    4   0.6

I am now using an aggregate function to sum the daily precipitation totals into monthly means per year of the record using:

aggregate(PPT~Year, df, sum)

This works fine, but the problem is that some of the monthly aggregated sums are not genuine. For example, if there were 20 missing days in the month of January 1838 then the aggregated sum for that month would not be genuine.

What I want to do is for my aggregate function not to return aggregated sums for any months containing missing days (NA), i.e. if even just 1 day is missing I want the month not to be returned. Alternatively I could remove any months containing any NA values before aggregating.

This creates a single NA value and unfortunately does not help with this situation. — DJ-AFC, May 12 '16 at 09:50
It was just a tip, we could aggregate by YearMonth to remove sums from months that are not *genuine*, then aggregate by Year. — zx8754, May 12 '16 at 09:58
Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. — zx8754, May 12 '16 at 09:58

ArunK · Answer 1 · 2016-05-12T10:06:37.530

1

This could potentially work using the dplyr library The data I've used is as follows, I've added an additional row with a different month to show the results correctly

structure(list(year = c(1838, 1838, 1838, 1838, 1838), month = c(1, 1, 1, 1,2), day = c(1, 2, 3, 4, 1), ppt = c(0, NA, 1.3, 0.6, 0.1)), .Names = c("year", "month", "day", "ppt"), row.names = c(NA, -5L), class = "data.frame")

df %>% group_by(year,month) %>% summarise(s = sum(ppt))

The results of this are as follows, where s is the sum of all values within that month

  year month     s
 (dbl) (dbl) (dbl)
1  1838     1    NA
2  1838     2   0.1

edited May 12 '16 at 10:06

answered May 12 '16 at 09:24

ArunK

1,731
16
35

This just produces an additional column in the dataset called s with lots of numbers that make no sense. – DJ-AFC May 12 '16 at 09:49
Thank you so much - this works great. Stupid question probably, but how do I make this a data frame? At the moment this prints the solution but I cannot then do anything further with it unless it's a named object such as a data frame. – DJ-AFC May 12 '16 at 10:12
new_df <- df %>% group_by(year,month) %>% summarise(s = sum(ppt)) – ArunK May 12 '16 at 10:13
Is there any way I could embed an if statement into the above code in order to group PPT only when PPT exceeds a certain threshold (say 0.2mm)? – DJ-AFC May 12 '16 at 14:10
just before the summarise command you can apply a filter.. df %>% group_by(year,month) %>% filter(ppt>0.2) %>% summarise(s=sum(ppt)) – ArunK May 12 '16 at 14:52

score 1 · Accepted Answer · answered May 12 '16 at 10:02

df_precipitation                 <- data.frame(
  Year = rep(x = 2016, times = 61),
  Month = rep(x = c(5, 6), times = c(31, 30)),
  Day = c(1:31, 1:30),
  PPT = c(NA, sample(x = 100, size = 60))
)

df_aggregated                    <- aggregate(
  formula = PPT ~ Year + Month, 
  data = df_precipitation,
  FUN = sum,
  na.action = na.pass
)

df_aggregated                    <- na.omit(object = df_aggregated)

The code creates the data.frame with some mock data and does the aggregation by Year and Month with na.action = na.pass

If any month has a NA value for precipitation, na.pass ensures the sum() function gets the NA value too. And by default sum() function returns NA if any of the input values are NA

Then, we do na.omit which throws away the rows with the sum as NA

Which makes the aggregate function not to return aggregated sums for any months containing missing days

Remove all values in a month where any NA values occur within that month

2 Answers2