1

So I have lots of data in the form (4 values for each day)

  date       var1      var2 
1  2003-10-28    1.2       970     
2  2003-10-28     NA       510     
3  2003-10-28     NA       640     
4  2003-10-28     NA       730     
5  2003-10-30    2.0       570     
6  2003-10-30     NA       480     
7  2003-10-30    1.2       580     
8  2003-10-30    1.2       297     
9  2002-05-07    3.0       830     
10 2002-05-07    4.8       507     
11 2002-05-07    4.8       253     
12 2002-05-07     NA       798     

and I need to calculate sums for var1 for every day, IF there is for example less than 2 NA values (or none) for that specific date and otherwise that date should be ignored. At the same time I should calculate means of var2 for the same dates, IF the sum for var1 was also calculated. Then I should save those means, sums and dates to another data frame so that those ignored dates aren't there.

I have tried all kinds of loop structures, but I get confused by the fact that mean and sum have to be calculated for the dates where there are not NAs at all. Also saving the dates, means and sums gets me into difficulties because I have no idea how to do the indexing properly.

so expected output from this sample data should look like

  date      sum(var1)  mean(var2)
1 2003-10-30    4.8         480.75
2 2002-05-07    17.4        561.75

2 Answers2

1

Here is an option with data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'date' if the number of NA values in 'var1' is less than 3, then get the sum of 'var1' and mean of 'var2'.

library(data.table)
setDT(df1)[,if(sum(is.na(var1)) < 3) .(Sum = sum(var1, na.rm=TRUE), 
           Mean = mean(var2, na.rm=TRUE)) , by = date]
#          date  Sum   Mean
#1: 2003-10-30  4.4 481.75
#2: 2002-05-07 12.6 597.00
akrun
  • 874,273
  • 37
  • 540
  • 662
0

Using dplyr. Assuming your original dataset is df

library(dplyr)
df %>% group_by(date) %>% filter(sum(is.na(var1)) <= 2)%>% summarise(Sum = sum(var1, na.rm = T), Mean = mean(var2, na.rm = T))

Data

df <- read.table(text = "         date var1 var2
1  2003-10-28  1.2  970
2  2003-10-28   NA  510
3  2003-10-28   NA  640
4  2003-10-28   NA  730
5  2003-10-30  2.0  570
6  2003-10-30   NA  480
7  2003-10-30  1.2  580
8  2003-10-30  1.2  297
9  2002-05-07  3.0  830
10 2002-05-07  4.8  507
11 2002-05-07  4.8  253
12 2002-05-07   NA  798",header =TRUE)

Output

Source: local data frame [2 x 3]

    date   Sum   Mean
  (date) (dbl)  (dbl)
1 2002-05-07  12.6 597.00
2 2003-10-30   4.4 481.75
user2100721
  • 3,557
  • 2
  • 20
  • 29
Sumedh
  • 4,835
  • 2
  • 17
  • 32