1

Following problem is causing me a real bad headache.

I have a big dataset that looks like this.

Name   Date         C1   C2    C3    C4    C5    C6   C7
 A     2008-01-03   100
 A     2008-01-05   NA
 A     2008-01-07   120
 A     2008-02-03   NA
 A     2008-03-10   50
 A     2008-07-14   70
 A     2008-07-15   NA
 A     2009-01-03   40
 A     2009-01-05   NA
 A     2010-01-07   NA
 A     2010-03-03   30
 A     2010-03-10   20
 A     2011-07-14   10
 A     2011-07-15   NA
 B     2008-01-03   NA
 B     2008-01-05   5
 B     2008-01-07   3
 B     2008-02-03   11
 B     2008-03-10   13
 B     2008-07-14   ....

As you can see, there are a lot of NAs in my observations. The other columns look similar and the dataset has +100.000 rows. So its huge.

What I want to do is, I want aggregate my data the following way. For example C1: I want to build the monthly average for each Name and for each year and each month in a timeframe from like 2000-01 until 2012-12.

The monthly average should be calculated using the dates from each month which are available.

When the calculations are done, my dataset should look like this.

Name   Date         C1          C2    C3    C4    C5    C6   C7
 A     2008-01  monthly average
 A     2008-02  monthly average
 A     2008-03  monthly average
 A     2008-04  monthly average
 A     2008-05  monthly average
 A     2008-06  monthly average
 A     2008-07  monthly average
 A     2008-08  monthly average
 A     2008-09  monthly average
 A     2008-10  monthly average
 A     2008-11  monthly average
 A     2008-12  monthly average
 A     2009-01  monthly average

 B     2008-01  monthly average
 B     2008-02  monthly average
 B     2008-03  monthly average
 B     2008-04  monthly average
 B     2008-05  monthly average
 B     2008-06   ....

So my output data should show for each name each month of the year. And the values are either NA if the month had only NA-Values or they are the monthly average of this certain month.

For example:

   Name    Date       C1
   A     2008-01-03   100
   A     2008-01-05   NA
   A     2008-01-07   120

Here we would expect:

   Name    Date       C1
   A     2008-01   (100+120)/2 = 110

For example:

   Name    Date       C1
   A     2008-01-03   NA
   A     2008-01-05   NA
   A     2008-01-07   NA

Here we would expect:

    Name    Date       C1
    A     2008-01   NA

For example:

   Name    Date       C1
   A     2008-01-03   100
   A     2008-01-05   50
   A     2008-01-07   120

Here we would expect:

    Name    Date       C1
    A     2008-01    (100+50+120)/3 = 90

As I am relatively new to r and I dont know how to solve this, I am hoping to find someone who can tackle this and show me how something like this can be solved. I would be really thankful for your support :)

bli12blu12
  • 357
  • 2
  • 11
  • This might help https://stackoverflow.com/a/49039111/786542 – Tung May 11 '18 at 17:25
  • Please share your data as well. See more here [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Tung May 11 '18 at 17:26
  • Take a look at the `stats` package (i.e., no install). You will want to `aggregate` by (Name, Date), after formatting date as %Y-%m then pass `mean` as function. This should get you started: `aggregate(.~Name+MonthDate, data, FUN=mean, na.rm=TRUE)`. – Parfait May 11 '18 at 17:59

2 Answers2

2
library(dplyr)

#generating sample data 

data <- data.frame(Name = c(rep("A",25), rep("B",50)),
                   Date = seq(as.Date("2018-01-01"), as.Date("2020-01-12"), by = 10),
                   C1 = rep(c(100,NA,NA,NA,NA,500,320,102,412,NA,200,NA,145,800,230),5))

#grouping by Name and Month and summarizing mean of values  

data %>% 
group_by(Name, month = cut(Date, "month")) %>% 
summarise(C1 = mean(C1, na.rm = TRUE)) %>% mutate(C1 = ifelse(is.nan(C1),NA,C1))
ANS
  • 21
  • 1
  • Why do you want OP to type `C1, C2, C3` etc multiple times? – MKR May 11 '18 at 18:00
  • @MKR you are right, I will go with summarise_all instead of summarise, but I will stick to 'cut' instead of 'format' to get month from dates. This way we can use it to sort and merge with other date objects – ANS May 11 '18 at 18:21
  • @bli12blu12 do you have non-numeric data in any of the fields C1 thru C7? Are your NULL values are stored as 'NA' and not actual nulls? In that case you need to convert them using something like data$C1[data$C1=="NA"] <- NA – ANS May 15 '18 at 17:13
1

You can dplyr::summarise_all to calculate average for all columns C1,C2..etc.

First group_by on Name and YearMon and deselect Date column and then use summarise_all

library(dplyr)
library(lubridate)

#Added C2 to demonstrate calculation for multiple columns in one go.
df %>% mutate(Date = ymd(Date), C2 = C1*2) %>%  
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  select(-Date) %>%
  summarise_all("mean", na.rm=TRUE)


#OR - Use summarise_at and calculate mean for all columns starting with 'C'
df %>% mutate(Date = ymd(Date), C2 = C1*2) %>% 
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  summarise_at(vars(starts_with("C")), mean, na.rm=TRUE)

# A tibble: 12 x 4
# Groups: Name [?]
   Name  YearMon     C1     C2
   <chr> <chr>    <dbl>  <dbl>
 1 A     2008-01 110    220   
 2 A     2008-02 NaN    NaN   
 3 A     2008-03  50.0  100   
 4 A     2008-07  70.0  140   
 5 A     2009-01  40.0   80.0 
 6 A     2010-01 NaN    NaN   
 7 A     2010-03  25.0   50.0 
 8 A     2011-07  10.0   20.0 
 9 B     2008-01   4.00   8.00
10 B     2008-02  11.0   22.0 
11 B     2008-03  13.0   26.0 
12 B     2008-07 NaN    NaN 

Data:

df <- read.table(text = 
"Name   Date         C1  
A     2008-01-03   100
A     2008-01-05   NA
A     2008-01-07   120
A     2008-02-03   NA
A     2008-03-10   50
A     2008-07-14   70
A     2008-07-15   NA
A     2009-01-03   40
A     2009-01-05   NA
A     2010-01-07   NA
A     2010-03-03   30
A     2010-03-10   20
A     2011-07-14   10
A     2011-07-15   NA
B     2008-01-03   NA
B     2008-01-05   5
B     2008-01-07   3
B     2008-02-03   11
B     2008-03-10   13
B     2008-07-14   NA",
header = TRUE, stringsAsFactors = FALSE)
MKR
  • 19,739
  • 4
  • 23
  • 33
  • 1
    @bli12blu12 Thanks to inform me. Its fairly easy to accommodate additional expectation. And its good to have a new question for that as well. – MKR May 12 '18 at 13:14