0

I've recently picked up R programming and have been looking through some group_by/aggregate questions posted here to help me learn better. A question came to my mind earlier today on how group_by/aggregate can incorporate NA data rather than 0.

Given the table and code below (credits to max_lim for allowing me to use his data set), what happens if the field of NA exist (which does happen quite often)?

Farms = c(rep("Farm 1", 6), rep("Farm 2", 6), rep("Farm 3", 6))
Year = rep(c(2020,2020,2019,2019,2018,2018),3)
Cow = c(22,NA,16,12,8,NA,31,NA,3,20,39,34,27,50,NA,NA,NA,NA)
Duck = c(12,12,6,NA,NA,NA,28,13,31,50,33,20,NA,9,19,2,NA,7)
Chicken = c(100,120,80,50,NA,10,27,31,NA,43,NA,28,37,NA,NA,NA,5,43)
Sheep = c(30,20,10,NA,16,13,10,20,20,17,48,12,30,NA,20,NA,27,49)
Horse = c(25,20,16,11,NA,12,14,NA,43,42,10,12,42,NA,16,7,NA,42)
Data = data.frame(Farms, Year, Cow, Duck, Chicken, Sheep, Horse)
Farm Year Cow Duck Chicken Sheep Horse
Farm 1 2020 22 12 100 30 25
Farm 1 2020 NA 12 120 20 20
Farm 1 2019 16 6 80 10 16
Farm 1 2019 12 NA 50 NA 11
Farm 1 2018 8 NA NA 16 NA
Farm 1 2018 NA NA 10 13 12
Farm 2 2020 31 28 27 10 14
Farm 2 2020 NA 13 31 20 NA
Farm 2 2019 3 31 NA 20 43
Farm 2 2019 20 50 43 17 42
Farm 2 2018 39 33 NA 48 10
Farm 2 2018 34 20 28 12 12
Farm 3 2020 27 NA 37 30 42
Farm 3 2020 50 9 NA NA NA
Farm 3 2019 NA 19 NA 20 16
Farm 3 2019 NA 2 NA NA 7
Farm 3 2018 NA NA 5 27 NA
Farm 3 2018 NA 7 43 49 42

If I were to use aggregate(.~Farms + Year, Data, mean) here, I would get Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate which I assume is because the mean function isn't able to account for NA.

Does anyone know how we can modify the aggregate/group_by function to account for the NA by calculating the average using only years without NA data? i.e. 2020: 10, 2019: NA, 2018:20, 2017:NA, 2016:15 -> the average (after discounting NA years 2019 and 2017) will be (10 + 20 + 15) / (3) = 15.

The ideal output will be as follow:

Farm Year Cow Duck Chicken Sheep Horse
Farm 1 2020 22 (avg = 22/1 as one entry is NA) 12 110 25 22.5
Farm 1 2019 14 6 65 10 13.5
Farm 1 2018 8 N.A. (as it's all NA) 10 14.5 12
Farm 2 2020 31 20.5 29 15 14
Farm 2 2019 11.5 40.5 43 18.5 42.5
Farm 2 2018 36.5 26.5 28 30 11
Farm 3 2020 ... ... ... ... ...
Farm 3 2019 ... ... ... ... ...
Farm 3 2018 ... ... ... ... ...
Luther_Proton
  • 348
  • 1
  • 7
  • 1
    Does [this](https://stackoverflow.com/a/16844835) answer your question? – Vishal A. Jan 10 '22 at 07:27
  • Hello, tried the (na.rm=TRUE, na.action=NULL) in aggregate function previously and it did not account in the NA. It calculated the average using all the years and not the years without NA data. Thanks! – Luther_Proton Jan 10 '22 at 08:21

1 Answers1

0

Here is a way to create the wanted data.frame. I think your solution has one error in row 2 (Sheep), where mean(NA, 10) is equal to 5 and not 10.

library(dplyr)

Using aggregate

 Data %>% 
  aggregate(.~Year+Farms,., FUN=mean, na.rm=T, na.action=NULL) %>% 
  arrange(Farms, desc(Year)) %>% 
  as.data.frame() %>%  
  mutate_at(names(.), ~replace(., is.nan(.), NA))

Using summarize

Data %>% 
  group_by(Year, Farms) %>% 
  summarize(MeanCow = mean(Cow, na.rm=T),
            MeanDuck =  mean(Duck, na.rm=T),
            MeanChicken = mean(Chicken, na.rm=T),
            MeanSheep = mean(Sheep, na.rm=T),
            MeanHorse = mean(Horse, na.rm=T)) %>% 
  arrange(Farms, desc(Year)) %>% 
  as.data.frame() %>% 
  mutate_at(names(.), ~replace(., is.nan(.), NA))

Solution for both

      Year  Farms  Cow Duck Chicken Sheep Horse
1 2020 Farm 1 22.0 12.0     110  25.0  22.5
2 2019 Farm 1 14.0  6.0      65  10.0  13.5
3 2018 Farm 1  8.0   NA      10  14.5  12.0
4 2020 Farm 2 31.0 20.5      29  15.0  14.0
5 2019 Farm 2 11.5 40.5      43  18.5  42.5
6 2018 Farm 2 36.5 26.5      28  30.0  11.0
7 2020 Farm 3 38.5  9.0      37  30.0  42.0
8 2019 Farm 3   NA 10.5      NA  20.0  11.5
9 2018 Farm 3   NA  7.0      24  38.0  42.0
JKupzig
  • 1,226
  • 3
  • 13
  • Hello, thank you for the reply! In row 2 (Sheep) there's actually no error because for 2019, the mean(NA, 10) is (in this case) equal to 10 because we have to discount away the NA (and it's year) which means the mean in this case will be 10/1 rather than 10/2 (because one of the year is NA) – Luther_Proton Jan 10 '22 at 09:44
  • Do you want to remove the NA from the table? – Yomi.blaze93 Jan 10 '22 at 09:52
  • Ideally if both of the value for the year is NA, it should be printed as N.A. – Luther_Proton Jan 10 '22 at 09:54
  • @Luther_Proton: see my edit to change all NaN to NA in the data.frame. I'm not sure if you understood me right regarding the sheep entry for row 2. In your example of the solution there is a 5, however it should be 10 as you already stated yourself. – JKupzig Jan 10 '22 at 10:14
  • @JKupzig, Thank you for that! Indeed I made a mistake with the sheep entry for row 2. Made the edits. Your code works well and it has helped greatly :) Thank you so much for your help! – Luther_Proton Jan 10 '22 at 10:28