1

I'm having trouble figuring out how I can possibly group variables together. for example, I want to average all of candy1, where the data column ~COUNTRY is 'United States' or 'Canada' or 'United Kingdom'. while removing/omitting all rows that include NA(null?)

note: what I'm trying to do for the data frame has over 2000 rows and 140 columns.

Have considered using a for loop but haven't been able to figure out how to properly do it.

Candy_Hierarchy <- tribble(~COUNTRY, ~candy1, ~candy2, ~candy3,
              'United States',2, 0, 1,
              'United States', 1, 2, 0,
              'United States',2, 1, 2,
              'Canada', NA, NA, NA, 
              'Canada', 2, 0, 1,
              'United Kingdom', 1, 2, 0)

into

Candy_Hierarchy <- tribble(~COUNTRY, ~candy1, ~candy2, ~candy3,
              'US, Canada, and UK', 1.6, 1, 0.8, 
              'United States',1.67, 1, 1,
              'Canada', 2, 0, 1,
              'United Kingdom', 1, 2, 0)

'US, Canada, and UK' represents the total average of 'United States', 'Canada', and 'United Kingdom averaged as a whole.

don't really have any error messages as I haven't been able to figure out a way to make it happen.

Felix Chan
  • 21
  • 4

2 Answers2

1

One solution: change the data from "wide" to "long" format using tidyr::gather, to create one column for variable name and one for values. Then you can dplyr::group_by COUNTRY and variable name, dplyr::summarise to get the mean, and tidyr::spread to convert back to wide format.

library(dplyr)
library(tidyr)

tribble(~COUNTRY, ~candy1, ~candy2, ~candy3,
              'United States',2, 0, 1,
              'United States', 1, 2, 0,
              'United States',2, 1, 2,
              'Canada', NA, NA, NA, 
              'Canada', 2, 0, 1,
              'United Kingdom', 1, 2, 0) %>% 
  gather(Var, Val, -COUNTRY) %>% 
  group_by(COUNTRY, Var) %>% 
  summarise(Mean = mean(Val, na.rm = TRUE)) %>% 
  spread(Var, Mean)

Result:

# A tibble: 3 x 4
# Groups:   COUNTRY [3]
  COUNTRY        candy1 candy2 candy3
  <chr>           <dbl>  <dbl>  <dbl>
1 Canada           2         0      1
2 United Kingdom   1         2      0
3 United States    1.67      1      1
neilfws
  • 32,751
  • 5
  • 50
  • 63
  • Why gather and spread? Couldn't you just summarise_at or summarise_all? Something like `Candy_Hierarchy %>% group_by(COUNTRY) %>% summarise_all(funs(mean), na.rm=TRUE)` – thelatemail Jun 18 '19 at 00:11
  • Good solution. But notice that the OP wants to keep rows with all `NA`. – www Jun 18 '19 at 00:15
  • 1
    oh shoot sorry for the confusion, I must of accidentally not deleted that. I actually want to remove all rows with NA. Sorry for the confusion! I have fixed the post so that it shows that i do not want NA in it. – Felix Chan Jun 18 '19 at 00:22
  • @thelatemail Yes I'm sure that would work too. – neilfws Jun 18 '19 at 00:23
  • @FelixChan, In that case, it becomes very easy. Because you don't have to take care of rows with all `NA` first. I will update my post for this. – www Jun 18 '19 at 00:25
  • @www Hi there currently in my data frame, my COUNTRY column contains 5 rows: NA, "0" , "Canada", "United States", "United Kingdom". if i wanted to remove the NA and "0" row how would I go to do that? Im terribly sorry for all the questions. I am very new to R and got thrown into doing a project with incredibly terrible data. – Felix Chan Jun 18 '19 at 00:42
  • @FelixChan You can do `Candy_Hierarchy %>% filter(!is.na(COUNTRY) & COUNTRY != "0")` first. – www Jun 18 '19 at 00:44
  • would that be a separate line of code? so it would look like this? Candy_Hierarchy %>% group_by(COUNTRY) %>% summarise_all(funs(mean), na.rm=TRUE) Candy_Hierarchy %>% filter(!is.na(COUNTRY) & COUNTRY != "0") – Felix Chan Jun 18 '19 at 00:46
  • @www Hi I just realized that I needed to create another row where it would be the total average of "Canada", "United Kingdom" and "United States" combined, do you know how i could possible achieve this? would i need to create a new row somehow? – Felix Chan Jun 18 '19 at 00:48
  • @FelixChan I think I know how to do this but it is difficult to answer it through comments. Please post a new question with reproducible example and expected output. I can take a look. – www Jun 18 '19 at 00:52
  • @www will do thanks. – Felix Chan Jun 18 '19 at 00:55
  • @www I am trying to create a new post, but it is not allowing me to do so – Felix Chan Jun 18 '19 at 01:09
1

Here is one way to do this. We can use filter_at(vars(starts_with("candy")), all_vars(is.na(.))) to remove rows with all NA. After the summarize operation, we can use bind_rows to add those rows back.

library(tidyverse)

Candy_Hierarchy2 <- Candy_Hierarchy %>% rowid_to_column() 

Candy_allNA <- Candy_Hierarchy2 %>%
  filter_at(vars(starts_with("candy")), all_vars(is.na(.))) 

Candy_Hierarchy3 <- Candy_Hierarchy2 %>% 
  anti_join(Candy_allNA, by = "rowid") %>%
  group_by(COUNTRY) %>%
  summarise_at(vars(starts_with("candy")), list(~mean(., na.rm = TRUE))) %>%
  ungroup() %>%
  bind_rows(
    Candy_allNA %>% select(-rowid)
  ) %>%
  arrange(COUNTRY)

Candy_Hierarchy3
# # A tibble: 4 x 4
# COUNTRY        candy1 candy2 candy3
#   <chr>           <dbl>  <dbl>  <dbl>
# 1 Canada           2         0      1
# 2 Canada          NA        NA     NA
# 3 United Kingdom   1         2      0
# 4 United States    1.67      1      1

Update

If no need to preserve rows with all NA, this operation becomes very easy under the tidyverse.

Candy_Hierarchy4 <- Candy_Hierarchy %>%
  group_by(COUNTRY) %>%
  summarise_at(vars(starts_with("candy")), list(~mean(., na.rm = TRUE))) %>%
  ungroup()

Candy_Hierarchy4
# # A tibble: 3 x 4
#   COUNTRY        candy1 candy2 candy3
#   <chr>           <dbl>  <dbl>  <dbl>
# 1 Canada           2         0      1
# 2 United Kingdom   1         2      0
# 3 United States    1.67      1      1
www
  • 38,575
  • 12
  • 48
  • 84
  • would it be possible to answer it here? i cant seem to create a new question. Thanks – Felix Chan Jun 18 '19 at 01:31
  • would creating a new row involve the mutate function? I am struggling to find a way to include the total average of canada, united kingdom and united states – Felix Chan Jun 18 '19 at 02:05