Hard time with dplyr - n_distinct or spread - working with subgroups

Question

I imagine this question is duplicated, but I could not find any working answer to use dplyr in an easy and elegant way to add subgroup counts after a group_by. If this question is duplicate, please delete. If you want a code to reproduce, I´ll do that. Please, don´t click on "negative".

I´ve tried to use spread, but it was not useful, after, I´ve tried to follow the instructions here, once it helps to Count unique by group in a data frame, but it does not work. The same solution is here, but the output is strange.

What I have:

What I really want (using simple code... I imagine dplyr can handle that without having to use gather()), is to insert three new columns, for each factor level.

My code:

descritivos %>% 
  group_by(sexo) %>% 
  summarise(n=n(),Idade_media = mean(idade, na.rm=T),
  idade_sd=sd(idade, na.rm=T),
              qtde_sexo = n(),
  Proporção_sexo = n()/nrow(.),
  Pontuação_media=mean(total),
  pontuacao_sd=sd(total), n_unique = n_distinct(Escolaridade))

With this code, I was almost there, but it duplicates some output.

descritivos %>% 
  group_by(sexo, Escolaridade) %>% 
  summarise(n=n(),Idade_media = mean(idade, na.rm=T),
  idade_sd=sd(idade, na.rm=T),
              qtde_sexo = n(),
  Proporção_sexo = n()/nrow(.),
  Pontuação_media=mean(total),
  pontuacao_sd=sd(total), n_unique = n_distinct(Escolaridade)) %>% spread(Escolaridade, n)
  spread(count(Escolaridade), n, fill=0)

This is a reproducible code to work with:

library(tidyverse)
ds <- data.frame(sex=c(0,1), schooling=c("k12","high","college","university"), age=rnorm(mean=20,sd=2, n=40))

ds %>% group_by(sex, schooling) %>% 
  summarise(mean(age), n=n()) %>% spread(schooling, n)


ds %>% group_by(sex, schooling) %>% 
  summarise(n()) %>% t()

The desired output: Thanks much

Last edit:

Thanks to @Akrun, I solved my question. If you have the same, please follow this code:

descritivos %>% 
  group_by(sexo) %>%
  group_by(Escolaridade, 
  Idade_media = mean(idade, na.rm=T),
  idade_sd=sd(idade, na.rm=T),
              qtde_sexo = n(),
  Proporção_sexo = n()/nrow(.),
  Pontuação_media=mean(total),
  pontuacao_sd=sd(total), add=TRUE) %>%
  summarise(n=n()) %>% 
  spread(Escolaridade, n)

or this code to the reproducible code:

ds %>% group_by(sex) %>% 
  group_by(schooling = paste0("school", schooling), Mean = mean(age), 
           ndist = n_distinct(schooling), add = TRUE) %>% summarise(n = n()) %>% 
  spread(schooling, n)

I don't understand the question. What is your desired output and why do you need the `spread` at the end of your code chunk? — Weihuang Wong, May 28 '18 at 14:45
Please use `dput` to show a small reproducible example instead of images — akrun, May 28 '18 at 14:45
Based on the edited (with reproducible example), you showed two pipe chunks with group by. Are you looking to combine those two? Where is the output of `n_distinct(schooling)` in the combined — akrun, May 28 '18 at 15:13
Hello again, @akrun .Yes, I would like to have just one table with group_by and something like count(schooling). n_distinct(schooling) is not working to count the observations in all subgroup. — Luis, May 28 '18 at 15:17
It is not clear about the expected output. For the 0 sex row, you showed schooling values and for sex 1, it is number of rows. Those seems like two different features — akrun, May 28 '18 at 15:21
Perhaps `ds %>% group_by(sex) %>% group_by(schooling = paste0("school", schooling), Mean = mean(age), ndist = n_distinct(schooling), add = TRUE) %>% summarise(n = n()) %>% spread(schooling, n)` this helps — akrun, May 28 '18 at 15:23
Hey, @akrun, I have edit the original post to clarify my question, but the last answer you provide works !! Thanks thanks much! I spent more than 3 hours searching for that!! Thanks again! — Luis, May 28 '18 at 15:35

akrun · Accepted Answer · 2018-05-28T16:07:54.690

1

We can do this in a single chain

ds %>% 
  group_by(sex) %>%
  group_by(schooling = paste0("school", schooling), Mean = mean(age), 
                ndist = n_distinct(schooling), add = TRUE) %>% 
  summarise(n = n()) %>% 
  spread(schooling, n)

edited May 28 '18 at 16:07

answered May 28 '18 at 15:37

akrun

874,273
37
540
662

1

Thanks! It´s good to know I can have two "group_by" in a single chain!! – Luis May 28 '18 at 15:38
Just for the clarity, is it possible to use n() inside group_by sex to report as well this output ? [sorry, this output is already reported] – Luis May 28 '18 at 15:56
@Luis Yes, sure, in that case `ds %>% group_by(sex) %>% group_by(n = n(), add = TRUE) %>% group_by(schooling = paste0("school", schooling), Mean = mean(age), ndist = n_distinct(schooling), add = TRUE) %>%` Then use the `summarise` with new `n()` ?\ – akrun May 28 '18 at 16:06
since you know how to deal with dplyr, I´m trying to add a new row with means, like that: bind_rows(summarise_all(., funs(if(is.numeric(.)) mean(.) else "Total"))), but it´s not working. Could you please help me with that ? Thanks much! – Luis May 29 '18 at 15:47
@Luis with the example you showed, the below code works `ds %>% summarise_all(funs(if(is.numeric(.)) mean(.) else "Total")) %>% bind_rows(ds, .)` There will be some friendly warnings about coercing the levels as some columns are `factors` . – akrun May 29 '18 at 16:05

Hard time with dplyr - n_distinct or spread - working with subgroups

1 Answers1