Adding a Proportion Column with Dplyr

Question

Suppose I have the following data frame, which I then summarised to count how often each of a, b, and c appears with Z = 0 and with Z = 1:

X <- 1:10
Y <- c('a', 'b', 'a', 'c', 'b', 'b', 'a', 'a', 'c', 'c')
Z <- c(0, 1, 1, 1, 0, 1, 0, 1, 1, 1)
test_df <- data.frame(X, Y, Z)

(The code below was provided by a Stack Exchange member, thank you!)

library(dplyr)
library(tidyr)
res <- test_df %>% group_by(Y, Z) %>% summarise(N = n()) %>%
  pivot_wider(names_from = Z, values_from = N,
              values_fill = 0)

How might I add a column on the right that gives, for each letter, the proportion of its appearances for which Z = 1? It seems that a basic summary statement should work, but I can't figure out how...

My expected output would be something like

  Z=0 Z=1 PropZ=1
a  2   2     .5
b  1   2     .66
c  0   3     1

akrun · Answer 1 · 2020-12-04T23:20:32.360

4

Perhaps this helps

library(dplyr)
library(tidyr)
test_df %>%
   group_by(Y, Z) %>% 
   summarise(N = n(), .groups = 'drop') %>% 
   left_join(test_df %>%
                group_by(Y) %>% 
                summarise(Prop = mean(Z == 1), .groups = 'drop'),
             by = 'Y') %>% 
   pivot_wider(names_from = Z, values_from = N, values_fill = 0)

-output

# A tibble: 3 x 4
#  Y      Prop   `0`   `1`
#  <chr> <dbl> <int> <int>
#1 a     0.5       2     2
#2 b     0.667     1     2
#3 c     1         0     3
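
As a side note on the `.groups = 'drop'` argument used above, here is a minimal sketch of what it changes, assuming `test_df` is defined as in the question. The names `kept` and `dropped` are only illustrative. By default, `summarise()` peels off the last grouping variable and leaves the rest in place; `'drop'` returns an ungrouped result.

```r
library(dplyr)

test_df <- data.frame(
  X = 1:10,
  Y = c('a', 'b', 'a', 'c', 'b', 'b', 'a', 'a', 'c', 'c'),
  Z = c(0, 1, 1, 1, 0, 1, 0, 1, 1, 1)
)

# Default behaviour, written out explicitly: only the last group (Z) is removed
kept <- test_df %>%
  group_by(Y, Z) %>%
  summarise(N = n(), .groups = 'drop_last')

# 'drop' removes all grouping from the result
dropped <- test_df %>%
  group_by(Y, Z) %>%
  summarise(N = n(), .groups = 'drop')

group_vars(kept)     # still grouped by Y
group_vars(dropped)  # no grouping variables left
```

The counts are the same in both results; only the grouping metadata carried by the tibble differs, which matters for whatever verb comes next in the pipeline.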

edited Dec 04 '20 at 23:20

answered Dec 03 '20 at 22:17

akrun


This provided me with exactly what I was looking for. Thank you. However, for my understanding, I have a few follow-up questions: What does `.groups = 'drop'` do? Likewise, why do we do another summarise inside the command, and why is a left join needed? Also, in the second summarise, how does mean(Z==1) provide the proportion in question? Thank you again! – PortMadeleineCrumpet Dec 03 '20 at 23:13

@PortMadeleineCrumpet The `.groups` part is better explained [here](https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override/62140681#62140681). Regarding why there are two `group_by` calls: you can use `mutate` first, then add those columns to the `group_by` before `summarise`, and remove the `left_join`. I thought it would be easier to understand with a join. As for `mean`, the mean of a logical vector is the proportion of TRUE values, e.g. `mean(c(TRUE, TRUE, FALSE))` – akrun Dec 03 '20 at 23:24
This was helpful. This question is, again, for my understanding. I've done some research on left joins, and I think I get it: what we are doing here is telling R to left join the summary of 'test_df' to the thing we are about to create, which is test_df grouped by Y (organizing by a, b, c) and summarised by the mean of how many within those groups have a value of 1 for Z? If this is correct, then my next question is, how does R know to then name N and Prop with 1 and 0? I didn't see the command that tells it to do that... – PortMadeleineCrumpet Dec 04 '20 at 03:31
@PortMadeleineCrumpet Here, the join is between the two summarised datasets, which share the common column 'Y'. The naming of 'N' and 'Prop' comes from `pivot_wider`: when more than one column is specified in `values_from`, each output name combines the value column with the value of Z – akrun Dec 04 '20 at 23:02
Thank you. I also just noticed an error: if you look at Prop_0 for 'b', it says the proportion is .67, and it should be .33. I think this may have to do with the way that N and Prop are being named. This is still confusing to me... – PortMadeleineCrumpet Dec 04 '20 at 23:05
@PortMadeleineCrumpet I think my initial approach was without seeing your expected output. If you can update in your post, I can change those – akrun Dec 04 '20 at 23:07
I apologize, but I don't understand what you mean... – PortMadeleineCrumpet Dec 04 '20 at 23:08
@PortMadeleineCrumpet If you can edit your post with the expected output, I can crosscheck – akrun Dec 04 '20 at 23:10
I have added an expected output, and you will notice that the values for prop_z=1 are correct. The values in the corresponding column of the code output are mathematically incorrect... It seems like R is repeating prop(z==1) for both Prop_0 and Prop_1 – PortMadeleineCrumpet Dec 04 '20 at 23:15

@PortMadeleineCrumpet you can just remove the 'Prop' from `values_from` and it should give the expected output. I updated – akrun Dec 04 '20 at 23:20
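
The two points raised in the comments above (that the mean of a logical vector is a proportion, and that `mutate` can stand in for the `left_join`) can be sketched as follows. This is one possible reading of the suggestion rather than the answerer's own code, and the name `res_prop` is only illustrative.

```r
library(dplyr)
library(tidyr)

test_df <- data.frame(
  X = 1:10,
  Y = c('a', 'b', 'a', 'c', 'b', 'b', 'a', 'a', 'c', 'c'),
  Z = c(0, 1, 1, 1, 0, 1, 0, 1, 1, 1)
)

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE values
mean(c(TRUE, TRUE, FALSE))

res_prop <- test_df %>%
  group_by(Y) %>%
  mutate(Prop = mean(Z == 1)) %>%   # per-letter proportion, repeated on each row
  group_by(Y, Prop, Z) %>%          # keep Prop as an identifying column
  summarise(N = n(), .groups = 'drop') %>%
  pivot_wider(names_from = Z, values_from = N, values_fill = 0)

res_prop
```

Because `Prop` is constant within each letter, adding it to `group_by` does not split any group further; it only carries the column through `summarise` so that `pivot_wider` treats it as an identifier rather than a value.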

score 1 · Answer 2 · answered Dec 03 '20 at 22:29


I am not sure what your expected output is, but below are some options

u <- xtabs(q ~ Y + Z, cbind(test_df, q = 1))
> u
   Z
Y   0 1
  a 2 2
  b 1 2
  c 0 3

or

> prop.table(u)
   Z
Y     0   1
  a 0.2 0.2
  b 0.1 0.2
  c 0.0 0.3
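
Note that `prop.table(u)` without a margin divides by the grand total, so the values above are shares of all ten rows. A minimal sketch of the per-letter version, assuming the same `test_df` and using the `margin` argument of `prop.table` (the name `row_prop` is only illustrative):

```r
test_df <- data.frame(
  X = 1:10,
  Y = c('a', 'b', 'a', 'c', 'b', 'b', 'a', 'a', 'c', 'c'),
  Z = c(0, 1, 1, 1, 0, 1, 0, 1, 1, 1)
)

# Contingency table of letter by Z
u <- xtabs(~ Y + Z, data = test_df)

# margin = 1 divides each row by its own total, so every row sums to 1
row_prop <- prop.table(u, margin = 1)

# The "1" column is the proportion of each letter with Z = 1
row_prop[, "1"]
```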

answered Dec 03 '20 at 22:29

ThomasIsCoding


Pal R.K. · Accepted Answer · 2020-12-06T11:09:46.677

1

test_df %>%
  group_by(Y) %>%
  summarise(z0 = sum(Z == 0), z1 = sum(Z == 1), PropZ = z1 / n())

edited Dec 06 '20 at 11:09

answered Dec 05 '20 at 04:20

Pal R.K.


score 0 · Answer 4 · answered Dec 04 '20 at 04:31

To calculate proportions of 1 for each letter you can use rowSums.

transform(res, prop_1 = `1`/rowSums(res[-1]))

In dplyr :

library(dplyr)

res %>%
  ungroup %>%
  mutate(prop_1 = `1`/rowSums(.[-1]))

#  Y       `0`   `1` prop_1
#  <chr> <int> <int>  <dbl>
#1 a         2     2  0.5  
#2 b         1     2  0.667
#3 c         0     3  1
