How to transform yes/no rows into proportion with dplyr (preferably)?

Question

Here is the script:

library(dplyr)
library(ggplot2)
load("brfss2013.RData")

test <- brfss2013 %>%
  select(chcscncr,exract11) %>% 
  filter(chcscncr != "NA" , exract11 != "NA") %>% 
  group_by(exract11,chcscncr) %>% 
  summarise(count = n())

Which results in this table:

> head(test)
Source: local data frame [6 x 3]
Groups: exract11 [3]

                                                  exract11 chcscncr count
                                                    <fctr>   <fctr> <int>
1 Active Gaming Devices (Wii Fit, Dance, Dance revolution)      Yes    19
2 Active Gaming Devices (Wii Fit, Dance, Dance revolution)       No   287
3                                  Aerobics video or class      Yes   800
4                                  Aerobics video or class       No  7340
5                                              Backpacking      Yes     4
6                                              Backpacking       No    38

I would like to achieve a table that gives the "yes" proportion of each type of sport, something like:

From

Type     Ans Count
Sport A  yes 45
Sport A  no  55
Sport B  yes 34
Sport B  no  66

to:

Type      p(yes)
Sport A   0.45
Sport B   0.34

A small change to http://stackoverflow.com/questions/29549731/dplyr-finding-percentage-in-a-sub-group-using-group-by-and-summarise - something like: `dat %>% group_by(Type) %>% summarise(p_yes = Count[Ans=="yes"] / sum(Count) )` maybe? I'm no dplyr expert, and that seems a bit clunky, so I'll leave it to the geniuses here at SO to suggest something else. — thelatemail, Aug 22 '16 at 00:59

score 5 · Accepted Answer · answered Aug 22 '16 at 02:17

`prop.table` converts counts to proportions (here, simply `x / sum(x)` over the values within each group), so for your "From" table:

brfss2013 %>%
    select(chcscncr,exract11) %>% 
    na.omit() %>%    # drops rows containing NA; comparing against the string "NA" does not test for missing values
    count(exract11, chcscncr) %>%    # equivalent to `group_by(...) %>% summarise(n = n())`
    group_by(exract11) %>%
    mutate(pct = prop.table(n) * 100)    # `* 100` to convert to percent

## Source: local data frame [144 x 4]
## Groups: exract11 [75]
## 
##                                                    exract11 chcscncr     n      pct
##                                                      <fctr>   <fctr> <int>    <dbl>
## 1  Active Gaming Devices (Wii Fit, Dance, Dance revolution)      Yes    19  6.20915
## 2  Active Gaming Devices (Wii Fit, Dance, Dance revolution)       No   287 93.79085
## 3                                   Aerobics video or class      Yes   800  9.82801
## 4                                   Aerobics video or class       No  7340 90.17199
## 5                                               Backpacking      Yes     4  9.52381
## 6                                               Backpacking       No    38 90.47619
## 7                                                 Badminton      Yes     4 10.52632
## 8                                                 Badminton       No    34 89.47368
## 9                                                Basketball      Yes    37  1.64664
## 10                                               Basketball       No  2210 98.35336
## # ... with 134 more rows
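
To see what the grouped `prop.table` step does on a small scale, here is a minimal sketch on the toy "From" table from the question. The data frame `dat` and its columns `Type`, `Ans` and `Count` are taken from that example, and `toy_props` is just an illustrative name; none of this is the BRFSS data.

```r
library(dplyr)

# Toy data copied from the question's "From" table
dat <- data.frame(
  Type  = c("Sport A", "Sport A", "Sport B", "Sport B"),
  Ans   = c("yes", "no", "yes", "no"),
  Count = c(45, 55, 34, 66)
)

toy_props <- dat %>%
  group_by(Type) %>%
  mutate(p = prop.table(Count)) %>%  # Count / sum(Count) within each Type
  ungroup()

toy_props
```

Because the counts in each toy group sum to 100, `p` is just `Count / 100` here; on real data the denominator is each group's own total.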

For your "to" table, filter to just the "Yes" rows:

brfss2013 %>%
    select(chcscncr,exract11) %>% 
    na.omit() %>% 
    count(exract11, chcscncr) %>%
    group_by(exract11) %>%
    mutate(p_yes = prop.table(n)) %>%
    filter(chcscncr == "Yes")

## Source: local data frame [69 x 4]
## Groups: exract11 [69]
## 
##                                                                 exract11 chcscncr     n      p_yes
##                                                                   <fctr>   <fctr> <int>      <dbl>
## 1               Active Gaming Devices (Wii Fit, Dance, Dance revolution)      Yes    19 0.06209150
## 2                                                Aerobics video or class      Yes   800 0.09828010
## 3                                                            Backpacking      Yes     4 0.09523810
## 4                                                              Badminton      Yes     4 0.10526316
## 5                                                             Basketball      Yes    37 0.01646640
## 6                                             Bicycling machine exercise      Yes   987 0.13708333
## 7                                                              Bicycling      Yes   728 0.08519602
## 8  Boating (Canoeing, rowing, kayaking, sailing for pleasure or camping)      Yes    22 0.11518325
## 9                                                                Bowling      Yes    68 0.09985316
## 10                                                               Boxing       Yes     5 0.01633987
## # ... with 59 more rows

The proportion of "Yes" values is pretty small, as you can see from the first table.
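
Note that the filtered result has 69 groups while the first table has 75: activities with no "Yes" rows disappear once you filter. If those should show up with a proportion of 0 instead, one alternative is to skip the counting step and take the mean of a logical per group. The sketch below uses a made-up row-level data frame `survey` standing in for `brfss2013` (same column names, invented values), so treat it as an illustration, not as output from the real data.

```r
library(dplyr)

# Hypothetical stand-in for brfss2013: one row per respondent, invented values
survey <- data.frame(
  exract11 = c("Running", "Running", "Running", "Running", "Yoga", "Yoga", NA),
  chcscncr = c("Yes", "No", "No", "No", "No", "No", "Yes")
)

p_yes_tbl <- survey %>%
  select(chcscncr, exract11) %>%
  na.omit() %>%                                 # drop respondents with a missing answer
  group_by(exract11) %>%
  summarise(p_yes = mean(chcscncr == "Yes"))    # share of "Yes" answers per activity

p_yes_tbl
```

Here "Yoga" has no "Yes" answers and still appears, with `p_yes` equal to 0, whereas the count-then-filter approach would drop it.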

Great answer, and thanks for the improvements. Naturally, the "yes" values are small because "yes" means "Has a doctor, nurse, or other health professional ever told you that you had skin cancer?". — luisgonzalez, Aug 22 '16 at 02:35
Ha, data makes a lot more sense when you know what it is. Maybe those numbers aren't so low... — alistaire, Aug 22 '16 at 02:51
