Relative frequencies/proportions with dplyr create new columns instead of rows

Question

This question is inspired by this and this question.

I am trying to calculate the proportion of different values within each group, but I do not want to create "new" rows for the groups but new columns.

Taking the example from the second question above. If I have the following data:

data <- structure(list(value = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L), class = structure(c(1L, 1L, 1L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("A", 
"B"), class = "factor")), .Names = c("value", "class"), class = "data.frame", row.names = c(NA, 
-16L))

I can calculate the proportion of each value (1,2,3) in each class (A,B):

data %>%
    group_by(value, class) %>%
    summarise(n = n()) %>%
    complete(class, fill = list(n = 0)) %>%
    group_by(class) %>%
    mutate(freq = n / sum(n))
# A tibble: 6 x 4
  value  class     n      freq
  <int> <fctr> <dbl>     <dbl>
1     1      A     3 0.2727273
2     1      B     3 0.6000000
3     2      A     4 0.3636364
4     2      B     2 0.4000000
5     3      A     4 0.3636364
6     3      B     0 0.0000000

However I end up with a line for each value/class pair instead I want something like this:

# some code
# A tibble: 6 x 4
   class     n      1        2         3
  <fctr> <dbl>     <dbl>    <dbl>     <dbl>
1    A     11 0.2727273  0.3636364  0.3636364
2    B     5  0.6000000  0.4000000  0.0000000

With a column for each group. I could write for loops to construct a new data frame from the old one but I am certain there is a better way. Any suggestions?

Thank you

akrun · Accepted Answer · 2020-01-06T22:31:28.717

We can use pivot_wider at the end

library(dplyr)
library(tidyr)
data %>%
    group_by(value, class) %>%
    summarise(n = n()) %>%
    complete(class, fill = list(n = 0)) %>%
    group_by(class) %>%
    mutate(freq = n / sum(n), n = sum(n)) %>% 
    pivot_wider(names_from = value, values_from = freq)
# A tibble: 2 x 5
# Groups:   class [2]
#  class     n   `1`   `2`   `3`
#  <fct> <dbl> <dbl> <dbl> <dbl>
#1 A        11 0.273 0.364 0.364
#2 B         5 0.6   0.4   0

Or as @IcecreamToucan mentioned, the complete is not needed as the pivot_wider have the option to fill with a custom value (default is NA)

data %>% 
    group_by(value, class) %>% 
    summarise(n = n()) %>%  
    group_by(class) %>%
    mutate(freq = n / sum(n), n = sum(n)) %>% 
    pivot_wider(names_from = value, values_from = freq, values_fill = list(freq = 0))

If we are using a previous version of tidyr, then use spread

data %>%
    group_by(value, class) %>%
    summarise(n = n()) %>%
    complete(class, fill = list(n = 0)) %>%
    group_by(class) %>%
    mutate(freq = n / sum(n), n = sum(n)) %>% 
    spread(value, freq)

So this looks like it is exactly what I want, but it is telling me `could not find function "pivot_wider"`. My version of tidyr is 0.8.3 so maybe I need to update it. I will work more on this after my next meeting. — goryh, Jan 06 '20 at 22:06
@goryh It is from `tidyr`. I used `packageVersion('tidyr')# [1] ‘1.0.0’` — akrun, Jan 06 '20 at 22:06

score 2 · Answer 2 · answered Jan 06 '20 at 22:24

Method using data.table::dcast instead of pivot_wider.

Line 1: Get a count (.N) for each (value, class) group, and call it n

Line 2: Make new variables within each class group:

N, the sum of the previous counts
pct, the percent of N each n makes up

Line 3: Cast to wide with class and N as the rows, value as the column names, and pct as the column elements, with empty elements set to 0.

library(magrittr) # For %>%. Not necessary if dplyr is loaded already
library(data.table)
setDT(data)

data[, .(n = .N), by = .(value, class)] %>% 
    .[, `:=`(N = sum(n), pct = n/sum(n)), by = class] %>% 
  dcast(class + N ~ value, value.var = 'pct', fill = 0)

#    class  N         1         2         3
# 1:     A 11 0.2727273 0.3636364 0.3636364
# 2:     B  5 0.6000000 0.4000000 0.0000000

Thank you. Another priority has come up so I do not have time to try your answer yet. — goryh, Jan 07 '20 at 00:17

score 0 · Answer 3 · answered Jan 07 '20 at 04:26

0

We can use count to count occurrences of value and class, group_by class, calculate the frequency and get the data in wide format.

library(dplyr)
library(tidyr)

data %>%
  count(value, class) %>%
  group_by(class) %>%
  mutate(freq = n/sum(n), n = sum(n)) %>%
  pivot_wider(names_from = value, values_from = freq, values_fill = list(freq = 0))

# class     n   `1`   `2`   `3`
#  <fct> <int> <dbl> <dbl> <dbl>
#1 A        11 0.273 0.364 0.364
#2 B         5 0.6   0.4   0

answered Jan 07 '20 at 04:26

Ronak Shah

377,200
20
156
213

This looks to be the same as akrun's asnwer. – goryh Jan 10 '20 at 19:33
@goryh I haven't used `group_by` + `summarise` and instead replaced it with `count`. – Ronak Shah Jan 11 '20 at 01:03
Ah I did not look close enough the first time. – goryh Jan 11 '20 at 17:11

Relative frequencies/proportions with dplyr create new columns instead of rows

3 Answers3