I want to do something which appears simple, but I don't have a good feel for R yet, it is a maze of twisty passages, all different.
I have a table with several variables, and I want to group on two variables ... I want a two-level hierarchical grouping, also known as a tree. This can evidently be done using the group_by function of dplyr.
And then I want to compute marginal statistics (in this case, relative frequencies) based on group counts for level 1 and level 2.
In pictures, given this table of 18 rows:
I want this table of 6 rows:
Is there a simple way to do this in dplyr? (I can do it in SQL, but ...)
Edited for example
For example, based on the nycflights13
package:
library(dplyr)
install.packages("nycflights13")
require(nycflights13)
data(flights) # contains information about flights, one flight per row
ff <- flights %>%
mutate(approx_dist = floor((distance + 999)/1000)*1000) %>%
select(carrier, approx_dist) %>%
group_by(carrier, approx_dist) %>%
summarise(n = n()) %>%
arrange(carrier, approx_dist)
This creates a tbl ff
with the number of flights for each pair of (carrier, inter-airport-distance-rounded-to-1000s):
# A tibble: 33 x 3
# Groups: carrier [16]
carrier approx_dist n
<chr> <dbl> <int>
1 9E 1000 15740
2 9E 2000 2720
3 AA 1000 9146
4 AA 2000 17210
5 AA 3000 6373
And now I would like to compute the relative frequencies for the "approx_dist" values in each "carrier" group, for example, I would like to get:
carrier approx_dist n rel_freq
<chr> <dbl> <int>
1 9E 1000 15740 15740/(15740+2720)
2 9E 2000 2720 2720/(15740+2720)