This question is a follow-up to this thread.
I'd like to perform three actions on a disk.frame:

- Count the distinct values of the field id, grouped by two columns (key_a and key_b)
- Count the distinct values of the field id, grouped by the first of the two columns (key_a)
- Add a column with the distinct values for the first column divided by the distinct values across both columns
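For reference, the three actions can be sketched with plain dplyr on a small in-memory data.frame (a hypothetical sample with the same shape, not the real data; since the ids here are sampled without replacement, summing per-group distinct counts happens to be exact):

```r
library(dplyr)

# Hypothetical small sample mirroring the structure of the real data
set.seed(42)
df <- data.frame(
  key_a = rep(letters, 384),
  key_b = rep(rev(letters), 384),
  id = sample(1:10^6, 9984)
)

result <- df %>%
  group_by(key_a, key_b) %>%
  summarise(count = n_distinct(id), .groups = "drop") %>%  # action 1: distinct ids per (key_a, key_b)
  group_by(key_a) %>%
  mutate(count_all = sum(count)) %>%                       # action 2: distinct ids per key_a (exact only because ids are unique here)
  ungroup() %>%
  mutate(percent_of_total = count / count_all)             # action 3: ratio of the two counts
```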
This is my code:
library(disk.frame)
library(dplyr)

my_df <- as.disk.frame(
  data.frame(
    key_a = rep(letters, 384),
    key_b = rep(rev(letters), 384),
    id = sample(1:10^6, 9984)
  )
)
my_df %>%
  select(key_a, key_b, id) %>%
  chunk_group_by(key_a, key_b) %>%
  # stage one: per-chunk aggregation
  chunk_summarize(count = n_distinct(id)) %>%
  collect() %>%
  # stage two: combine the per-chunk results in RAM
  group_by(key_a, key_b) %>%
  mutate(count_summed = sum(count)) %>%
  group_by(key_a) %>%
  mutate(count_all = sum(count)) %>%
  ungroup() %>%
  mutate(percent_of_total = count_summed / count_all)
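One caveat I'd flag about the stage-two sums (my own observation, not something from the linked docs): summing per-chunk n_distinct values only equals the true distinct count when no id appears in more than one chunk for the same group. A minimal base-R illustration:

```r
# If the same id lands in two chunks, summing per-chunk distinct
# counts double-counts it
ids <- c(1, 2, 3, 2, 3, 4)
chunk1 <- ids[1:3]   # contains ids 1, 2, 3
chunk2 <- ids[4:6]   # contains ids 2, 3, 4

summed_chunk_counts <- length(unique(chunk1)) + length(unique(chunk2))
overall_distinct    <- length(unique(ids))

summed_chunk_counts  # 6: ids 2 and 3 are counted once per chunk
overall_distinct     # 4: the true number of distinct ids
```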
My real data is a disk.frame, not a data.frame, and it has 100M rows and 8 columns.
I'm following the two-step instructions described in this documentation.
I'm concerned that the collect will crash my machine, since it brings everything into RAM.

Do I have to use collect in order to use dplyr group-bys in disk.frame?