I am working with the R programming language.
I have the following dataset:
set.seed(123)
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status )
country <- c("A", "B", "C", "D")
country <- sample(country, 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25))
country <- as.factor(country)
################
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
disease <- as.factor(disease)
###################
my_data = data.frame(gender, status, disease, country)
I want to find out the relative percentage of each unique group of factors that have the disease vs do not have the disease.
As an example:
- What percentage of Male Immigrants from Country A have the disease vs don't have the disease
- What percentage of Male Citizens from Country A have the disease vs don't have the disease (both these percentages should add to 1)
- etc.
I tried to do this with the following code:
# https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr
library(dplyr)
step_1 = my_data %>% group_by (gender, status, country, disease) %>%
summarise (n=n()) %>%
mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))
`summarise()` has grouped output by 'gender', 'status', 'country'. You can override using the
`.groups` argument.
# A tibble: 32 x 6
# Groups: gender, status, country [16]
gender status country disease n rel.freq
<fct> <fct> <fct> <fct> <int> <chr>
1 Female Citizen A No 285 60%
2 Female Citizen A Yes 193 40%
Now (assuming this is correct), I am trying to make some modifications to this result - this should reduce the number of rows by half (i.e. two rows in step_1 for yes and no are now combined into a single row):
step_2 = step_1 %>%
group_by(gender, status, country) %>%
summarize(disease = first(disease),
# number of people in this row that do not have the disease
n_no = sum(disease == "No"),
# number of people in this row that do have the disease
n_yes = sum(disease == "Yes"),
# relative percent of people in this row that do not have the disease
n_no_rel_freq = paste(round(sum(disease == "No") / sum(n) * 100), "%"),
# relative percent of people in this row that do have the disease
n_yes_rel_freq = paste(round(sum(disease == "Yes") / sum(n) * 100), "%"),
# overall percent of all people in this row relative to entire population
overall_percent = sum(n) / sum(step_1$n))
The code seems to run - but many of the percentages are now 0:
# A tibble: 16 x 9
# Groups: gender, status [4]
gender status country disease n_no n_yes n_no_rel_freq n_yes_rel_freq overall_percent
<fct> <fct> <fct> <fct> <int> <int> <chr> <chr> <dbl>
1 Female Citizen A No 1 0 0 % 0 % 0.102
2 Female Citizen B No 1 0 0 % 0 % 0.092
Can someone please show me how to fix this?
Thanks!
Note: The final result should look something like this
# desired result (sample)
gender status country n_no n_yes n_no_rel_freq n_yes_rel_freq total overall_percent
1 female citizen A 285 193 0.6 0.4 478 0.0956