I am working with the R programming language.
I have the following dataset:
set.seed(123)
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status )
country <- c("A", "B", "C", "D")
country <- sample(country, 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25))
country <- as.factor(country)
################
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
disease <- as.factor(disease)
###################
my_data = data.frame(gender, status, disease, country)
In a previous question (R: Pivoting Grouped Frequencies In Terms of their Counts), I learned how to make a summary table that shows the (relative) disease rates for each unique combination of factors:
library(tidyverse)
my_data %>% group_by (gender, status, country, disease) %>%
summarise (n=n()) %>%
mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%")) -> step_1
#> `summarise()` has grouped output by 'gender', 'status', 'country'. You can
#> override using the `.groups` argument.
step_1 |>group_by(country) |>
pivot_wider(names_from = disease,
values_from = c(n:rel.freq),
names_prefix = "disease_") |>
mutate(overallPerc = (n_disease_No + n_disease_Yes)/sum(step_1$n))
gender status country n_disease_No n_disease_Yes rel.freq_disease_No rel.freq_disease_Yes overallPerc
<fct> <fct> <fct> <int> <int> <chr> <chr> <dbl>
1 Female Citizen A 308 200 61% 39% 0.102
2 Female Citizen B 291 169 63% 37% 0.092
3 Female Citizen C 301 228 57% 43% 0.106
4 Female Citizen D 245 189 56% 44% 0.0868
Now, I am trying to convert this above code into "data.table" format to increase the efficiency of this code.
Here is my attempt:
library(data.table)
my_data <- as.data.table(my_data)
step_1 <- my_data[, .(n = .N), by = c("gender", "status", "country", "disease")][, rel.freq := paste0(round(100 * n/.N, 0), "%")]
step_1 <- dcast(step_1, gender + status + country ~ disease, value.var = c("n", "rel.freq"))[, overallPerc := (n_No + n_Yes)/.N]
gender status country n_No n_Yes rel.freq_No rel.freq_Yes overallPerc
1: Female Citizen A 308 200 962% 625% 31.7500
2: Female Citizen B 291 169 909% 528% 28.7500
3: Female Citizen C 301 228 941% 712% 33.0625
4: Female Citizen D 245 189 766% 591% 27.1250
However, many of these percentages are greater than 100 - can someone please show me how I can resolve this problem?
Thanks!