0

I am working with the R programming language.

I have the following dataset:

set.seed(123)
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)


status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status  <- as.factor(status )


country <- c("A", "B", "C", "D")
country <- sample(country, 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25))
country  <- as.factor(country)



################

disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
disease <- as.factor(disease)

###################
my_data = data.frame(gender, status, disease, country)

In a previous question (R: Pivoting Grouped Frequencies In Terms of their Counts), I learned how to make a summary table that shows the (relative) disease rates for each unique combination of factors:

library(tidyverse)
my_data %>% group_by (gender, status, country, disease) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%")) -> step_1
#> `summarise()` has grouped output by 'gender', 'status', 'country'. You can
#> override using the `.groups` argument.

step_1 |>group_by(country) |>  
  pivot_wider(names_from = disease, 
              values_from = c(n:rel.freq), 
              names_prefix = "disease_") |> 
  mutate(overallPerc = (n_disease_No + n_disease_Yes)/sum(step_1$n))

  gender status    country n_disease_No n_disease_Yes rel.freq_disease_No rel.freq_disease_Yes overallPerc
   <fct>  <fct>     <fct>          <int>         <int> <chr>               <chr>                      <dbl>
 1 Female Citizen   A                308           200 61%                 39%                       0.102 
 2 Female Citizen   B                291           169 63%                 37%                       0.092 
 3 Female Citizen   C                301           228 57%                 43%                       0.106 
 4 Female Citizen   D                245           189 56%                 44%                       0.0868

Now, I am trying to convert this above code into "data.table" format to increase the efficiency of this code.

Here is my attempt:

library(data.table)

my_data <- as.data.table(my_data)

step_1 <- my_data[, .(n = .N), by = c("gender", "status", "country", "disease")][, rel.freq := paste0(round(100 * n/.N, 0), "%")]

step_1 <- dcast(step_1, gender + status + country ~ disease, value.var = c("n", "rel.freq"))[, overallPerc := (n_No + n_Yes)/.N]


    gender    status country n_No n_Yes rel.freq_No rel.freq_Yes overallPerc
 1: Female   Citizen       A  308   200        962%         625%     31.7500
 2: Female   Citizen       B  291   169        909%         528%     28.7500
 3: Female   Citizen       C  301   228        941%         712%     33.0625
 4: Female   Citizen       D  245   189        766%         591%     27.1250

However, many of these percentages are greater than 100 - can someone please show me how I can resolve this problem?

Thanks!

stats_noob
  • 5,401
  • 4
  • 27
  • 83

1 Answers1

1

Should your n/N. instead be n/sum(n) ?

step_1 <- my_data[, .(n = .N), by = c("gender", "status", "country", "disease")][, rel.freq := paste0(round(100 * n/sum(n), 0), "%")]

EDIT: example from your above code

my_data = data.frame(gender, status, disease, country)

library(data.table)
my_data <- as.data.table(my_data)
step_1 <- my_data[, .(n = .N), by = c("gender", "status", "country", "disease")][, rel.freq := paste0(round(100 * n/sum(n), 0), "%"), by=c("disease","gender")]

NicolasH2
  • 774
  • 5
  • 20
  • Thank you so much for your answer! I tried your code and noticed that the overall "rel.freq_No" and the "rel.freq_Yes" sum to 100 for each row? – stats_noob Dec 26 '22 at 18:02
  • Perhaps I am misunderstanding something or I am not applying your code correctly? Maybe if you have time, you could please show an example? Thank you so much! – stats_noob Dec 26 '22 at 18:02
  • the `sum(n)` will take of the entire column of n. Meaning all percentages will add up to 100%. You can change that behavior by adding the `by=...` argument, listing the columns for which you would like to group the rows calculate the `sum(n)` independently. So, e.g. if you want the percentages of all disease==Yes to add up to 100% (and also of all disease==No to add up to 100%) you would write: `[, rel.freq := paste0(round(100 * n/sum(n), 0), "%"), by="disease"]`. You can name as many columns as you like, e.g.: `by=c("disease", "gender")` – NicolasH2 Dec 28 '22 at 10:12
  • @ NicholasH2: you mean something like this? – stats_noob Dec 28 '22 at 16:58
  • step_1 <- my_data[, .(n = .N), by = c("gender", "status", "country", "disease")][, rel.freq := paste0(round(100 * n/sum(n), 0), "%"), by="disease"] – stats_noob Dec 28 '22 at 16:58
  • I now get a new error: Error in `[.data.frame`(my_data, , .(n = .N), by = c("gender", "status", : unused argument (by = c("gender", "status", "country", "disease")) – stats_noob Dec 28 '22 at 16:58
  • data.frame? sounds like you didn't convert to data.table. I edited my post to include a reproducable example that worked for me – NicolasH2 Dec 29 '22 at 21:39