3

original df:

ID <- c(1,1,1,1,2,2,2,2,3,3,3,3,3)
DX <- c("A","A","B","B","C","C","A","B","A","A","A","B","B")
df <- data.frame(ID,DX)

   ID DX
1   1  A
2   1  A
3   1  B
4   1  B
5   2  C
6   2  C
7   2  A
8   2  B
9   3  A
10  3  A
11  3  A
12  3  B
13  3  B

I try to make a frequency table for DX.

tblFun <- function(x){
  tbl <- table(x)
  res <- cbind(tbl,round(prop.table(tbl)*100,2))
  colnames(res) <- c('Count','Percentage')
  res
}

do.call(rbind,lapply(df[2],tblFun))

  Count Percentage
A     6      46.15
B     5      38.46
C     2      15.38

The calculation above has the denominator 13 (which is the number of observations), but since there are only 3 distinct IDs, the denominator should be 3. i.e: 3 people had A, 3 people had B, 1 person had C, so the calculations should be like the following:

  Count Percentage
A     3      100.00
B     3      100.00
C     1      33.33

How can I transform the data frame so the calculation could be done like the above?

I would appreciate all the help there is! Thanks!

Bruh
  • 277
  • 1
  • 6

3 Answers3

3

After creating the table object, get the rowSums on rowMeans on a logical matrix

m1 <- table(df[2:1]) > 0
cbind(Count = rowSums(m1), Percentage = round(rowMeans(m1)* 100, 2))

-output

  Count Percentage
A     3     100.00
B     3     100.00
C     1      33.33
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks! Could you please explain what [2:1] and >0 mean in creating m1? – Bruh Apr 12 '23 at 15:42
  • @Bruh It is just to get the output in the desired form or else we have to transpose. from `table(df)`. With `table(df)`, the ID column becomes the row attribute, by doing 2:1, we change the column order thus, DX become the row attribute – akrun Apr 12 '23 at 15:48
  • @Bruh `>0` implies that we convert the original count to a logical matrix to replace all nonzero with TRUE and zero with FALSE, when we take the rowSums, TRUE will be counted as 1 and FALSE as 0. We could also do this on the unique data and direclty get the rowSums and rowMeans i.e. `m1 <- table(unique(df[2:1]))` – akrun Apr 12 '23 at 15:50
  • Thanks for the explanation! Just a side question, is it possible to sort the final table by Count frequency? – Bruh Apr 12 '23 at 15:54
  • @Bruh You may do `out <- cbind(Count = rowSums(m1), Percentage = round(rowMeans(m1)* 100, 2));out <- out[order(-out[, "Count"]),]` – akrun Apr 12 '23 at 15:55
3

Using the dplyr package and the pipe operator %>%:

library(dplyr)

# Distinct number of IDs
nID <- n_distinct(df$ID)

df %>%
  # Remove duplicates
  distinct() %>%
  # Count number of IDs, summarise by groups in DX
  summarise(Count = n(), .by = DX) %>%
  # Calculate percentage
  mutate(Percentage = round(Count/nID*100))

P.S.: To order the output according to the "Count" column in descending order, you can add (you need to add the %>% after the last line of the previous code)

      ...   %>%
  # Sort by frequency
  arrange(desc(Count))
DrEspresso
  • 211
  • 5
  • Thanks for the answer! Very helpful. However, I was trying this code on a larger dataset and the n() in the last line seems to be incorrect. This denominator should be the distinct number of IDs. It worked on this fake dataset I created but somehow did not work on a real one. Any suggestions? – Bruh Apr 12 '23 at 16:28
  • @Bruh Yes, thank you for the feedback! You are right, I used the wrong denominator in the calculation of the percentage. To fix this, I edited my answer and added the calculation of the distinct number of ID: `nID <- n_distinct(df$ID)` – DrEspresso Apr 13 '23 at 07:17
1

Something like this:

library(dplyr) # >= 1.1.0
df %>% 
  summarize(Count = n_distinct(ID), .by=DX) %>% 
  mutate(Percentage = round(Count/max(Count)*100, 2))

 DX Count Percentage
1  A     3     100.00
2  B     3     100.00
3  C     1      33.33
TarJae
  • 72,363
  • 6
  • 19
  • 66