
I have a binary table like this in my R script:

>class(forCount)
[1] "table"

>forCount

                          Gene
Filename    CTX-M-27    IMI-1   IMP-39  IMP-4   KPC-2   NDM-1
batch0_01032019_ENT1    0   1   0   0   0   1
batch0_01032019_ENT2    0   0   0   0   1   1
batch0_01032019_ENT3    0   0   0   0   0   1
batch0_01032019_ENT4    0   0   0   0   0   1
batch0_01032019_ENT5    0   0   0   0   0   1
batch0_01032019_ENT6    0   0   0   0   0   1
batch0_01032019_ENT7    0   0   0   0   0   1

How do I get the following information from this?

NDM-1            5
NDM-1&IMI-1      1
NDM-1&KPC-2      1

Edit 1: The data above was dummy data. As requested by @RonakShah, I am adding the dput output. This is a sample of the actual data in my table.

> dput(forCount)
structure(c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L), .Dim = c(6L, 16L), .Dimnames = structure(list(AssemblyFile = c("batch0_01032019_ENT1110", 
"batch0_01032019_ENT1125", "batch0_01032019_ENT1332", "batch0_01032019_ENT1349", 
"batch0_01032019_ENT1449", "batch0_01032019_ENT1607"), CPGene = c("", 
"CTX-M-27", "IMI-1", "IMP-39", "IMP-4", "KPC-2", "NDM-1", "NDM-4", 
"NDM-5", "NDM-7", "NDM-9", "OXA-181", "OXA-23", "OXA-232", "OXA-48", 
"VIM-4")), .Names = c("AssemblyFile", "CPGene")), class = "table")

From the dput data above, I expect the following output: out of 6 samples, 5 have only KPC-2 and 1 has both KPC-2 and CTX-M-27.

KPC-2              5
KPC-2&CTX-M-27     1
  • Do you need `stack(colSums(forCount))` ? If not can you add `dput(forCount)` and explain your expected output? – Ronak Shah Apr 07 '20 at 10:26
  • Hi @RonakShah. Thank you for the comment. `stack(colSums(forCount))` did not serve my purpose. What I need is the number of samples for each single gene or combination of genes. For example, in the table above there are 5 samples containing only the NDM-1 gene, 1 sample containing both NDM-1 and IMI-1, and 1 sample containing both NDM-1 and KPC-2. Please let me know if that helps. I am going to edit the post to add dput(forCount). Thanks – Prakki Rama Apr 07 '20 at 11:11

2 Answers


You could convert the table to a data frame, paste together the names of the columns that have the value 1 in each row, and then count how often each combination occurs using `table`.

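# Convert the contingency table to a data frame (one row per sample, one column per gene)
df <- as.data.frame.matrix(forCount)
# For each row, join the names of the columns equal to 1, then tabulate the combinations
table(apply(df, 1, function(x) paste(names(df)[which(x == 1)], collapse = " & ")))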

#CTX-M-27 & KPC-2            KPC-2 
#               1                5 
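As a check against the dummy table from the question, here is a minimal, self-contained sketch of the same idea. The matrix is rebuilt by hand from the printed table, so its layout (samples in rows, genes in columns) is an assumption, and the variable names `m`, `combos`, `counts` and `result` are purely illustrative. It uses `as.data.frame()` on the resulting table to get genes on the left and counts on the right.

```r
# Rebuild the dummy table from the question (assumed: samples in rows, genes in columns)
genes   <- c("CTX-M-27", "IMI-1", "IMP-39", "IMP-4", "KPC-2", "NDM-1")
samples <- paste0("batch0_01032019_ENT", 1:7)
m <- matrix(0L, nrow = length(samples), ncol = length(genes),
            dimnames = list(Filename = samples, Gene = genes))
m[, "NDM-1"] <- 1L                          # every sample carries NDM-1
m["batch0_01032019_ENT1", "IMI-1"] <- 1L    # ENT1 also carries IMI-1
m["batch0_01032019_ENT2", "KPC-2"] <- 1L    # ENT2 also carries KPC-2
forCount <- as.table(m)

# One row per sample, one column per gene
df <- as.data.frame.matrix(forCount)

# For each sample, join the names of the genes that are present
combos <- apply(df, 1, function(x) paste(names(df)[x == 1], collapse = " & "))

# Count how many samples share each gene combination
counts <- table(combos)

# Two-column layout: combination on the left, number of samples on the right
result <- as.data.frame(counts, responseName = "n", stringsAsFactors = FALSE)
print(result)
# Counts: "IMI-1 & NDM-1" = 1, "KPC-2 & NDM-1" = 1, "NDM-1" = 5
```

Note that the genes within a combination appear in column order (so `IMI-1 & NDM-1` rather than `NDM-1&IMI-1`), and a sample with no genes at all would be counted under an empty string.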
Ronak Shah
  • Thank you so much for the code @Ronak. Is there a simple way to format the output as shown in my question? Genes to left and numbers to the right – Prakki Rama Apr 07 '20 at 12:24
  • Yes, you could wrap `stack` around it: `stack(table(apply(df, 1, function(x) paste(names(df)[which(x == 1)], collapse = " & "))))[2:1]` – Ronak Shah Apr 07 '20 at 12:35
  • Wonderful!! This is exactly what I wanted. Thank you so much @Ronak for kindly giving me your time. I have been trying to solve this for the past 8 hours. Thank you again! – Prakki Rama Apr 07 '20 at 12:40

We can convert the table to a tibble, which gives one row per sample/gene pair with the cell value in column `n`, and then use a tidyverse approach:

library(dplyr)
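as_tibble(forCount) %>%                      # long format: AssemblyFile, CPGene, n
    filter(n == 1) %>%                       # keep only genes present in a sample
    group_by(AssemblyFile) %>%
    summarise(CPGene = toString(CPGene)) %>% # join each sample's genes into one string
    count(CPGene)                            # count samples per gene combination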
# A tibble: 2 x 2
#  CPGene              n
#* <chr>           <int>
#1 CTX-M-27, KPC-2     1
#2 KPC-2               5
akrun