How can I count the expression levels of each gene across patients and compute their frequencies?

Question

I have a dataset of 566 genes and 208 patients. After several attempts, I reshaped my data into a long data frame with one row per gene and patient.

Here is a reproducible sample of that data frame, 'df1':

structure(list(Genes = c("ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2",  "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2",  "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2",  "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2",  "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2",  "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2",  "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2",  "ERLIN2", "ERLIN2", "ERLIN2", "ERLIN2"), Patients = c("TCGA-A1-A0SE-01",  "TCGA-A1-A0SH-01", "TCGA-A1-A0SJ-01", "TCGA-A1-A0SK-01", "TCGA-A1-A0SM-01",  "TCGA-A1-A0SO-01", "TCGA-A1-A0SP-01", "TCGA-A2-A04R-01", "TCGA-A2-A0CT-01",  "TCGA-A2-A0EN-01", "TCGA-A2-A0EU-01", "TCGA-A2-A0ST-01", "TCGA-A2-A0SU-01",  "TCGA-A2-A0SV-01", "TCGA-A2-A0SW-01", "TCGA-A2-A0SX-01", "TCGA-A2-A0SY-01",  "TCGA-A2-A0T0-01", "TCGA-A2-A0T1-01", "TCGA-A2-A0T2-01", "TCGA-A2-A0T4-01",  "TCGA-A2-A0T5-01", "TCGA-A2-A0T6-01", "TCGA-A2-A0T7-01", "TCGA-A2-A0YC-01",  "TCGA-A2-A0YD-01", "TCGA-A2-A0YE-01", "TCGA-A2-A0YF-01", "TCGA-A2-A0YG-01",  "TCGA-A2-A0YH-01", "TCGA-A2-A0YI-01", "TCGA-A2-A0YJ-01", "TCGA-A2-A0YK-01",  "TCGA-A2-A0YL-01", "TCGA-A2-A0YM-01", "TCGA-A2-A0YT-01", "TCGA-A7-A0D9-01",  "TCGA-A7-A13D-01", "TCGA-A7-A13E-01", "TCGA-A7-A13F-01", "TCGA-A8-A075-01",  "TCGA-A8-A08O-01", "TCGA-A8-A0A6-01", "TCGA-A8-A0AD-01", "TCGA-AN-A0XL-01",  "TCGA-AN-A0XN-01", "TCGA-AN-A0XO-01", "TCGA-AN-A0XP-01", "TCGA-AN-A0XR-01",  "TCGA-AN-A0XS-01"), levels = structure(c(ERLIN2 = 1L, ERLIN2 = 1L,  ERLIN2 = 1L, ERLIN2 = 1L, ERLIN2 = 1L, ERLIN2 = 1L, ERLIN2 = 1L,  ERLIN2 = 2L, ERLIN2 = 1L, ERLIN2 = 2L, ERLIN2 = 2L, ERLIN2 = 1L,  ERLIN2 = 2L, ERLIN2 = 1L, ERLIN2 = 2L, ERLIN2 = 1L, ERLIN2 = 1L,  ERLIN2 = 1L, ERLIN2 = 1L, ERLIN2 = 1L, ERLIN2 = 1L, ERLIN2 = 2L,  ERLIN2 = 2L, ERLIN2 = 1L, ERLIN2 = 2L, ERLIN2 = 2L, ERLIN2 = 2L,  ERLIN2 = 2L, ERLIN2 = 1L, ERLIN2 = 2L, ERLIN2 = 1L, ERLIN2 = 1L,  ERLIN2 = 2L, ERLIN2 = 2L, ERLIN2 = 1L, ERLIN2 
= 1L, ERLIN2 = 2L,  ERLIN2 = 1L, ERLIN2 = 2L, ERLIN2 = 2L, ERLIN2 = 1L, ERLIN2 = 1L,  ERLIN2 = 1L, ERLIN2 = 2L, ERLIN2 = 1L, ERLIN2 = 2L, ERLIN2 = 2L,  ERLIN2 = 1L, ERLIN2 = 2L, ERLIN2 = 1L), .Label = c("down", "up" ), class = "factor")), row.names = c(NA, -50L), class = c("grouped_df",  "tbl_df", "tbl", "data.frame"), groups = structure(list(Genes = c("ERLIN2",  "ERLIN2"), levels = structure(1:2, .Label = c("down", "up"), class = "factor"), 
    .rows = list(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 12L, 14L, 
    16L, 17L, 18L, 19L, 20L, 21L, 24L, 29L, 31L, 32L, 35L, 36L, 
    38L, 41L, 42L, 43L, 45L, 48L, 50L), c(8L, 10L, 11L, 13L, 
    15L, 22L, 23L, 25L, 26L, 27L, 28L, 30L, 33L, 34L, 37L, 39L, 
    40L, 44L, 46L, 47L, 49L))), row.names = c(NA, -2L), class = c("tbl_df",  "tbl", "data.frame"), .drop = TRUE))

'Genes' column: gene names

'Patients' column: patient IDs

'levels' column: the expression level of each of the 566 genes in each patient, either 'up' or 'down'

Now I want to create a data frame 'df2' with a column 'total', the number of patients at each expression level for each gene, and a column 'frequency' = (the number of patients at each expression level for each gene) / 208 (i.e., the total number of patients).

It looks like, for example:

Genes    levels   total   frequency
ERLIN2   up       50      24.03%
ERLIN2   down     11      5.28%
HER2     up       15      7.21%
HER2     down     45      21.63%
....

Any help would be appreciated. Thanks!

Please include a reproducible question as suggested here: [How to ask good question](https://stackoverflow.com/help/minimal-reproducible-example) and [Reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Include your data (as a data frame object, or use dput(yourdata)), the code you have tried, and your expected output. This will make it more likely that you get a good answer. Please do not post an image of code/data/errors: it cannot be copied or searched (SEO), it breaks screen readers, and it may not fit well on some mobile devices. – rj-nirbhay, May 12 '20 at 10:47

score 1 · Accepted Answer · answered May 12 '20 at 11:00


Here's a quick tidyverse solution using your sample data:

library(tidyverse) # install.packages("tidyverse")

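df1 %>%                                    # 'df1' is the data frame from the question
  count(Genes, levels, name = "total") %>% # patients per gene and expression level
  ungroup() %>%                            # drop the grouping carried over from df1
  mutate(frequency = total / sum(total, na.rm = TRUE)) # share of all counted rows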
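
Note that dividing by sum(total) matches "divide by the number of patients" only when the data hold a single gene, as in the sample. With several genes, one option is to divide by the number of distinct patients instead. This is a minimal sketch, assuming a hypothetical toy data frame `toy` (2 genes, 4 patients) and illustrative names `n_patients` and `percent` that are not from the original post:

```r
library(dplyr)

# Hypothetical toy data: 2 genes x 4 patients (not the asker's real data)
toy <- data.frame(
  Genes    = rep(c("ERLIN2", "HER2"), each = 4),
  Patients = rep(paste0("P", 1:4), times = 2),
  levels   = factor(c("up", "up", "down", "up", "down", "down", "up", "down"),
                    levels = c("down", "up"))
)

# Denominator: number of distinct patients (4 here; 208 in the question)
n_patients <- n_distinct(toy$Patients)

df2 <- toy %>%
  ungroup() %>%                              # no-op here, needed if the input is grouped
  count(Genes, levels, name = "total") %>%   # patients per gene and level
  mutate(
    frequency = total / n_patients,                  # proportion of all patients
    percent   = sprintf("%.2f%%", 100 * frequency)   # formatted as a percentage string
  )

df2
```

Because every patient appears once per gene, the frequencies for each gene add up to 1 under this assumption.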


CourtesyBus


Thanks! It worked perfectly! Could you explain the role of the function 'ungroup()' in the code above? – Huy Nguyen May 12 '20 at 15:15

Sure! count() is a shortcut for group_by() plus summarise(), and because your data frame is already grouped, its result is still grouped. So you need to ungroup() before other steps, like calculating your frequency column. More info here: https://stackoverflow.com/questions/51404252/in-r-dplyr-why-do-i-need-to-ungroup-after-i-count – CourtesyBus May 14 '20 at 05:58
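
To see what ungroup() changes, here is a small sketch. It assumes a hypothetical grouped toy data frame `toy` that mimics the grouped_df class shown in the question's dput() output; the names `counted`, `grouped_freq`, and `ungrouped_freq` are illustrative only:

```r
library(dplyr)

# Hypothetical toy data, grouped like the data frame in the question
toy <- data.frame(
  Genes  = rep("ERLIN2", 5),
  levels = c("up", "up", "down", "up", "down")
) %>%
  group_by(Genes, levels)

counted <- toy %>% count(Genes, levels, name = "total")

# The grouping of the input is carried over to the result of count()
group_vars(counted)

# Without ungroup(): sum(total) is computed within each one-row group,
# so every frequency comes out as 1
grouped_freq <- counted %>%
  mutate(frequency = total / sum(total))

# With ungroup(): sum(total) runs over all rows, giving real proportions
ungrouped_freq <- counted %>%
  ungroup() %>%
  mutate(frequency = total / sum(total))
```

Comparing `grouped_freq` with `ungrouped_freq` shows why the ungroup() step matters when the input is already grouped.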

score 1 · Answer 2 · answered May 12 '20 at 12:38


Maybe you can try a base R option using table():

df1out <- transform(as.data.frame.table(table(df1[-2])),
    total = Freq,
    Freq = Freq / nrow(df1)
)
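
Dividing by nrow(df1) gives the share of all rows, which equals the share of patients only when the data contain one gene. For data with several genes, a possible base R variant divides by the number of distinct patients. This is only a sketch on a hypothetical toy data frame `toy`; the names `tab` and `out` are illustrative:

```r
# Hypothetical toy data: 2 genes x 4 patients (not the asker's real data)
toy <- data.frame(
  Genes    = rep(c("ERLIN2", "HER2"), each = 4),
  Patients = rep(paste0("P", 1:4), times = 2),
  levels   = factor(c("up", "up", "down", "up", "down", "down", "up", "down"),
                    levels = c("down", "up"))
)

# Contingency table of gene by expression level
tab <- table(toy$Genes, toy$levels, dnn = c("Genes", "levels"))

# Long format with the counts stored in a column named 'total'
out <- as.data.frame(tab, responseName = "total")

# Frequency relative to the number of distinct patients
out$frequency <- out$total / length(unique(toy$Patients))

out
```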


ThomasIsCoding

