How to summarize multiple columns at once?

Question

I am trying to create a table with the Sample ID, raw counts, and Gene names.

In this table, a new row is created for each Sample Id to correspond to each gene name:

Sample ID	Gene A	Gene B
Sample 1	1	-
Sample 1	-	2
Sample 2	3	-
Sample 2	-	4

Rather than having numerous rows, I would like to condense them into a single row:

Sample ID	Gene A	Gene B
Sample 1	1	2
Sample 2	3	4

This is the code I have so far:

dfwide = data.wide.df %>% group_by(SampleId) %>%
summarise(Sample 1 = sum(Sample 1, na.rm = T),
Sample 2 = sum(Sample 2, na.rm = T))

I have over 1000 samples, so I was hoping to find a way to summarise all the genes at once. Any help would be appreciated!

What's actaually in your data. Do you have NA values or are there actually "-" characters? It's easier to help if you share your data in [reproducible format](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) such as a `dput()` to it's clear how values are stored. Are there always two rows per sample? What if there are more? Are you always taking the sum rather than just the last value? — MrFlick, Aug 23 '21 at 17:07

r2evans · Accepted Answer · 2021-08-23T17:12:31.733

If you are always guaranteed to have the same number of Gene A as Gene B, then this might work:

library(dplyr)
dat %>%
  group_by(Sample.ID) %>%
  summarize(across(starts_with("Gene"), ~ .[. != "-"]))
# # A tibble: 2 x 3
#   Sample.ID Gene.A Gene.B
#   <chr>     <chr>  <chr> 
# 1 Sample 1  1      2     
# 2 Sample 2  3      4

I've assumed that you have literal "-" strings; if they are NA or empty "", then that conditional can be modified to account for that.

The risk here is if there are uneven numbers of genes. For example, if the data were instead

dat2
#   Sample.ID Gene.A Gene.B
# 1  Sample 1      1      -
# 2  Sample 1      -      2
# 5  Sample 1      -      3
# 3  Sample 2      3      -
# 4  Sample 2      -      4

dat2 %>%
  group_by(Sample.ID) %>%
  summarize(across(starts_with("Gene"), ~ .[. != "-"]))
# # A tibble: 3 x 3
# # Groups:   Sample.ID [2]
#   Sample.ID Gene.A Gene.B
#   <chr>     <chr>  <chr> 
# 1 Sample 1  1      2     
# 2 Sample 1  1      3     
# 3 Sample 2  3      4

You see how the 1 is repeated in multiple rows; this worked this time without error because of R's "recycling": since the number of valid strings in Gene.B is a perfect multiple of the number of valid strings in Gene.A, there is no complaint and the values are repeated. I consider recycling here to likely be improper, so perhaps not what you need.

If this is the case, it might be more appropriate to store this in "long" format:

dat %>%
  tidyr::pivot_longer(-Sample.ID, names_to = "Gene", names_pattern = "Gene\\.(.*)", values_to = "Value") %>%
  filter(Value != "-")
# # A tibble: 4 x 3
#   Sample.ID Gene  Value
#   <chr>     <chr> <chr>
# 1 Sample 1  A     1    
# 2 Sample 1  B     2    
# 3 Sample 2  A     3    
# 4 Sample 2  B     4    
dat2 %>%
  tidyr::pivot_longer(-Sample.ID, names_to = "Gene", names_pattern = "Gene\\.(.*)", values_to = "Value") %>%
  filter(Value != "-")
# # A tibble: 5 x 3
#   Sample.ID Gene  Value
#   <chr>     <chr> <chr>
# 1 Sample 1  A     1    
# 2 Sample 1  B     2    
# 3 Sample 1  B     3    
# 4 Sample 2  A     3    
# 5 Sample 2  B     4

which will likely require you to refactor downstream processing, but at least it is safe.

Data:

dat <- structure(list(Sample.ID = c("Sample 1", "Sample 1", "Sample 2", "Sample 2"), Gene.A = c("1", "-", "3", "-"), Gene.B = c("-", "2", "-", "4")), class = "data.frame", row.names = c(NA, -4L))
dat2 <- structure(list(Sample.ID = c("Sample 1", "Sample 1", "Sample 1", "Sample 2", "Sample 2"), Gene.A = c("1", "-", "-", "3", "-"), Gene.B = c("-", "2", "3", "-", "4")), row.names = c(1L, 2L, 5L, 3L, 4L), class = "data.frame")

Hi, thank you so much for your help! I followed your approach and applied `datawide1 = data.wide.df %>% group_by(SampleId) %>% summarise(across(`14-3-3`:YES, ~ sum(.x, na.rm = TRUE)))` since my genes have different names. — skabir, Aug 23 '21 at 17:34

How to summarize multiple columns at once?

1 Answers1