1

I have an enormous Genetic expression dataset, 200k variables (rows) and 170 obs (Columns). Below is the first couple rows/columns

    Gene    Transcript_ID   V1  V2  V3  V4  V5
1   ENSG00000000003.14  ENST00000612152.4   0   6   0   3   15
2   ENSG00000000003.14  ENST00000373020.8   4   0   5   0   0
3   ENSG00000000003.14  ENST00000614008.4   0   0   0   0   0
4   ENSG00000000003.14  ENST00000496771.5   0   3   0   0   7

I'm trying to group all of the data for expression by genes. I'm utilising an existing syntax for grouping a single data column by some meta-data (gene ID) and attempting to get it to run for all 170 obs. the syntax is below and should be a very simple fix.

transcript_grouped <-aggregate(res$V1, by=list(Category=res$Gene), FUN=sum)

V1 is the column name or observation/data column, Res is the whole data set, gene being the category by which I want the data grouped. This syntax works successfully for V1, but I need this to run for all columns.

I've tried creating a variable for all the column names and even pasting them in manually.

dataColumns<- dataColumns = c("V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","V29","V30","V31","V32","V33","V34","V35","V36","V37","V38","V39","V40","V41","V42","V43","V44","V45","V46","V47","V48","V49","V50","V51","V52","V53","V54","V55","V56","V57","V58","V59","V60","V61","V62","V63","V64","V65","V66","V67","V68","V69","V70","V71","V72","V73","V74","V75","V76","V77","V78","V79","V80","V81","V82","V83","V84","V85","V86","V87","V88","V89","V90","V91","V92","V93","V94","V95","V96","V97","V98","V99","V100","V101","V102","V103","V104","V105","V106","V107","V108","V109","V110","V111","V112","V113","V114","V115","V116","V117","V118","V119","V120","V121","V122","V123","V124","V125","V126","V127","V128","V129","V130","V131","V132","V133","V134","V135","V136","V137","V138","V139","V140","V141","V142","V143","V144","V145","V146","V147","V148","V149","V150","V151","V152","V153","V154","V155","V156","V157","V158","V159","V160","V161","V162","V163","V164","V165","V166") 

trans_grouped <-aggregate(res$dataColumns, by=list(Category=res$Gene), FUN=sum)

Error in aggregate.data.frame(as.data.frame(x), ...) : no rows to aggregate

How do I loop this to include all the columns please?

user438383
  • 5,716
  • 8
  • 28
  • 43
Jed Lye
  • 43
  • 5
  • How about formula style: `aggregate(. ~ Gene, Res, FUN=sum)`? See this [SO post](https://stackoverflow.com/questions/16844613/aggregate-methods-treat-missing-values-na-differently) to handle NAs (last answer shows formula style). – Parfait Jul 18 '21 at 12:49

2 Answers2

3

How about this dplyr solution:

library(dplyr)
df %>%
  group_by(Gene) %>%
  summarise(across(starts_with("V"), ~sum(.)))
# A tibble: 2 x 4
  Gene     V1    V2    V3
* <chr> <dbl> <dbl> <dbl>
1 A         4     4     7
2 B         6     4     3

Test data:

df <- data.frame(
  Gene = c("A", "B", "A", "B"),
  V1 = c(1,2,3,4),
  V2 = c(2,2,2,2),
  V3 = c(4,2,3,1)
)
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
  • 1
    `~ sum(.)` could be written as just `sum` . – G. Grothendieck Jul 18 '21 at 13:27
  • So following this aggregation step I then want to calculate the proportion of each transcript for each gene. So I will end with an output table which is Gene ID, Transcript ID, proportion value for v1, proportion value for v2, As such, dropping the transcript ID column means I'll be stuck afterwards. If you have any solutions for how to get the whole table in this format Id be really grateful. – – Jed Lye Jul 19 '21 at 09:55
3

With aggregate: it works if we remove the second column:

aggregate(. ~ Gene, df[-2], FUN=sum)

Output:

                Gene V1 V2 V3 V4 V5
1 ENSG00000000003.14  4  9  5  3 22

OR

We could use summarise with across from dplyr package: Credits to Chris Ruehlemann his answer was 3 minutes earlier!!!

df %>% 
  group_by(Gene) %>% 
  summarise(across(starts_with('V'), sum))

Output:

 Gene                  V1    V2    V3    V4    V5
  <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000000003.14     4     9     5     3    22

data:

df <- structure(list(Gene = c("ENSG00000000003.14", "ENSG00000000003.14", 
"ENSG00000000003.14", "ENSG00000000003.14"), Transcript_ID = c("ENST00000612152.4", 
"ENST00000373020.8", "ENST00000614008.4", "ENST00000496771.5"
), V1 = c(0, 4, 0, 0), V2 = c(6, 0, 0, 3), V3 = c(0, 5, 0, 0), 
V4 = c(3, 0, 0, 0), V5 = c(15, 0, 0, 7)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), spec = structure(list(
cols = list(Gene = structure(list(), class = c("collector_character", 
"collector")), Transcript_ID = structure(list(), class = c("collector_character", 
"collector")), V1 = structure(list(), class = c("collector_double", 
"collector")), V2 = structure(list(), class = c("collector_double", 
"collector")), V3 = structure(list(), class = c("collector_double", 
"collector")), V4 = structure(list(), class = c("collector_double", 
"collector")), V5 = structure(list(), class = c("collector_double", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1L), class = "col_spec"))
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • 1
    This modified version of `aggregate` is great, I didn't have much experience in this. – Anoushiravan R Jul 18 '21 at 13:12
  • @TarJae So following this aggregation step I then want to calculate the proportion of each transcript for each gene. So I will an output table which is Gene ID, Transcript ID, proportion value for v1, proportion value for v2, As such, dropping the transcript ID column means I'll be stuck afterwards. If you have any solutions for how to get the whole table in this format Id be really grateful. – Jed Lye Jul 19 '21 at 09:53
  • How would be your proportion formula? – TarJae Jul 19 '21 at 11:40
  • Something like transcript proportion= ENSTreads/ENSG reads for example if transcript reads were 6, total gene reads were 60 6/60= .1 – Jed Lye Jul 19 '21 at 14:16